Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-09 Thread Dave Chinner
On Fri, May 08, 2015 at 11:02:28PM -0400, Rik van Riel wrote:
> On 05/08/2015 09:14 PM, Linus Torvalds wrote:
> > On Fri, May 8, 2015 at 9:59 AM, Rik van Riel  wrote:
> >>
> >> However, for persistent memory, all of the files will be "in memory".
> > 
> > Yes. However, I doubt you will find a very sane rw filesystem that
> > then also makes them contiguous and aligns them at 2MB boundaries.
> > 
> > Anything is possible, I guess, but things like that are *hard*. The
> > fragmentation issues etc cause it to a really challenging thing.
> 
> The TLB performance bonus of accessing the large files with
> large pages may make it worthwhile to solve that hard problem.

FWIW, for DAX ththe filesystem allocation side is already mostly
solved - this is just an allocation alignment hint, analogous to
RAID stripe alignment.  We don't need to reinvent the wheel here.
i.e. On XFS, use a 2MB stripe unit for the fs, a 2MB extent size
hint for files you want to use large pages on and you'll get 2MB
sized and aligned allocations from the filesystem for as long as
there are such freespace regions available.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-09 Thread Dave Chinner
On Fri, May 08, 2015 at 11:02:28PM -0400, Rik van Riel wrote:
 On 05/08/2015 09:14 PM, Linus Torvalds wrote:
  On Fri, May 8, 2015 at 9:59 AM, Rik van Riel r...@redhat.com wrote:
 
  However, for persistent memory, all of the files will be in memory.
  
  Yes. However, I doubt you will find a very sane rw filesystem that
  then also makes them contiguous and aligns them at 2MB boundaries.
  
  Anything is possible, I guess, but things like that are *hard*. The
  fragmentation issues etc cause it to a really challenging thing.
 
 The TLB performance bonus of accessing the large files with
 large pages may make it worthwhile to solve that hard problem.

FWIW, for DAX ththe filesystem allocation side is already mostly
solved - this is just an allocation alignment hint, analogous to
RAID stripe alignment.  We don't need to reinvent the wheel here.
i.e. On XFS, use a 2MB stripe unit for the fs, a 2MB extent size
hint for files you want to use large pages on and you'll get 2MB
sized and aligned allocations from the filesystem for as long as
there are such freespace regions available.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Linus Torvalds
On Fri, May 8, 2015 at 8:02 PM, Rik van Riel  wrote:
>
> The TLB performance bonus of accessing the large files with
> large pages may make it worthwhile to solve that hard problem.

Very few people can actually measure that TLB advantage on systems
with good TLB's.

It's largely a myth, fed by some truly crappy TLB fill systems
(particularly sw-filled TLB's on some early RISC CPU's, but even
"modern" CPU's sometimes have glass jaws here because they cant'
prefetch TLB entries or do concurrent page table walks etc).

There are *very* few loads that actually have the kinds of access
patterns where TLB accesses dominate - or are even noticeable -
compared to the normal memory access costs.

That is doubly true with file-backed storage. The main reason you get
TLB costs to be noticeable is with very sparse access patterns, where
you hit as many TLB entries as you hit pages. That simply doesn't
happen with file mappings.

Really. The whole thing about TLB advantages of hugepages is this
almost entirely made-up stupid myth. You almost have to make up the
benchmark for it (_that_ part is easy) to even see it.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/08/2015 09:14 PM, Linus Torvalds wrote:
> On Fri, May 8, 2015 at 9:59 AM, Rik van Riel  wrote:
>>
>> However, for persistent memory, all of the files will be "in memory".
> 
> Yes. However, I doubt you will find a very sane rw filesystem that
> then also makes them contiguous and aligns them at 2MB boundaries.
> 
> Anything is possible, I guess, but things like that are *hard*. The
> fragmentation issues etc cause it to a really challenging thing.

The TLB performance bonus of accessing the large files with
large pages may make it worthwhile to solve that hard problem.

> And if they aren't aligned big contiguous allocations, then they
> aren't relevant from any largepage cases. You'll still have to map
> them 4k at a time etc.

Absolutely, but we only need the 4k struct pages when the
files are mapped. I suspect a lot of the files will just
sit around idle, without being used.

I am not convinced that the idea I wrote down earlier in
this thread is worthwhile now, but it may turn out to be
at some point in the future. It all depends on how much
data people store on DAX filesystems, and how many files
they have open at once.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Linus Torvalds
On Fri, May 8, 2015 at 9:59 AM, Rik van Riel  wrote:
>
> However, for persistent memory, all of the files will be "in memory".

Yes. However, I doubt you will find a very sane rw filesystem that
then also makes them contiguous and aligns them at 2MB boundaries.

Anything is possible, I guess, but things like that are *hard*. The
fragmentation issues etc cause it to a really challenging thing.

And if they aren't aligned big contiguous allocations, then they
aren't relevant from any largepage cases. You'll still have to map
them 4k at a time etc.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
> "Linus" == Linus Torvalds  writes:

Linus> On Fri, May 8, 2015 at 7:40 AM, John Stoffel  wrote:
>> 
>> Now go and look at your /home or /data/ or /work areas, where the
>> endusers are actually keeping their day to day work.  Photos, mp3,
>> design files, source code, object code littered around, etc.

Linus> However, the big files in that list are almost immaterial from a
Linus> caching standpoint.

Linus> Caching source code is a big deal - just try not doing it and
Linus> you'll figure it out. And the kernel C source files used to
Linus> have a median size around 4k.

Caching any files is a big deal, and if I'm doing batch edits of large
jpegs, won't they get cached as well?   

Linus> The big files in your home directory? Let me make an educated
Linus> guess.  Very few to *none* of them are actually in your page
Linus> cache right now.  And you'd never even care if they ever made
Linus> it into your page cache *at*all*. Much less whether you could
Linus> ever cache them using large pages using some very fancy cache.

Hmm... probably not honestly, since I'm not a home and not using the
system actively right now.  But I can see situations where being able
to mix different page sizes efficiently might be a good thing.  

Linus> There are big files that care about caches, but they tend to be
Linus> binaries, and for other reasons (things like randomization) you
Linus> would never want to use largepages for those anyway.

Or large design files, like my users at $WORK use, which can be 4Gb in
size for a large design, which is ASIC chip layout work.  So I'm a
little bit in the minority there.  

And yes I do have other users will millions of itty bitty files as
well.  

Linus> So from a page cache standpoint, I think the 4kB size still
Linus> matters. A *lot*. largepages are a complete red herring, and
Linus> will continue to be so pretty much forever (anonymous
Linus> largepages perhaps less so).

I think in the future, being able to efficiently mix page sizes will
become useful, if only to lower the memory overhead of keeping track
of large numbers of pages. 

John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/08/2015 11:54 AM, Linus Torvalds wrote:
> On Fri, May 8, 2015 at 7:40 AM, John Stoffel  wrote:
>>
>> Now go and look at your /home or /data/ or /work areas, where the
>> endusers are actually keeping their day to day work.  Photos, mp3,
>> design files, source code, object code littered around, etc.
> 
> However, the big files in that list are almost immaterial from a
> caching standpoint.

> The big files in your home directory? Let me make an educated guess.
> Very few to *none* of them are actually in your page cache right now.
> And you'd never even care if they ever made it into your page cache
> *at*all*. Much less whether you could ever cache them using large
> pages using some very fancy cache.

However, for persistent memory, all of the files will be "in memory".

Not instantiating the 4kB struct pages for 2MB areas that are not
currently being accessed with small files may make a difference.
For dynamically allocated 4kB page structs, we need some way to
discover where they are. It may make sense, from a simplicity point
of view, to have one mechanism that works both for pmem and for
normal system memory.

I agree that 4kB granularity needs to continue to work pretty much
forever, though. As long as people continue creating text files,
they will just not be very large.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Al Viro
On Fri, May 08, 2015 at 08:54:06AM -0700, Linus Torvalds wrote:
> However, the big files in that list are almost immaterial from a
> caching standpoint.

.git/objects/pack/* caching matters a lot, though...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Linus Torvalds
On Fri, May 8, 2015 at 7:40 AM, John Stoffel  wrote:
>
> Now go and look at your /home or /data/ or /work areas, where the
> endusers are actually keeping their day to day work.  Photos, mp3,
> design files, source code, object code littered around, etc.

However, the big files in that list are almost immaterial from a
caching standpoint.

Caching source code is a big deal - just try not doing it and you'll
figure it out. And the kernel C source files used to have a median
size around 4k.

The big files in your home directory? Let me make an educated guess.
Very few to *none* of them are actually in your page cache right now.
And you'd never even care if they ever made it into your page cache
*at*all*. Much less whether you could ever cache them using large
pages using some very fancy cache.

There are big files that care about caches, but they tend to be
binaries, and for other reasons (things like randomization) you would
never want to use largepages for those anyway.

So from a page cache standpoint, I think the 4kB size still matters. A
*lot*. largepages are a complete red herring, and will continue to be
so pretty much forever (anonymous largepages perhaps less so).

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/08/2015 10:05 AM, Ingo Molnar wrote:
> * Rik van Riel  wrote:

>> Memory trends point in one direction, file size trends in another.
>>
>> For persistent memory, we would not need 4kB page struct pages 
>> unless memory from a particular area was in small files AND those 
>> files were being actively accessed. [...]
> 
> Average file size on my system's /usr is 12.5K:
> 
> triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") | sed 's/ 
> /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n" | wc -l; ) | bc
> 12502
> 
>> [...] Large files (mapped in 2MB chunks) or inactive small files 
>> would not need the 4kB page structs around.
> 
> ... they are the utter uncommon case. 4K is here to stay, and for a 
> very long time - until humans use computers I suspect.

There's a bit of an 80/20 thing going on, though.

The average file size may be small, but most data is used by
large files.

Additionally, a 2MB pmem area that has no small files on it that
are currently open will also not need 4kB page structs.

A system with 2TB of pmem might still only have a few thousand
small files open at any point in time. The rest of the memory
is either in large files, or in small files that have not been
opened recently. We can reclaim the struct pages of 4kB pages
that are not currently in use.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
> "Ingo" == Ingo Molnar  writes:

Ingo> * Rik van Riel  wrote:

>> The disadvantage is pretty obvious too: 4kB pages would no longer be 
>> the fast case, with an indirection. I do not know how much of an 
>> issue that would be, or whether it even makes sense for 4kB pages to 
>> continue being the fast case going forward.

Ingo> I strongly disagree that 4kB does not matter as much: it is _the_ 
Ingo> bread and butter of 99% of Linux usecases. 4kB isn't going away 
Ingo> anytime soon - THP might look nice in benchmarks, but it does not 
Ingo> matter nearly as much in practice and for filesystems and IO it's 
Ingo> absolutely crazy to think about 2MB granularity.

Ingo> Having said that, I don't think a single jump of indirection is a big 
Ingo> issue - except for the present case where all the pmem IO space is 
Ingo> mapped non-cacheable. Write-through caching patches are in the works 
Ingo> though, and that should make it plenty fast.

>> Memory trends point in one direction, file size trends in another.
>> 
>> For persistent memory, we would not need 4kB page struct pages 
>> unless memory from a particular area was in small files AND those 
>> files were being actively accessed. [...]

Ingo> Average file size on my system's /usr is 12.5K:

Ingo> triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") |
Ingo> sed 's/ /+/g' | bc); echo -n "/"; find . -type f -printf "%s\n"
Ingo> | wc -l; ) | bc 12502

Now go and look at your /home or /data/ or /work areas, where the
endusers are actually keeping their day to day work.  Photos, mp3,
design files, source code, object code littered around, etc.

Now I also have 12Tb filesystems with 30+ million files in them, which
just *suck* for backup, esp incrementals.  I have one monster with 85+
million files (time to get beat on users again ...) which needs to be
pruned.

So I'm not arguing against you, I'm just saying you need better more
representative numbers across more day to day work.  Running this
exact same command against my home directory gets:

528989

So I'm not arguing one way or another... just providing numbers.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Ingo Molnar

* Rik van Riel  wrote:

> The disadvantage is pretty obvious too: 4kB pages would no longer be 
> the fast case, with an indirection. I do not know how much of an 
> issue that would be, or whether it even makes sense for 4kB pages to 
> continue being the fast case going forward.

I strongly disagree that 4kB does not matter as much: it is _the_ 
bread and butter of 99% of Linux usecases. 4kB isn't going away 
anytime soon - THP might look nice in benchmarks, but it does not 
matter nearly as much in practice and for filesystems and IO it's 
absolutely crazy to think about 2MB granularity.

Having said that, I don't think a single jump of indirection is a big 
issue - except for the present case where all the pmem IO space is 
mapped non-cacheable. Write-through caching patches are in the works 
though, and that should make it plenty fast.

> Memory trends point in one direction, file size trends in another.
> 
> For persistent memory, we would not need 4kB page struct pages 
> unless memory from a particular area was in small files AND those 
> files were being actively accessed. [...]

Average file size on my system's /usr is 12.5K:

triton:/usr> ( echo -n $(echo $(find . -type f -printf "%s\n") | sed 's/ /+/g' 
| bc); echo -n "/"; find . -type f -printf "%s\n" | wc -l; ) | bc
12502

> [...] Large files (mapped in 2MB chunks) or inactive small files 
> would not need the 4kB page structs around.

... they are the utter uncommon case. 4K is here to stay, and for a 
very long time - until humans use computers I suspect.

But I don't think the 2MB metadata chunking is wrong per se.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/07/2015 03:11 PM, Ingo Molnar wrote:

> Stable, global page-struct descriptors are a given for real RAM, where 
> we allocate a struct page for every page in nice, large, mostly linear 
> arrays.
> 
> We'd really need that for pmem too, to get the full power of struct 
> page: and that means allocating them in nice, large, predictable 
> places - such as on the device itself ...
> 
> It might even be 'scattered' across the device, with 64 byte struct 
> page size we can pack 64 descriptors into a single page, so every 65 
> pages we could have a page-struct page.
> 
> Finding a pmem page's struct page would thus involve rounding it 
> modulo 65 and reading that page.
> 
> The problem with that is fourfold:
> 
>  - that we now turn a very kernel internal API and data structure into 
>an ABI. If struct page grows beyond 64 bytes it's a problem.
> 
>  - on bootup (or device discovery time) we'd have to initialize all 
>the page structs. We could probably do this in a hierarchical way, 
>by dividing continuous pmem ranges into power-of-two groups of 
>blocks, and organizing them like the buddy allocator does.
> 
>  - 1.5% of storage space lost.
> 
>  - will wear-leveling properly migrate these 'hot' pages around?

MST and I have been doing some thinking about how to address some of
the issues above.

One way could be to invert the PG_compound logic we have today, by
allocating one struct page for every PMD / THP sized area (2MB on
x86), and dynamically allocating struct pages for the 4kB pages
inside only if the area gets split. They can be freed again when
the area is not being accessed in 4kB chunks.

That way we would always look at the struct page for the 2MB area
first, and if the PG_split bit is set, we look at the array of
dynamically allocated struct pages for this area.

The advantages are obvious: boot time memory overhead and
initialization time are reduced by a factor 512. CPUs could also
take a whole 2MB area in order to do CPU-local 4kB allocations,
defragmentation policies may become a little clearer, etc...

The disadvantage is pretty obvious too: 4kB pages would no longer
be the fast case, with an indirection. I do not know how much of
an issue that would be, or whether it even makes sense for 4kB
pages to continue being the fast case going forward.

Memory trends point in one direction, file size trends in another.

For persistent memory, we would not need 4kB page struct pages unless
memory from a particular area was in small files AND those files were
being actively accessed. Large files (mapped in 2MB chunks) or inactive
small files would not need the 4kB page structs around.

-- 
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Al Viro
On Fri, May 08, 2015 at 11:26:01AM +0200, Ingo Molnar wrote:
> 
> * Al Viro  wrote:
> 
> > On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
> > 
> > > So if code does iov_iter_get_pages_alloc() on a user address that 
> > > has a real struct page behind it - and some other code does a 
> > > regular get_user_pages() on it, we'll have two sets of struct page 
> > > descriptors, the 'real' one, and a fake allocated one, right?
> > 
> > Huh?  iov_iter_get_pages() is given an array of pointers to struct 
> > page, which it fills with what it finds.  iov_iter_get_pages_alloc() 
> > *allocates* such an array, fills that with what it finds and gives 
> > the allocated array to caller.
> > 
> > We are not allocating any struct page instances in either of those.
> 
> Ah, stupid me - thanks for the explanation!

My fault, actually - this "pages array" should've been either
"'pages' array" or "array of pointers to struct page".
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Ingo Molnar

* Al Viro  wrote:

> On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
> 
> > So if code does iov_iter_get_pages_alloc() on a user address that 
> > has a real struct page behind it - and some other code does a 
> > regular get_user_pages() on it, we'll have two sets of struct page 
> > descriptors, the 'real' one, and a fake allocated one, right?
> 
> Huh?  iov_iter_get_pages() is given an array of pointers to struct 
> page, which it fills with what it finds.  iov_iter_get_pages_alloc() 
> *allocates* such an array, fills that with what it finds and gives 
> the allocated array to caller.
> 
> We are not allocating any struct page instances in either of those.

Ah, stupid me - thanks for the explanation!

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Al Viro
On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:

> same as iov_iter_get_pages(), except that pages array is allocated
> (kmalloc if possible, vmalloc if that fails) and left for caller to
> free.  Lustre and NFS ->direct_IO() switched to it.
> 
> Signed-off-by: Al Viro 
> 
> So if code does iov_iter_get_pages_alloc() on a user address that has 
> a real struct page behind it - and some other code does a regular 
> get_user_pages() on it, we'll have two sets of struct page 
> descriptors, the 'real' one, and a fake allocated one, right?

Huh?  iov_iter_get_pages() is given an array of pointers to struct page,
which it fills with what it finds.  iov_iter_get_pages_alloc() *allocates*
such an array, fills that with what it finds and gives the allocated array
to caller.

We are not allocating any struct page instances in either of those.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Al Viro
On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:

 same as iov_iter_get_pages(), except that pages array is allocated
 (kmalloc if possible, vmalloc if that fails) and left for caller to
 free.  Lustre and NFS -direct_IO() switched to it.
 
 Signed-off-by: Al Viro v...@zeniv.linux.org.uk
 
 So if code does iov_iter_get_pages_alloc() on a user address that has 
 a real struct page behind it - and some other code does a regular 
 get_user_pages() on it, we'll have two sets of struct page 
 descriptors, the 'real' one, and a fake allocated one, right?

Huh?  iov_iter_get_pages() is given an array of pointers to struct page,
which it fills with what it finds.  iov_iter_get_pages_alloc() *allocates*
such an array, fills that with what it finds and gives the allocated array
to caller.

We are not allocating any struct page instances in either of those.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Ingo Molnar

* Al Viro v...@zeniv.linux.org.uk wrote:

 On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
 
  So if code does iov_iter_get_pages_alloc() on a user address that 
  has a real struct page behind it - and some other code does a 
  regular get_user_pages() on it, we'll have two sets of struct page 
  descriptors, the 'real' one, and a fake allocated one, right?
 
 Huh?  iov_iter_get_pages() is given an array of pointers to struct 
 page, which it fills with what it finds.  iov_iter_get_pages_alloc() 
 *allocates* such an array, fills that with what it finds and gives 
 the allocated array to caller.
 
 We are not allocating any struct page instances in either of those.

Ah, stupid me - thanks for the explanation!

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Al Viro
On Fri, May 08, 2015 at 11:26:01AM +0200, Ingo Molnar wrote:
 
 * Al Viro v...@zeniv.linux.org.uk wrote:
 
  On Fri, May 08, 2015 at 07:37:59AM +0200, Ingo Molnar wrote:
  
   So if code does iov_iter_get_pages_alloc() on a user address that 
   has a real struct page behind it - and some other code does a 
   regular get_user_pages() on it, we'll have two sets of struct page 
   descriptors, the 'real' one, and a fake allocated one, right?
  
  Huh?  iov_iter_get_pages() is given an array of pointers to struct 
  page, which it fills with what it finds.  iov_iter_get_pages_alloc() 
  *allocates* such an array, fills that with what it finds and gives 
  the allocated array to caller.
  
  We are not allocating any struct page instances in either of those.
 
 Ah, stupid me - thanks for the explanation!

My fault, actually - this pages array should've been either
'pages' array or array of pointers to struct page.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/08/2015 10:05 AM, Ingo Molnar wrote:
 * Rik van Riel r...@redhat.com wrote:

 Memory trends point in one direction, file size trends in another.

 For persistent memory, we would not need 4kB page struct pages 
 unless memory from a particular area was in small files AND those 
 files were being actively accessed. [...]
 
 Average file size on my system's /usr is 12.5K:
 
 triton:/usr ( echo -n $(echo $(find . -type f -printf %s\n) | sed 's/ 
 /+/g' | bc); echo -n /; find . -type f -printf %s\n | wc -l; ) | bc
 12502
 
 [...] Large files (mapped in 2MB chunks) or inactive small files 
 would not need the 4kB page structs around.
 
 ... they are the utter uncommon case. 4K is here to stay, and for a 
 very long time - until humans use computers I suspect.

There's a bit of an 80/20 thing going on, though.

The average file size may be small, but most data is used by
large files.

Additionally, a 2MB pmem area that has no small files on it that
are currently open will also not need 4kB page structs.

A system with 2TB of pmem might still only have a few thousand
small files open at any point in time. The rest of the memory
is either in large files, or in small files that have not been
opened recently. We can reclaim the struct pages of 4kB pages
that are not currently in use.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Ingo Molnar

* Rik van Riel r...@redhat.com wrote:

 The disadvantage is pretty obvious too: 4kB pages would no longer be 
 the fast case, with an indirection. I do not know how much of an 
 issue that would be, or whether it even makes sense for 4kB pages to 
 continue being the fast case going forward.

I strongly disagree that 4kB does not matter as much: it is _the_ 
bread and butter of 99% of Linux usecases. 4kB isn't going away 
anytime soon - THP might look nice in benchmarks, but it does not 
matter nearly as much in practice and for filesystems and IO it's 
absolutely crazy to think about 2MB granularity.

Having said that, I don't think a single jump of indirection is a big 
issue - except for the present case where all the pmem IO space is 
mapped non-cacheable. Write-through caching patches are in the works 
though, and that should make it plenty fast.

 Memory trends point in one direction, file size trends in another.
 
 For persistent memory, we would not need 4kB page struct pages 
 unless memory from a particular area was in small files AND those 
 files were being actively accessed. [...]

Average file size on my system's /usr is 12.5K:

triton:/usr ( echo -n $(echo $(find . -type f -printf %s\n) | sed 's/ /+/g' 
| bc); echo -n /; find . -type f -printf %s\n | wc -l; ) | bc
12502

 [...] Large files (mapped in 2MB chunks) or inactive small files 
 would not need the 4kB page structs around.

... they are the utter uncommon case. 4K is here to stay, and for a 
very long time - until humans use computers I suspect.

But I don't think the 2MB metadata chunking is wrong per se.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
 Ingo == Ingo Molnar mi...@kernel.org writes:

Ingo * Rik van Riel r...@redhat.com wrote:

 The disadvantage is pretty obvious too: 4kB pages would no longer be 
 the fast case, with an indirection. I do not know how much of an 
 issue that would be, or whether it even makes sense for 4kB pages to 
 continue being the fast case going forward.

Ingo I strongly disagree that 4kB does not matter as much: it is _the_ 
Ingo bread and butter of 99% of Linux usecases. 4kB isn't going away 
Ingo anytime soon - THP might look nice in benchmarks, but it does not 
Ingo matter nearly as much in practice and for filesystems and IO it's 
Ingo absolutely crazy to think about 2MB granularity.

Ingo Having said that, I don't think a single jump of indirection is a big 
Ingo issue - except for the present case where all the pmem IO space is 
Ingo mapped non-cacheable. Write-through caching patches are in the works 
Ingo though, and that should make it plenty fast.

 Memory trends point in one direction, file size trends in another.
 
 For persistent memory, we would not need 4kB page struct pages 
 unless memory from a particular area was in small files AND those 
 files were being actively accessed. [...]

Ingo Average file size on my system's /usr is 12.5K:

Ingo triton:/usr ( echo -n $(echo $(find . -type f -printf %s\n) |
Ingo sed 's/ /+/g' | bc); echo -n /; find . -type f -printf %s\n
Ingo | wc -l; ) | bc 12502

Now go and look at your /home or /data/ or /work areas, where the
endusers are actually keeping their day to day work.  Photos, mp3,
design files, source code, object code littered around, etc.

Now I also have 12Tb filesystems with 30+ million files in them, which
just *suck* for backup, esp incrementals.  I have one monster with 85+
million files (time to get beat on users again ...) which needs to be
pruned.

So I'm not arguing against you, I'm just saying you need better more
representative numbers across more day to day work.  Running this
exact same command against my home directory gets:

528989

So I'm not arguing one way or another... just providing numbers.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/07/2015 03:11 PM, Ingo Molnar wrote:

 Stable, global page-struct descriptors are a given for real RAM, where 
 we allocate a struct page for every page in nice, large, mostly linear 
 arrays.
 
 We'd really need that for pmem too, to get the full power of struct 
 page: and that means allocating them in nice, large, predictable 
 places - such as on the device itself ...
 
 It might even be 'scattered' across the device, with 64 byte struct 
 page size we can pack 64 descriptors into a single page, so every 65 
 pages we could have a page-struct page.
 
 Finding a pmem page's struct page would thus involve rounding it 
 modulo 65 and reading that page.
 
 The problem with that is fourfold:
 
  - that we now turn a very kernel internal API and data structure into 
an ABI. If struct page grows beyond 64 bytes it's a problem.
 
  - on bootup (or device discovery time) we'd have to initialize all 
the page structs. We could probably do this in a hierarchical way, 
by dividing continuous pmem ranges into power-of-two groups of 
blocks, and organizing them like the buddy allocator does.
 
  - 1.5% of storage space lost.
 
  - will wear-leveling properly migrate these 'hot' pages around?

MST and I have been doing some thinking about how to address some of
the issues above.

One way could be to invert the PG_compound logic we have today, by
allocating one struct page for every PMD / THP sized area (2MB on
x86), and dynamically allocating struct pages for the 4kB pages
inside only if the area gets split. They can be freed again when
the area is not being accessed in 4kB chunks.

That way we would always look at the struct page for the 2MB area
first, and if the PG_split bit is set, we look at the array of
dynamically allocated struct pages for this area.

The advantages are obvious: boot time memory overhead and
initialization time are reduced by a factor 512. CPUs could also
take a whole 2MB area in order to do CPU-local 4kB allocations,
defragmentation policies may become a little clearer, etc...

The disadvantage is pretty obvious too: 4kB pages would no longer
be the fast case, with an indirection. I do not know how much of
an issue that would be, or whether it even makes sense for 4kB
pages to continue being the fast case going forward.

Memory trends point in one direction, file size trends in another.

For persistent memory, we would not need 4kB page struct pages unless
memory from a particular area was in small files AND those files were
being actively accessed. Large files (mapped in 2MB chunks) or inactive
small files would not need the 4kB page structs around.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Al Viro
On Fri, May 08, 2015 at 08:54:06AM -0700, Linus Torvalds wrote:
 However, the big files in that list are almost immaterial from a
 caching standpoint.

.git/objects/pack/* caching matters a lot, though...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/08/2015 11:54 AM, Linus Torvalds wrote:
 On Fri, May 8, 2015 at 7:40 AM, John Stoffel j...@stoffel.org wrote:

 Now go and look at your /home or /data/ or /work areas, where the
 endusers are actually keeping their day to day work.  Photos, mp3,
 design files, source code, object code littered around, etc.
 
 However, the big files in that list are almost immaterial from a
 caching standpoint.

 The big files in your home directory? Let me make an educated guess.
 Very few to *none* of them are actually in your page cache right now.
 And you'd never even care if they ever made it into your page cache
 *at*all*. Much less whether you could ever cache them using large
 pages using some very fancy cache.

However, for persistent memory, all of the files will be in memory.

Not instantiating the 4kB struct pages for 2MB areas that are not
currently being accessed with small files may make a difference.
For dynamically allocated 4kB page structs, we need some way to
discover where they are. It may make sense, from a simplicity point
of view, to have one mechanism that works both for pmem and for
normal system memory.

I agree that 4kB granularity needs to continue to work pretty much
forever, though. As long as people continue creating text files,
they will just not be very large.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Linus Torvalds
On Fri, May 8, 2015 at 7:40 AM, John Stoffel j...@stoffel.org wrote:

 Now go and look at your /home or /data/ or /work areas, where the
 endusers are actually keeping their day to day work.  Photos, mp3,
 design files, source code, object code littered around, etc.

However, the big files in that list are almost immaterial from a
caching standpoint.

Caching source code is a big deal - just try not doing it and you'll
figure it out. And the kernel C source files used to have a median
size around 4k.

The big files in your home directory? Let me make an educated guess.
Very few to *none* of them are actually in your page cache right now.
And you'd never even care if they ever made it into your page cache
*at*all*. Much less whether you could ever cache them using large
pages using some very fancy cache.

There are big files that care about caches, but they tend to be
binaries, and for other reasons (things like randomization) you would
never want to use largepages for those anyway.

So from a page cache standpoint, I think the 4kB size still matters. A
*lot*. largepages are a complete red herring, and will continue to be
so pretty much forever (anonymous largepages perhaps less so).

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread John Stoffel
 Linus == Linus Torvalds torva...@linux-foundation.org writes:

Linus On Fri, May 8, 2015 at 7:40 AM, John Stoffel j...@stoffel.org wrote:
 
 Now go and look at your /home or /data/ or /work areas, where the
 endusers are actually keeping their day to day work.  Photos, mp3,
 design files, source code, object code littered around, etc.

Linus However, the big files in that list are almost immaterial from a
Linus caching standpoint.

Linus Caching source code is a big deal - just try not doing it and
Linus you'll figure it out. And the kernel C source files used to
Linus have a median size around 4k.

Caching any files is a big deal, and if I'm doing batch edits of large
jpegs, won't they get cached as well?   

Linus The big files in your home directory? Let me make an educated
Linus guess.  Very few to *none* of them are actually in your page
Linus cache right now.  And you'd never even care if they ever made
Linus it into your page cache *at*all*. Much less whether you could
Linus ever cache them using large pages using some very fancy cache.

Hmm... probably not honestly, since I'm not a home and not using the
system actively right now.  But I can see situations where being able
to mix different page sizes efficiently might be a good thing.  

Linus There are big files that care about caches, but they tend to be
Linus binaries, and for other reasons (things like randomization) you
Linus would never want to use largepages for those anyway.

Or large design files, like my users at $WORK use, which can be 4Gb in
size for a large design, which is ASIC chip layout work.  So I'm a
little bit in the minority there.  

And yes I do have other users will millions of itty bitty files as
well.  

Linus So from a page cache standpoint, I think the 4kB size still
Linus matters. A *lot*. largepages are a complete red herring, and
Linus will continue to be so pretty much forever (anonymous
Linus largepages perhaps less so).

I think in the future, being able to efficiently mix page sizes will
become useful, if only to lower the memory overhead of keeping track
of large numbers of pages. 

John

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Rik van Riel
On 05/08/2015 09:14 PM, Linus Torvalds wrote:
 On Fri, May 8, 2015 at 9:59 AM, Rik van Riel r...@redhat.com wrote:

 However, for persistent memory, all of the files will be in memory.
 
 Yes. However, I doubt you will find a very sane rw filesystem that
 then also makes them contiguous and aligns them at 2MB boundaries.
 
 Anything is possible, I guess, but things like that are *hard*. The
 fragmentation issues etc cause it to a really challenging thing.

The TLB performance bonus of accessing the large files with
large pages may make it worthwhile to solve that hard problem.

 And if they aren't aligned big contiguous allocations, then they
 aren't relevant from any largepage cases. You'll still have to map
 them 4k at a time etc.

Absolutely, but we only need the 4k struct pages when the
files are mapped. I suspect a lot of the files will just
sit around idle, without being used.

I am not convinced that the idea I wrote down earlier in
this thread is worthwhile now, but it may turn out to be
at some point in the future. It all depends on how much
data people store on DAX filesystems, and how many files
they have open at once.

-- 
All rights reversed
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Linus Torvalds
On Fri, May 8, 2015 at 8:02 PM, Rik van Riel r...@redhat.com wrote:

 The TLB performance bonus of accessing the large files with
 large pages may make it worthwhile to solve that hard problem.

Very few people can actually measure that TLB advantage on systems
with good TLB's.

It's largely a myth, fed by some truly crappy TLB fill systems
(particularly sw-filled TLB's on some early RISC CPU's, but even
modern CPU's sometimes have glass jaws here because they cant'
prefetch TLB entries or do concurrent page table walks etc).

There are *very* few loads that actually have the kinds of access
patterns where TLB accesses dominate - or are even noticeable -
compared to the normal memory access costs.

That is doubly true with file-backed storage. The main reason you get
TLB costs to be noticeable is with very sparse access patterns, where
you hit as many TLB entries as you hit pages. That simply doesn't
happen with file mappings.

Really. The whole thing about TLB advantages of hugepages is this
almost entirely made-up stupid myth. You almost have to make up the
benchmark for it (_that_ part is easy) to even see it.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-08 Thread Linus Torvalds
On Fri, May 8, 2015 at 9:59 AM, Rik van Riel r...@redhat.com wrote:

 However, for persistent memory, all of the files will be in memory.

Yes. However, I doubt you will find a very sane rw filesystem that
then also makes them contiguous and aligns them at 2MB boundaries.

Anything is possible, I guess, but things like that are *hard*. The
fragmentation issues etc cause it to a really challenging thing.

And if they aren't aligned big contiguous allocations, then they
aren't relevant from any largepage cases. You'll still have to map
them 4k at a time etc.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

Al,

I was wondering about the struct page rules of 
iov_iter_get_pages_alloc(), used in various places. There's no 
documentation whatsoever in lib/iov_iter.c, nor in 
include/linux/uio.h, and the changelog that introduced it only says:

 commit 91f79c43d1b54d7154b118860d81b39bad07dfff
 Author: Al Viro 
 Date:   Fri Mar 21 04:58:33 2014 -0400

new helper: iov_iter_get_pages_alloc()

same as iov_iter_get_pages(), except that pages array is allocated
(kmalloc if possible, vmalloc if that fails) and left for caller to
free.  Lustre and NFS ->direct_IO() switched to it.

Signed-off-by: Al Viro 

So if code does iov_iter_get_pages_alloc() on a user address that has 
a real struct page behind it - and some other code does a regular 
get_user_pages() on it, we'll have two sets of struct page 
descriptors, the 'real' one, and a fake allocated one, right?

How does that work? Nobody else can ever discover these fake page 
structs, so they don't really serve any 'real' synchronization purpose 
other than the limited role of IO completion.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Jerome Glisse
On Thu, May 07, 2015 at 09:53:13PM +0200, Ingo Molnar wrote:
> 
> * Ingo Molnar  wrote:
> 
> > > Is handling kernel pagefault on the vmemmap completely out of the 
> > > picture ? So we would carveout a chunck of kernel address space 
> > > for those pfn and use it for vmemmap and handle pagefault on it.
> > 
> > That's pretty clever. The page fault doesn't even have to do remote 
> > TLB shootdown, because it only establishes mappings - so it's pretty 
> > atomic, a bit like the minor vmalloc() area faults we are doing.
> > 
> > Some sort of LRA (least recently allocated) scheme could unmap the 
> > area in chunks if it's beyond a certain size, to keep a limit on 
> > size. Done from the same context and would use remote TLB shootdown.
> > 
> > The only limitation I can see is that such faults would have to be 
> > able to sleep, to do the allocation. So pfn_to_page() could not be 
> > used in arbitrary contexts.
> 
> So another complication would be that we cannot just unmap such pages 
> when we want to recycle them, because the struct page in them might be 
> in use - so all struct page uses would have to refcount the underlying 
> page. We don't really do that today: code just looks up struct pages 
> and assumes they never go away.

I still think this is doable, like i said in another email, i think we
should introduce a special pfn_to_page_dev|pmem|waffle|somethingyoulike()
to place that are allowed to allocate the underlying struct page.

For instance we can use a default page to backup all this special vmem
range with some specialy crafted struct page that says that its is
invalid memory (make this mapping read only so all write to this
special struct page is forbidden).

Now once an authorized user comes along and need a real struct page it
trigger a page allocation that replace the page full of fake invalid
struct page with a page with correct valid struct page that can be
manipulated by other part of the kernel.

So regular pfn_to_page() would test against special vmemmap and if
special test the content of struct page for some flag. If it's the
invalid page flag it returns 0.

But once a proper struct page is allocated then pfn_page would return
the struct page as expected.

That way you will catch all invalid user of such page ie user that use
the page after its lifetime is done. You will also limit the creation
of the underlying proper struct page to only code that are legitimate
to ask for a proper struct page for given pfn.

Also you would get kernel write fault on the page full of fake struct
page and that would allow to catch further wrong use.

Anyway this is how i envision this and i think it would work for my
usecase too (GPU it is for me :))

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 10:43 AM, Linus Torvalds
 wrote:
> On Thu, May 7, 2015 at 9:03 AM, Dan Williams  wrote:
>>
>> Ok, I'll keep thinking about this and come back when we have a better
>> story about passing mmap'd persistent memory around in userspace.
>
> Ok. And if we do decide to go with your kind of "__pfn" type, I'd
> probably prefer that we encode the type in the low bits of the word
> rather than compare against PAGE_OFFSET. On some architectures
> PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
> but even on x86 it's a *lot* cheaper to test the low bit than it is to
> compare against a big constant.
>
> We know "struct page *" is supposed to be at least aligned to at least
> "unsigned long", so you'd have two bits of type information (and we
> could easily make it three). With "0" being a real pointer, so that
> you can use the pointer itself without masking.
>
> And the "hide type in low bits of pointer" is something we've done
> quite a lot, so it's more "kernel coding style" anyway.

Ok.  Although __pfn_t also stores pfn values directly which will
consume those 2 bits so we'll need to shift pfns up when storing.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Ingo Molnar  wrote:

> > Is handling kernel pagefault on the vmemmap completely out of the 
> > picture ? So we would carveout a chunck of kernel address space 
> > for those pfn and use it for vmemmap and handle pagefault on it.
> 
> That's pretty clever. The page fault doesn't even have to do remote 
> TLB shootdown, because it only establishes mappings - so it's pretty 
> atomic, a bit like the minor vmalloc() area faults we are doing.
> 
> Some sort of LRA (least recently allocated) scheme could unmap the 
> area in chunks if it's beyond a certain size, to keep a limit on 
> size. Done from the same context and would use remote TLB shootdown.
> 
> The only limitation I can see is that such faults would have to be 
> able to sleep, to do the allocation. So pfn_to_page() could not be 
> used in arbitrary contexts.

So another complication would be that we cannot just unmap such pages 
when we want to recycle them, because the struct page in them might be 
in use - so all struct page uses would have to refcount the underlying 
page. We don't really do that today: code just looks up struct pages 
and assumes they never go away.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Jerome Glisse  wrote:

> > So I think the main value of struct page is if everyone on the 
> > system sees the same struct page for the same pfn - not just the 
> > temporary IO instance.
> > 
> > The idea of having very temporary struct page arrays misses the 
> > point I think: if struct page is used as essentially an IO sglist 
> > then most of the synchronization properties are lost: then we 
> > might as well use the real deal in that case and skip the dynamic 
> > allocation and use pfns directly and avoid the dynamic allocation 
> > overhead.
> > 
> > Stable, global page-struct descriptors are a given for real RAM, 
> > where we allocate a struct page for every page in nice, large, 
> > mostly linear arrays.
> > 
> > We'd really need that for pmem too, to get the full power of 
> > struct page: and that means allocating them in nice, large, 
> > predictable places - such as on the device itself ...
> 
> Is handling kernel pagefault on the vmemmap completely out of the 
> picture ? So we would carveout a chunck of kernel address space for 
> those pfn and use it for vmemmap and handle pagefault on it.

That's pretty clever. The page fault doesn't even have to do remote 
TLB shootdown, because it only establishes mappings - so it's pretty 
atomic, a bit like the minor vmalloc() area faults we are doing.

Some sort of LRA (least recently allocated) scheme could unmap the 
area in chunks if it's beyond a certain size, to keep a limit on size. 
Done from the same context and would use remote TLB shootdown.

The only limitation I can see is that such faults would have to be 
able to sleep, to do the allocation. So pfn_to_page() could not be 
used in arbitrary contexts.

> Again here i think that GPU folks would like a solution where they 
> can have a page struct but it would not be PMEM just device memory. 
> So if we can come up with something generic enough to server both 
> purpose that would be better in my view.

Yes.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 11:40 AM, Ingo Molnar  wrote:
>
> * Dan Williams  wrote:
>
>> On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig  wrote:
>> > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
>> >> What is the primary thing that is driving this need? Do we have a very
>> >> concrete example?
>> >
>> > FYI, I plan to to implement RAID acceleration using nvdimms, and I
>> > plan to ue pages for that.  The code just merge for 4.1 can easily
>> > support page backing, and I plan to use that for now.  This still
>> > leaves support for the gigantic intel nvdimms discovered over EFI
>> > out, but given that I don't have access to them, and I dont know
>> > of any publically available there's little I can do for now.  But
>> > adding on demand allocate struct pages for the seems like the
>> > easiest way forward.  Boaz already has code to allocate pages for
>> > them, although not on demand but at boot / plug in time.
>>
>> Hmmm, the capacities of persistent memory that would be assigned for
>> a raid accelerator would be limited by diminishing returns.  I.e.
>> there seems to be no point to assign more than 8GB or so to the
>> cache? [...]
>
> Why would that be the case?
>
> If it's not a temporary cache but a persistent cache that hosts all
> the data even after writeback completes then going to huge sizes will
> bring similar benefits to using a large, fast SSD disk on your
> desktop... The larger, the better. And it also persists across
> reboots.

True, that's more "dm-cache" than "RAID accelerator", but point taken.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Jerome Glisse
On Thu, May 07, 2015 at 09:11:07PM +0200, Ingo Molnar wrote:
> 
> * Dave Hansen  wrote:
> 
> > On 05/07/2015 10:42 AM, Dan Williams wrote:
> > > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar  wrote:
> > >> * Dan Williams  wrote:
> > >>
> > >> So is there anything fundamentally wrong about creating struct 
> > >> page backing at mmap() time (and making sure aliased mmaps share 
> > >> struct page arrays)?
> > > 
> > > Something like "get_user_pages() triggers memory hotplug for 
> > > persistent memory", so they are actual real struct pages?  Can we 
> > > do memory hotplug at that granularity?
> > 
> > We've traditionally limited them to SECTION_SIZE granularity, which 
> > is 128MB IIRC.  There are also assumptions in places that you can do 
> > page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.
> 
> I really don't think that's very practical: memory hotplug is slow, 
> it's really not on the same abstraction level as mmap(), and the zone 
> data structures are also fundamentally very coarse: not just because 
> RAM ranges are huge, but also so that the pfn->page transformation 
> stays relatively simple and fast.
> 
> > But, in all practicality, a lot of those places are in code like the 
> > buddy allocator.  If your PTEs all have _PAGE_SPECIAL set and we're 
> > not ever expecting these fake 'struct page's to hit these code 
> > paths, it probably doesn't matter.
> > 
> > You can probably get away with just allocating PAGE_SIZE worth of 
> > 'struct page' (which is 64) and mapping it in to vmemmap[].  The 
> > worst case is that you'll eat 1 page of space for each outstanding 
> > page of I/O.  That's a lot better than 2MB of temporary 'struct 
> > page' space per page of I/O that it would take with a traditional 
> > hotplug operation.
> 
> So I think the main value of struct page is if everyone on the system 
> sees the same struct page for the same pfn - not just the temporary IO 
> instance.
> 
> The idea of having very temporary struct page arrays misses the point 
> I think: if struct page is used as essentially an IO sglist then most 
> of the synchronization properties are lost: then we might as well use 
> the real deal in that case and skip the dynamic allocation and use 
> pfns directly and avoid the dynamic allocation overhead.
> 
> Stable, global page-struct descriptors are a given for real RAM, where 
> we allocate a struct page for every page in nice, large, mostly linear 
> arrays.
> 
> We'd really need that for pmem too, to get the full power of struct 
> page: and that means allocating them in nice, large, predictable 
> places - such as on the device itself ...

Is handling kernel pagefault on the vmemmap completely out of the
picture ? So we would carveout a chunck of kernel address space for
those pfn and use it for vmemmap and handle pagefault on it.

Again here i think that GPU folks would like a solution where they can
have a page struct but it would not be PMEM just device memory. So if
we can come up with something generic enough to server both purpose
that would be better in my view.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dave Hansen  wrote:

> On 05/07/2015 10:42 AM, Dan Williams wrote:
> > On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar  wrote:
> >> * Dan Williams  wrote:
> >>
> >> So is there anything fundamentally wrong about creating struct 
> >> page backing at mmap() time (and making sure aliased mmaps share 
> >> struct page arrays)?
> > 
> > Something like "get_user_pages() triggers memory hotplug for 
> > persistent memory", so they are actual real struct pages?  Can we 
> > do memory hotplug at that granularity?
> 
> We've traditionally limited them to SECTION_SIZE granularity, which 
> is 128MB IIRC.  There are also assumptions in places that you can do 
> page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.

I really don't think that's very practical: memory hotplug is slow, 
it's really not on the same abstraction level as mmap(), and the zone 
data structures are also fundamentally very coarse: not just because 
RAM ranges are huge, but also so that the pfn->page transformation 
stays relatively simple and fast.

> But, in all practicality, a lot of those places are in code like the 
> buddy allocator.  If your PTEs all have _PAGE_SPECIAL set and we're 
> not ever expecting these fake 'struct page's to hit these code 
> paths, it probably doesn't matter.
> 
> You can probably get away with just allocating PAGE_SIZE worth of 
> 'struct page' (which is 64) and mapping it in to vmemmap[].  The 
> worst case is that you'll eat 1 page of space for each outstanding 
> page of I/O.  That's a lot better than 2MB of temporary 'struct 
> page' space per page of I/O that it would take with a traditional 
> hotplug operation.

So I think the main value of struct page is if everyone on the system 
sees the same struct page for the same pfn - not just the temporary IO 
instance.

The idea of having very temporary struct page arrays misses the point 
I think: if struct page is used as essentially an IO sglist then most 
of the synchronization properties are lost: then we might as well use 
the real deal in that case and skip the dynamic allocation and use 
pfns directly and avoid the dynamic allocation overhead.

Stable, global page-struct descriptors are a given for real RAM, where 
we allocate a struct page for every page in nice, large, mostly linear 
arrays.

We'd really need that for pmem too, to get the full power of struct 
page: and that means allocating them in nice, large, predictable 
places - such as on the device itself ...

It might even be 'scattered' across the device, with 64 byte struct 
page size we can pack 64 descriptors into a single page, so every 65 
pages we could have a page-struct page.

Finding a pmem page's struct page would thus involve rounding it 
modulo 65 and reading that page.

The problem with that is fourfold:

 - that we now turn a very kernel internal API and data structure into 
   an ABI. If struct page grows beyond 64 bytes it's a problem.

 - on bootup (or device discovery time) we'd have to initialize all 
   the page structs. We could probably do this in a hierarchical way, 
   by dividing continuous pmem ranges into power-of-two groups of 
   blocks, and organizing them like the buddy allocator does.

 - 1.5% of storage space lost.

 - will wear-leveling properly migrate these 'hot' pages around?

The alternative would be some global interval-rbtree of struct page 
backed pmem ranges.

Beyond the synchronization problems of such a data structure (which 
looks like a nightmare) I don't think it's even feasible: especially 
if there's a filesystem on the pmem device then the block allocations 
could be physically fragmented (and there's no fundamental reason why 
they couldn't be fragmented), so a continuous mmap() of a file on it 
will yield wildly fragmented device-pfn ranges, exploding the rbtree. 
Think 1 million node interval-rbtree with an average depth of 20: 
cachemiss country for even simple lookups - not to mention the 
freeing/recycling complexity of unused struct pages to not allow it to 
grow too large.

I might be wrong though about all this :)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams  wrote:

> On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig  wrote:
> > On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
> >> What is the primary thing that is driving this need? Do we have a very
> >> concrete example?
> >
> > FYI, I plan to to implement RAID acceleration using nvdimms, and I 
> > plan to ue pages for that.  The code just merge for 4.1 can easily 
> > support page backing, and I plan to use that for now.  This still 
> > leaves support for the gigantic intel nvdimms discovered over EFI 
> > out, but given that I don't have access to them, and I dont know 
> > of any publically available there's little I can do for now.  But 
> > adding on demand allocate struct pages for the seems like the 
> > easiest way forward.  Boaz already has code to allocate pages for 
> > them, although not on demand but at boot / plug in time.
> 
> Hmmm, the capacities of persistent memory that would be assigned for 
> a raid accelerator would be limited by diminishing returns.  I.e. 
> there seems to be no point to assign more than 8GB or so to the 
> cache? [...]

Why would that be the case?

If it's not a temporary cache but a persistent cache that hosts all 
the data even after writeback completes then going to huge sizes will 
bring similar benefits to using a large, fast SSD disk on your 
desktop... The larger, the better. And it also persists across 
reboots.

It could also host the RAID write intent bitmap (the dirty 
stripes/chunks bitmap) for extra speedups. (This bitmap is pretty 
small, but important to speed up resyncs after crashes or power loss.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dave Hansen
On 05/07/2015 10:42 AM, Dan Williams wrote:
> On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar  wrote:
>> * Dan Williams  wrote:
>> So is there anything fundamentally wrong about creating struct page
>> backing at mmap() time (and making sure aliased mmaps share struct
>> page arrays)?
> 
> Something like "get_user_pages() triggers memory hotplug for
> persistent memory", so they are actual real struct pages?  Can we do
> memory hotplug at that granularity?

We've traditionally limited them to SECTION_SIZE granularity, which is
128MB IIRC.  There are also assumptions in places that you can do page++
within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.

But, in all practicality, a lot of those places are in code like the
buddy allocator.  If your PTEs all have _PAGE_SPECIAL set and we're not
ever expecting these fake 'struct page's to hit these code paths, it
probably doesn't matter.

You can probably get away with just allocating PAGE_SIZE worth of
'struct page' (which is 64) and mapping it in to vmemmap[].  The worst
case is that you'll eat 1 page of space for each outstanding page of
I/O.  That's a lot better than 2MB of temporary 'struct page' space per
page of I/O that it would take with a traditional hotplug operation.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams  wrote:

> > That looks like a layering violation and a mistake to me. If we 
> > want to do direct (sector_t -> sector_t) IO, with no serialization 
> > worries, it should have its own (simple) API - which things like 
> > hierarchical RAID or RDMA APIs could use.
> 
> I'm wrapped around the idea that __pfn_t *is* that simple api for 
> the tiered storage driver use case. [...]

I agree. (see my previous mail)

> [...] For RDMA I think we need struct page because I assume that 
> would be coordinated through a filesystem an truncate() is back in 
> play.

So I don't think RDMA is necessarily special, it's just a weirdly 
programmed DMA request:

 - If it is used internally by an exclusively managed complex storage
   driver, then it can use low level block APIs and pfn_t.

 - If RDMA is exposed all the way to user-space (do we have such 
   APIs?), allowing users to initiate RDMA IO into user buffers, then 
   (the user visible) buffer needs struct page backing. (which in turn 
   will then at some lower level convert to pfns.)

   That's true for both regular RAM pages and mmap()-ed persistent RAM 
   pages as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Linus Torvalds
On Thu, May 7, 2015 at 9:03 AM, Dan Williams  wrote:
>
> Ok, I'll keep thinking about this and come back when we have a better
> story about passing mmap'd persistent memory around in userspace.

Ok. And if we do decide to go with your kind of "__pfn" type, I'd
probably prefer that we encode the type in the low bits of the word
rather than compare against PAGE_OFFSET. On some architectures
PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
but even on x86 it's a *lot* cheaper to test the low bit than it is to
compare against a big constant.

We know "struct page *" is supposed to be at least aligned to at least
"unsigned long", so you'd have two bits of type information (and we
could easily make it three). With "0" being a real pointer, so that
you can use the pointer itself without masking.

And the "hide type in low bits of pointer" is something we've done
quite a lot, so it's more "kernel coding style" anyway.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar  wrote:
>
> * Dan Williams  wrote:
>
>> > Anyway, I did want to say that while I may not be convinced about
>> > the approach, I think the patches themselves don't look horrible.
>> > I actually like your "__pfn_t". So while I (very obviously) have
>> > some doubts about this approach, it may be that the most
>> > convincing argument is just in the code.
>>
>> Ok, I'll keep thinking about this and come back when we have a
>> better story about passing mmap'd persistent memory around in
>> userspace.
>
> So is there anything fundamentally wrong about creating struct page
> backing at mmap() time (and making sure aliased mmaps share struct
> page arrays)?

Something like "get_user_pages() triggers memory hotplug for
persistent memory", so they are actual real struct pages?  Can we do
memory hotplug at that granularity?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams  wrote:

> > Anyway, I did want to say that while I may not be convinced about 
> > the approach, I think the patches themselves don't look horrible. 
> > I actually like your "__pfn_t". So while I (very obviously) have 
> > some doubts about this approach, it may be that the most 
> > convincing argument is just in the code.
> 
> Ok, I'll keep thinking about this and come back when we have a 
> better story about passing mmap'd persistent memory around in 
> userspace.

So is there anything fundamentally wrong about creating struct page 
backing at mmap() time (and making sure aliased mmaps share struct 
page arrays)?

Because if that is done, then the DMA agent won't even know about the 
memory being persistent RAM. It's just a regular struct page, that 
happens to point to persistent RAM. Same goes for all the high level 
VM APIs, futexes, etc. Everything will Just Work.

It will also be relatively fast: mmap() is a relative slowpath, 
comparatively.

As far as RAID is concerned: that's a relatively easy situation, as 
there's only a single user of the devices, the RAID context that 
manages all component devices exclusively. Device to device DMA can 
use the block layer directly, i.e. most of the patches you've got here 
in this series, except:

74287   C May 06 Dan Williams( 232) ├─>[PATCH v2 09/10] dax: convert to 
__pfn_t

I think DAX mmap()s need struct page backing.

I think there's a simple rule: if a page is visible to user-space via 
the MMU then it needs struct page backing. If it's "hidden", like 
behind a RAID abstraction, it probably doesn't.

With the remaining patches a high level RAID driver ought to be able 
to send pfn-to-sector and sector-to-pfn requests to other block 
drivers, without any unnecessary struct page allocation overhead, 
right?

As long as the pfn concept remains a clever way to reuse our 
ram<->sector interfaces to implement sector<->sector IO, in the cases 
where the IO has no serialization or MMU concerns, not using struct 
page and using pfn_t looks natural.

The moment it starts reaching user space APIs, like in the DAX case, 
and especially if it becomes user-MMU visible, it's a mistake to not 
have struct page backing, I think.

(In that sense the current DAX mmap() code is already a partial 
mistake.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Jerome Glisse
On Thu, May 07, 2015 at 06:18:07PM +0200, Christoph Hellwig wrote:
> On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
> > What is the primary thing that is driving this need? Do we have a very
> > concrete example?
> 
> FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
> ue pages for that.  The code just merge for 4.1 can easily support page
> backing, and I plan to use that for now.  This still leaves support
> for the gigantic intel nvdimms discovered over EFI out, but given that
> I don't have access to them, and I dont know of any publically available
> there's little I can do for now.  But adding on demand allocate struct
> pages for the seems like the easiest way forward.  Boaz already has
> code to allocate pages for them, although not on demand but at boot / plug in
> time.

I think here other folks might be interested, i am ccing Paul. But for GPU
we are facing similar issue of trying to present the GPU memory to the kernel
in a coherent way (coherent from the design and linux kernel concept POV).

For this dynamicaly allocated struct page might effectively be a solution that
could be share btw persistent memory and GPU folks. We can even enforce thing
like VMEMMAP and have special region carveout where we can dynamicly map/unmap
backing page for range of device pfn. This would also allow to catch people
trying to access such page, we could add a set of new helper like :
get_page_dev()/put_page_dev() ... and only the _dev version would works on
this new kind of memory, regular get_page()/put_page() would throw error.
This should allow to make sure only legitimate users are referencing such
page.

Issue might be that we can run out of kernel address space with 48bits but if
such monstruous computer ever see the light of day they might consider using
CPU with more bits.

Another issue is that we might care for the 32bits platform too, but that's
solvable at a small cost.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig  wrote:
> On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
>> What is the primary thing that is driving this need? Do we have a very
>> concrete example?
>
> FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
> ue pages for that.  The code just merge for 4.1 can easily support page
> backing, and I plan to use that for now.  This still leaves support
> for the gigantic intel nvdimms discovered over EFI out, but given that
> I don't have access to them, and I dont know of any publically available
> there's little I can do for now.  But adding on demand allocate struct
> pages for the seems like the easiest way forward.  Boaz already has
> code to allocate pages for them, although not on demand but at boot / plug in
> time.

Hmmm, the capacities of persistent memory that would be assigned for a
raid accelerator would be limited by diminishing returns.  I.e. there
seems to be no point to assign more than 8GB or so to the cache?  If
that's the case the capacity argument loses some teeth, just
"blk_get(FMODE_EXCL) + memory_hotplug a small capacity" and be done.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Christoph Hellwig
On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
> What is the primary thing that is driving this need? Do we have a very
> concrete example?

FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
ue pages for that.  The code just merge for 4.1 can easily support page
backing, and I plan to use that for now.  This still leaves support
for the gigantic intel nvdimms discovered over EFI out, but given that
I don't have access to them, and I dont know of any publically available
there's little I can do for now.  But adding on demand allocate struct
pages for the seems like the easiest way forward.  Boaz already has
code to allocate pages for them, although not on demand but at boot / plug in
time.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 8:58 AM, Linus Torvalds
 wrote:
> On Thu, May 7, 2015 at 8:40 AM, Dan Williams  wrote:
>>
>> blkdev_get(FMODE_EXCL) is the protection in this case.
>
> Ugh. That looks like a horrible nasty big hammer that will bite us
> badly some day. Since you'd have to hold it for the whole IO. But I
> guess it at least works.

Oh no, that wouldn't be per-I/O that would be permanent at
configuration set up time just like a raid member device.

Something like:
mdadm --create /dev/md0 --cache=/dev/pmem0p1 --storage=/dev/sda

> Anyway, I did want to say that while I may not be convinced about the
> approach, I think the patches themselves don't look horrible. I
> actually like your "__pfn_t". So while I (very obviously) have some
> doubts about this approach, it may be that the most convincing
> argument is just in the code.

Ok, I'll keep thinking about this and come back when we have a better
story about passing mmap'd persistent memory around in userspace.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Linus Torvalds
On Thu, May 7, 2015 at 8:40 AM, Dan Williams  wrote:
>
> blkdev_get(FMODE_EXCL) is the protection in this case.

Ugh. That looks like a horrible nasty big hammer that will bite us
badly some day. Since you'd have to hold it for the whole IO. But I
guess it at least works.

Anyway, I did want to say that while I may not be convinced about the
approach, I think the patches themselves don't look horrible. I
actually like your "__pfn_t". So while I (very obviously) have some
doubts about this approach, it may be that the most convincing
argument is just in the code.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 7:42 AM, Ingo Molnar  wrote:
>
> * Ingo Molnar  wrote:
>
>> [...]
>>
>> For anything more complex, that maps any of this storage to
>> user-space, or exposes it to higher level struct page based APIs,
>> etc., where references matter and it's more of a cache with
>> potentially multiple users, not an IO space, the natural API is
>> struct page.
>
> Let me walk back on this:
>
>> I'd say that this particular series mostly addresses the 'pfn as
>> sector_t' side of the equation, where persistent memory is IO space,
>> not memory space, and as such it is the more natural and thus also
>> the cheaper/faster approach.
>
> ... but that does not appear to be the case: this series replaces a
> 'struct page' interface with a pure pfn interface for the express
> purpose of being able to DMA to/from 'memory areas' that are not
> struct page backed.
>
>> Linus probably disagrees? :-)
>
> [ and he'd disagree rightfully ;-) ]
>
> So what this patch set tries to achieve is (sector_t -> sector_t) IO
> between storage devices (i.e. a rare and somewhat weird usecase), and
> does it by squeezing one device's storage address into our formerly
> struct page backed descriptor, via a pfn.
>
> That looks like a layering violation and a mistake to me. If we want
> to do direct (sector_t -> sector_t) IO, with no serialization worries,
> it should have its own (simple) API - which things like hierarchical
> RAID or RDMA APIs could use.

I'm wrapped around the idea that __pfn_t *is* that simple api for the
tiered storage driver use case.  For RDMA I think we need struct page
because I assume that would be coordinated through a filesystem an
truncate() is back in play.

What does an alternative API look like?

> If what we want to do is to support say an mmap() of a file on
> persistent storage, and then read() into that file from another device
> via DMA, then I think we should have allocated struct page backing at
> mmap() time already, and all regular syscall APIs would 'just work'
> from that point on - far above what page-less, pfn-based APIs can do.
>
> The temporary struct page backing can then be freed at munmap() time.

Yes, passing around mmap()'d (DAX) persistent memory will need more
than a __pfn_t.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 8:00 AM, Linus Torvalds
 wrote:
> On Wed, May 6, 2015 at 7:36 PM, Dan Williams  wrote:
>>
>> My pet concrete example is covered by __pfn_t.  Referencing persistent
>> memory in an md/dm hierarchical storage configuration.  Setting aside
>> the thrash to get existing block users to do "bvec_set_page(page)"
>> instead of "bvec->page = page" the onus is on that md/dm
>> implementation and backing storage device driver to operate on
>> __pfn_t.  That use case is simple because there is no use of page
>> locking or refcounting in that path, just dma_map_page() and
>> kmap_atomic().
>
> So clarify for me: are you trying to make the IO stack in general be
> able to use the persistent memory as a source (or destination) for IO
> to _other_ devices, or are you talking about just internally shuffling
> things around for something like RAID on top of persistent memory?
>
> Because I think those are two very different things.

Yes, they are, and I am referring to the former, persistent memory as
a source/destination to other devices.

> For example, one of the things I worry about is for people doing IO
> from persistent memory directly to some "slow stable storage" (aka
> disk). That was what I thought you were aiming for: infrastructure so
> that you can make a bio for a *disk* device contain a page list that
> is the persistent memory.
>
> And I think that is a very dangerous operation to do, because the
> persistent memory itself is going to have some filesystem on it, so
> anything that looks up the persistent memory pages is *not* going to
> have a stable pfn: the pfn will point to a fixed part of the
> persistent memory, but the file that was there may be deleted and the
> memory reassigned to something else.

Indeed, truncate() in the absence of struct page has been a major
hurdle for persistent memory enabling.  But it does not impact this
specific md/dm use case.  md/dm will have taken an exclusive claim on
an entire pmem block device (or partition), so there will be no
competing with a filesystem.

> That's the kind of thing that "struct page" helps with for normal IO
> devices. It's both a source of serialization and indirection, so that
> when somebody does a "truncate()" on a file, we don't end up doing IO
> to random stale locations on the disk that got reassigned to another
> file.
>
> So "struct page" is very fundamental. It's *not* just a "this is the
> physical source/drain of the data you are doing IO on".
>
> So if you are looking at some kind of "zero-copy IO", where you can do
> IO from a filesystem on persistent storage to *another* filesystem on
> (say, a big rotational disk used for long-term storage) by just doing
> a bo that targets the disk, but has the persistent memory as the
> source memory, I really want to understand how you are going to
> serialize this.
>
> So *that* is what I meant by "What is the primary thing that is
> driving this need? Do we have a very concrete example?"
>
> I abvsolutely do *not* want to teach the bio subsystem to just
> randomly be able to take the source/destination of the IO as being
> some random pfn without knowing what the actual uses are and how these
> IO's are generated in the first place.

blkdev_get(FMODE_EXCL) is the protection in this case.

> I was assuming that you wanted to do something where you mmap() the
> persistent memory, and then write it out to another device (possibly
> using aio_write()). But that really does require some kind of
> serialization at a higher level, because you can't just look up the
> pfn's in the page table and assume they are stable: they are *not*
> stable.

We want to get there eventually, but this patchset does not address that case.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Linus Torvalds
On Wed, May 6, 2015 at 7:36 PM, Dan Williams  wrote:
>
> My pet concrete example is covered by __pfn_t.  Referencing persistent
> memory in an md/dm hierarchical storage configuration.  Setting aside
> the thrash to get existing block users to do "bvec_set_page(page)"
> instead of "bvec->page = page" the onus is on that md/dm
> implementation and backing storage device driver to operate on
> __pfn_t.  That use case is simple because there is no use of page
> locking or refcounting in that path, just dma_map_page() and
> kmap_atomic().

So clarify for me: are you trying to make the IO stack in general be
able to use the persistent memory as a source (or destination) for IO
to _other_ devices, or are you talking about just internally shuffling
things around for something like RAID on top of persistent memory?

Because I think those are two very different things.

For example, one of the things I worry about is for people doing IO
from persistent memory directly to some "slow stable storage" (aka
disk). That was what I thought you were aiming for: infrastructure so
that you can make a bio for a *disk* device contain a page list that
is the persistent memory.

And I think that is a very dangerous operation to do, because the
persistent memory itself is going to have some filesystem on it, so
anything that looks up the persistent memory pages is *not* going to
have a stable pfn: the pfn will point to a fixed part of the
persistent memory, but the file that was there may be deleted and the
memory reassigned to something else.

That's the kind of thing that "struct page" helps with for normal IO
devices. It's both a source of serialization and indirection, so that
when somebody does a "truncate()" on a file, we don't end up doing IO
to random stale locations on the disk that got reassigned to another
file.

So "struct page" is very fundamental. It's *not* just a "this is the
physical source/drain of the data you are doing IO on".

So if you are looking at some kind of "zero-copy IO", where you can do
IO from a filesystem on persistent storage to *another* filesystem on
(say, a big rotational disk used for long-term storage) by just doing
a bo that targets the disk, but has the persistent memory as the
source memory, I really want to understand how you are going to
serialize this.

So *that* is what I meant by "What is the primary thing that is
driving this need? Do we have a very concrete example?"

I abvsolutely do *not* want to teach the bio subsystem to just
randomly be able to take the source/destination of the IO as being
some random pfn without knowing what the actual uses are and how these
IO's are generated in the first place.

I was assuming that you wanted to do something where you mmap() the
persistent memory, and then write it out to another device (possibly
using aio_write()). But that really does require some kind of
serialization at a higher level, because you can't just look up the
pfn's in the page table and assume they are stable: they are *not*
stable.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Ingo Molnar  wrote:

> [...]
>
> For anything more complex, that maps any of this storage to 
> user-space, or exposes it to higher level struct page based APIs, 
> etc., where references matter and it's more of a cache with 
> potentially multiple users, not an IO space, the natural API is 
> struct page.

Let me walk back on this:

> I'd say that this particular series mostly addresses the 'pfn as 
> sector_t' side of the equation, where persistent memory is IO space, 
> not memory space, and as such it is the more natural and thus also 
> the cheaper/faster approach.

... but that does not appear to be the case: this series replaces a 
'struct page' interface with a pure pfn interface for the express 
purpose of being able to DMA to/from 'memory areas' that are not 
struct page backed.

> Linus probably disagrees? :-)

[ and he'd disagree rightfully ;-) ]

So what this patch set tries to achieve is (sector_t -> sector_t) IO 
between storage devices (i.e. a rare and somewhat weird usecase), and 
does it by squeezing one device's storage address into our formerly 
struct page backed descriptor, via a pfn.

That looks like a layering violation and a mistake to me. If we want 
to do direct (sector_t -> sector_t) IO, with no serialization worries, 
it should have its own (simple) API - which things like hierarchical 
RAID or RDMA APIs could use.

If what we want to do is to support say an mmap() of a file on 
persistent storage, and then read() into that file from another device 
via DMA, then I think we should have allocated struct page backing at 
mmap() time already, and all regular syscall APIs would 'just work' 
from that point on - far above what page-less, pfn-based APIs can do.

The temporary struct page backing can then be freed at munmap() time.

And if the usage is pure fd based, we don't really have fd-to-fd APIs 
beyond the rarely used splice variants (and even those don't do pure 
cross-IO, they use a pipe as an intermediary), so there's no problem 
to solve I suspect.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams  wrote:

> > What is the primary thing that is driving this need? Do we have a 
> > very concrete example?
> 
> My pet concrete example is covered by __pfn_t.  Referencing 
> persistent memory in an md/dm hierarchical storage configuration.  
> Setting aside the thrash to get existing block users to do 
> "bvec_set_page(page)" instead of "bvec->page = page" the onus is on 
> that md/dm implementation and backing storage device driver to 
> operate on __pfn_t.  That use case is simple because there is no use 
> of page locking or refcounting in that path, just dma_map_page() and 
> kmap_atomic().  The more difficult use case is precisely what Al 
> picked up on, O_DIRECT and RDMA.  This patchset does nothing to 
> address those use cases outside of not needing a struct page when 
> they eventually craft a bio.

So why not do a dual approach?

There are code paths where the 'pfn' of a persistent device is mostly 
used as a sector_t equivalent of terabytes of storage, not as an index 
of a memory object.

It's not an address to a cache, it's an index into a huge storage 
space - which happens to be (flash) RAM. For them using pfn_t seems 
natural and using struct page * is a strained (not to mention 
expensive) model.

For more complex facilities, where persistent memory is used as a 
memory object, especially where the underlying device is true, 
unfinitely writable RAM (not flash), treating it as a memory zone, or 
setting up dynamic struct page would be the natural approach. (with 
the inevitable cost of setup/teardown in the latter case)

I'd say that for anything where the dynamic struct page is torn down 
unconditionally after completion of only a single use, the natural API 
is probably pfn_t, not struct page. Any synchronization is already 
handled at the block request layer already, and it's storage op 
synchronization, not memory access synchronization really.

For anything more complex, that maps any of this storage to 
user-space, or exposes it to higher level struct page based APIs, 
etc., where references matter and it's more of a cache with 
potentially multiple users, not an IO space, the natural API is struct 
page.

I'd say that this particular series mostly addresses the 'pfn as 
sector_t' side of the equation, where persistent memory is IO space, 
not memory space, and as such it is the more natural and thus also the 
cheaper/faster approach.

Linus probably disagrees? :-)

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams dan.j.willi...@intel.com wrote:

  What is the primary thing that is driving this need? Do we have a 
  very concrete example?
 
 My pet concrete example is covered by __pfn_t.  Referencing 
 persistent memory in an md/dm hierarchical storage configuration.  
 Setting aside the thrash to get existing block users to do 
 bvec_set_page(page) instead of bvec-page = page the onus is on 
 that md/dm implementation and backing storage device driver to 
 operate on __pfn_t.  That use case is simple because there is no use 
 of page locking or refcounting in that path, just dma_map_page() and 
 kmap_atomic().  The more difficult use case is precisely what Al 
 picked up on, O_DIRECT and RDMA.  This patchset does nothing to 
 address those use cases outside of not needing a struct page when 
 they eventually craft a bio.

So why not do a dual approach?

There are code paths where the 'pfn' of a persistent device is mostly 
used as a sector_t equivalent of terabytes of storage, not as an index 
of a memory object.

It's not an address to a cache, it's an index into a huge storage 
space - which happens to be (flash) RAM. For them using pfn_t seems 
natural and using struct page * is a strained (not to mention 
expensive) model.

For more complex facilities, where persistent memory is used as a 
memory object, especially where the underlying device is true, 
unfinitely writable RAM (not flash), treating it as a memory zone, or 
setting up dynamic struct page would be the natural approach. (with 
the inevitable cost of setup/teardown in the latter case)

I'd say that for anything where the dynamic struct page is torn down 
unconditionally after completion of only a single use, the natural API 
is probably pfn_t, not struct page. Any synchronization is already 
handled at the block request layer already, and it's storage op 
synchronization, not memory access synchronization really.

For anything more complex, that maps any of this storage to 
user-space, or exposes it to higher level struct page based APIs, 
etc., where references matter and it's more of a cache with 
potentially multiple users, not an IO space, the natural API is struct 
page.

I'd say that this particular series mostly addresses the 'pfn as 
sector_t' side of the equation, where persistent memory is IO space, 
not memory space, and as such it is the more natural and thus also the 
cheaper/faster approach.

Linus probably disagrees? :-)

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote:

 * Dan Williams dan.j.willi...@intel.com wrote:

  Anyway, I did want to say that while I may not be convinced about
  the approach, I think the patches themselves don't look horrible.
  I actually like your __pfn_t. So while I (very obviously) have
  some doubts about this approach, it may be that the most
  convincing argument is just in the code.

 Ok, I'll keep thinking about this and come back when we have a
 better story about passing mmap'd persistent memory around in
 userspace.

 So is there anything fundamentally wrong about creating struct page
 backing at mmap() time (and making sure aliased mmaps share struct
 page arrays)?

Something like get_user_pages() triggers memory hotplug for
persistent memory, so they are actual real struct pages?  Can we do
memory hotplug at that granularity?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Ingo Molnar mi...@kernel.org wrote:

  Is handling kernel pagefault on the vmemmap completely out of the 
  picture ? So we would carveout a chunck of kernel address space 
  for those pfn and use it for vmemmap and handle pagefault on it.
 
 That's pretty clever. The page fault doesn't even have to do remote 
 TLB shootdown, because it only establishes mappings - so it's pretty 
 atomic, a bit like the minor vmalloc() area faults we are doing.
 
 Some sort of LRA (least recently allocated) scheme could unmap the 
 area in chunks if it's beyond a certain size, to keep a limit on 
 size. Done from the same context and would use remote TLB shootdown.
 
 The only limitation I can see is that such faults would have to be 
 able to sleep, to do the allocation. So pfn_to_page() could not be 
 used in arbitrary contexts.

So another complication would be that we cannot just unmap such pages 
when we want to recycle them, because the struct page in them might be 
in use - so all struct page uses would have to refcount the underlying 
page. We don't really do that today: code just looks up struct pages 
and assumes they never go away.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Jerome Glisse
On Thu, May 07, 2015 at 09:11:07PM +0200, Ingo Molnar wrote:
 
 * Dave Hansen dave.han...@linux.intel.com wrote:
 
  On 05/07/2015 10:42 AM, Dan Williams wrote:
   On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote:
   * Dan Williams dan.j.willi...@intel.com wrote:
  
   So is there anything fundamentally wrong about creating struct 
   page backing at mmap() time (and making sure aliased mmaps share 
   struct page arrays)?
   
   Something like get_user_pages() triggers memory hotplug for 
   persistent memory, so they are actual real struct pages?  Can we 
   do memory hotplug at that granularity?
  
  We've traditionally limited them to SECTION_SIZE granularity, which 
  is 128MB IIRC.  There are also assumptions in places that you can do 
  page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.
 
 I really don't think that's very practical: memory hotplug is slow, 
 it's really not on the same abstraction level as mmap(), and the zone 
 data structures are also fundamentally very coarse: not just because 
 RAM ranges are huge, but also so that the pfn-page transformation 
 stays relatively simple and fast.
 
  But, in all practicality, a lot of those places are in code like the 
  buddy allocator.  If your PTEs all have _PAGE_SPECIAL set and we're 
  not ever expecting these fake 'struct page's to hit these code 
  paths, it probably doesn't matter.
  
  You can probably get away with just allocating PAGE_SIZE worth of 
  'struct page' (which is 64) and mapping it in to vmemmap[].  The 
  worst case is that you'll eat 1 page of space for each outstanding 
  page of I/O.  That's a lot better than 2MB of temporary 'struct 
  page' space per page of I/O that it would take with a traditional 
  hotplug operation.
 
 So I think the main value of struct page is if everyone on the system 
 sees the same struct page for the same pfn - not just the temporary IO 
 instance.
 
 The idea of having very temporary struct page arrays misses the point 
 I think: if struct page is used as essentially an IO sglist then most 
 of the synchronization properties are lost: then we might as well use 
 the real deal in that case and skip the dynamic allocation and use 
 pfns directly and avoid the dynamic allocation overhead.
 
 Stable, global page-struct descriptors are a given for real RAM, where 
 we allocate a struct page for every page in nice, large, mostly linear 
 arrays.
 
 We'd really need that for pmem too, to get the full power of struct 
 page: and that means allocating them in nice, large, predictable 
 places - such as on the device itself ...

Is handling kernel pagefault on the vmemmap completely out of the
picture ? So we would carveout a chunck of kernel address space for
those pfn and use it for vmemmap and handle pagefault on it.

Again here i think that GPU folks would like a solution where they can
have a page struct but it would not be PMEM just device memory. So if
we can come up with something generic enough to server both purpose
that would be better in my view.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 11:40 AM, Ingo Molnar mi...@kernel.org wrote:

 * Dan Williams dan.j.willi...@intel.com wrote:

 On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig h...@lst.de wrote:
  On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
  What is the primary thing that is driving this need? Do we have a very
  concrete example?
 
  FYI, I plan to to implement RAID acceleration using nvdimms, and I
  plan to ue pages for that.  The code just merge for 4.1 can easily
  support page backing, and I plan to use that for now.  This still
  leaves support for the gigantic intel nvdimms discovered over EFI
  out, but given that I don't have access to them, and I dont know
  of any publically available there's little I can do for now.  But
  adding on demand allocate struct pages for the seems like the
  easiest way forward.  Boaz already has code to allocate pages for
  them, although not on demand but at boot / plug in time.

 Hmmm, the capacities of persistent memory that would be assigned for
 a raid accelerator would be limited by diminishing returns.  I.e.
 there seems to be no point to assign more than 8GB or so to the
 cache? [...]

 Why would that be the case?

 If it's not a temporary cache but a persistent cache that hosts all
 the data even after writeback completes then going to huge sizes will
 bring similar benefits to using a large, fast SSD disk on your
 desktop... The larger, the better. And it also persists across
 reboots.

True, that's more dm-cache than RAID accelerator, but point taken.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 10:43 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Thu, May 7, 2015 at 9:03 AM, Dan Williams dan.j.willi...@intel.com wrote:

 Ok, I'll keep thinking about this and come back when we have a better
 story about passing mmap'd persistent memory around in userspace.

 Ok. And if we do decide to go with your kind of __pfn type, I'd
 probably prefer that we encode the type in the low bits of the word
 rather than compare against PAGE_OFFSET. On some architectures
 PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
 but even on x86 it's a *lot* cheaper to test the low bit than it is to
 compare against a big constant.

 We know struct page * is supposed to be at least aligned to at least
 unsigned long, so you'd have two bits of type information (and we
 could easily make it three). With 0 being a real pointer, so that
 you can use the pointer itself without masking.

 And the hide type in low bits of pointer is something we've done
 quite a lot, so it's more kernel coding style anyway.

Ok.  Although __pfn_t also stores pfn values directly which will
consume those 2 bits so we'll need to shift pfns up when storing.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Jerome Glisse
On Thu, May 07, 2015 at 09:53:13PM +0200, Ingo Molnar wrote:
 
 * Ingo Molnar mi...@kernel.org wrote:
 
   Is handling kernel pagefault on the vmemmap completely out of the 
   picture ? So we would carveout a chunck of kernel address space 
   for those pfn and use it for vmemmap and handle pagefault on it.
  
  That's pretty clever. The page fault doesn't even have to do remote 
  TLB shootdown, because it only establishes mappings - so it's pretty 
  atomic, a bit like the minor vmalloc() area faults we are doing.
  
  Some sort of LRA (least recently allocated) scheme could unmap the 
  area in chunks if it's beyond a certain size, to keep a limit on 
  size. Done from the same context and would use remote TLB shootdown.
  
  The only limitation I can see is that such faults would have to be 
  able to sleep, to do the allocation. So pfn_to_page() could not be 
  used in arbitrary contexts.
 
 So another complication would be that we cannot just unmap such pages 
 when we want to recycle them, because the struct page in them might be 
 in use - so all struct page uses would have to refcount the underlying 
 page. We don't really do that today: code just looks up struct pages 
 and assumes they never go away.

I still think this is doable, like i said in another email, i think we
should introduce a special pfn_to_page_dev|pmem|waffle|somethingyoulike()
to place that are allowed to allocate the underlying struct page.

For instance we can use a default page to backup all this special vmem
range with some specialy crafted struct page that says that its is
invalid memory (make this mapping read only so all write to this
special struct page is forbidden).

Now once an authorized user comes along and need a real struct page it
trigger a page allocation that replace the page full of fake invalid
struct page with a page with correct valid struct page that can be
manipulated by other part of the kernel.

So regular pfn_to_page() would test against special vmemmap and if
special test the content of struct page for some flag. If it's the
invalid page flag it returns 0.

But once a proper struct page is allocated then pfn_page would return
the struct page as expected.

That way you will catch all invalid user of such page ie user that use
the page after its lifetime is done. You will also limit the creation
of the underlying proper struct page to only code that are legitimate
to ask for a proper struct page for given pfn.

Also you would get kernel write fault on the page full of fake struct
page and that would allow to catch further wrong use.

Anyway this is how i envision this and i think it would work for my
usecase too (GPU it is for me :))

Cheers,
Jérôme
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dave Hansen
On 05/07/2015 10:42 AM, Dan Williams wrote:
 On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote:
 * Dan Williams dan.j.willi...@intel.com wrote:
 So is there anything fundamentally wrong about creating struct page
 backing at mmap() time (and making sure aliased mmaps share struct
 page arrays)?
 
 Something like get_user_pages() triggers memory hotplug for
 persistent memory, so they are actual real struct pages?  Can we do
 memory hotplug at that granularity?

We've traditionally limited them to SECTION_SIZE granularity, which is
128MB IIRC.  There are also assumptions in places that you can do page++
within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.

But, in all practicality, a lot of those places are in code like the
buddy allocator.  If your PTEs all have _PAGE_SPECIAL set and we're not
ever expecting these fake 'struct page's to hit these code paths, it
probably doesn't matter.

You can probably get away with just allocating PAGE_SIZE worth of
'struct page' (which is 64) and mapping it in to vmemmap[].  The worst
case is that you'll eat 1 page of space for each outstanding page of
I/O.  That's a lot better than 2MB of temporary 'struct page' space per
page of I/O that it would take with a traditional hotplug operation.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams dan.j.willi...@intel.com wrote:

 On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig h...@lst.de wrote:
  On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
  What is the primary thing that is driving this need? Do we have a very
  concrete example?
 
  FYI, I plan to to implement RAID acceleration using nvdimms, and I 
  plan to ue pages for that.  The code just merge for 4.1 can easily 
  support page backing, and I plan to use that for now.  This still 
  leaves support for the gigantic intel nvdimms discovered over EFI 
  out, but given that I don't have access to them, and I dont know 
  of any publically available there's little I can do for now.  But 
  adding on demand allocate struct pages for the seems like the 
  easiest way forward.  Boaz already has code to allocate pages for 
  them, although not on demand but at boot / plug in time.
 
 Hmmm, the capacities of persistent memory that would be assigned for 
 a raid accelerator would be limited by diminishing returns.  I.e. 
 there seems to be no point to assign more than 8GB or so to the 
 cache? [...]

Why would that be the case?

If it's not a temporary cache but a persistent cache that hosts all 
the data even after writeback completes then going to huge sizes will 
bring similar benefits to using a large, fast SSD disk on your 
desktop... The larger, the better. And it also persists across 
reboots.

It could also host the RAID write intent bitmap (the dirty 
stripes/chunks bitmap) for extra speedups. (This bitmap is pretty 
small, but important to speed up resyncs after crashes or power loss.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Jerome Glisse j.gli...@gmail.com wrote:

  So I think the main value of struct page is if everyone on the 
  system sees the same struct page for the same pfn - not just the 
  temporary IO instance.
  
  The idea of having very temporary struct page arrays misses the 
  point I think: if struct page is used as essentially an IO sglist 
  then most of the synchronization properties are lost: then we 
  might as well use the real deal in that case and skip the dynamic 
  allocation and use pfns directly and avoid the dynamic allocation 
  overhead.
  
  Stable, global page-struct descriptors are a given for real RAM, 
  where we allocate a struct page for every page in nice, large, 
  mostly linear arrays.
  
  We'd really need that for pmem too, to get the full power of 
  struct page: and that means allocating them in nice, large, 
  predictable places - such as on the device itself ...
 
 Is handling kernel pagefault on the vmemmap completely out of the 
 picture ? So we would carveout a chunck of kernel address space for 
 those pfn and use it for vmemmap and handle pagefault on it.

That's pretty clever. The page fault doesn't even have to do remote 
TLB shootdown, because it only establishes mappings - so it's pretty 
atomic, a bit like the minor vmalloc() area faults we are doing.

Some sort of LRA (least recently allocated) scheme could unmap the 
area in chunks if it's beyond a certain size, to keep a limit on size. 
Done from the same context and would use remote TLB shootdown.

The only limitation I can see is that such faults would have to be 
able to sleep, to do the allocation. So pfn_to_page() could not be 
used in arbitrary contexts.

 Again here i think that GPU folks would like a solution where they 
 can have a page struct but it would not be PMEM just device memory. 
 So if we can come up with something generic enough to server both 
 purpose that would be better in my view.

Yes.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Linus Torvalds
On Thu, May 7, 2015 at 9:03 AM, Dan Williams dan.j.willi...@intel.com wrote:

 Ok, I'll keep thinking about this and come back when we have a better
 story about passing mmap'd persistent memory around in userspace.

Ok. And if we do decide to go with your kind of __pfn type, I'd
probably prefer that we encode the type in the low bits of the word
rather than compare against PAGE_OFFSET. On some architectures
PAGE_OFFSET is zero (admittedly probably not ones you'd care about),
but even on x86 it's a *lot* cheaper to test the low bit than it is to
compare against a big constant.

We know struct page * is supposed to be at least aligned to at least
unsigned long, so you'd have two bits of type information (and we
could easily make it three). With 0 being a real pointer, so that
you can use the pointer itself without masking.

And the hide type in low bits of pointer is something we've done
quite a lot, so it's more kernel coding style anyway.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams dan.j.willi...@intel.com wrote:

  That looks like a layering violation and a mistake to me. If we 
  want to do direct (sector_t - sector_t) IO, with no serialization 
  worries, it should have its own (simple) API - which things like 
  hierarchical RAID or RDMA APIs could use.
 
 I'm wrapped around the idea that __pfn_t *is* that simple api for 
 the tiered storage driver use case. [...]

I agree. (see my previous mail)

 [...] For RDMA I think we need struct page because I assume that 
 would be coordinated through a filesystem an truncate() is back in 
 play.

So I don't think RDMA is necessarily special, it's just a weirdly 
programmed DMA request:

 - If it is used internally by an exclusively managed complex storage
   driver, then it can use low level block APIs and pfn_t.

 - If RDMA is exposed all the way to user-space (do we have such 
   APIs?), allowing users to initiate RDMA IO into user buffers, then 
   (the user visible) buffer needs struct page backing. (which in turn 
   will then at some lower level convert to pfns.)

   That's true for both regular RAM pages and mmap()-ed persistent RAM 
   pages as well.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dave Hansen dave.han...@linux.intel.com wrote:

 On 05/07/2015 10:42 AM, Dan Williams wrote:
  On Thu, May 7, 2015 at 10:36 AM, Ingo Molnar mi...@kernel.org wrote:
  * Dan Williams dan.j.willi...@intel.com wrote:
 
  So is there anything fundamentally wrong about creating struct 
  page backing at mmap() time (and making sure aliased mmaps share 
  struct page arrays)?
  
  Something like get_user_pages() triggers memory hotplug for 
  persistent memory, so they are actual real struct pages?  Can we 
  do memory hotplug at that granularity?
 
 We've traditionally limited them to SECTION_SIZE granularity, which 
 is 128MB IIRC.  There are also assumptions in places that you can do 
 page++ within a MAX_ORDER block if !CONFIG_HOLES_IN_ZONE.

I really don't think that's very practical: memory hotplug is slow, 
it's really not on the same abstraction level as mmap(), and the zone 
data structures are also fundamentally very coarse: not just because 
RAM ranges are huge, but also so that the pfn-page transformation 
stays relatively simple and fast.

 But, in all practicality, a lot of those places are in code like the 
 buddy allocator.  If your PTEs all have _PAGE_SPECIAL set and we're 
 not ever expecting these fake 'struct page's to hit these code 
 paths, it probably doesn't matter.
 
 You can probably get away with just allocating PAGE_SIZE worth of 
 'struct page' (which is 64) and mapping it in to vmemmap[].  The 
 worst case is that you'll eat 1 page of space for each outstanding 
 page of I/O.  That's a lot better than 2MB of temporary 'struct 
 page' space per page of I/O that it would take with a traditional 
 hotplug operation.

So I think the main value of struct page is if everyone on the system 
sees the same struct page for the same pfn - not just the temporary IO 
instance.

The idea of having very temporary struct page arrays misses the point 
I think: if struct page is used as essentially an IO sglist then most 
of the synchronization properties are lost: then we might as well use 
the real deal in that case and skip the dynamic allocation and use 
pfns directly and avoid the dynamic allocation overhead.

Stable, global page-struct descriptors are a given for real RAM, where 
we allocate a struct page for every page in nice, large, mostly linear 
arrays.

We'd really need that for pmem too, to get the full power of struct 
page: and that means allocating them in nice, large, predictable 
places - such as on the device itself ...

It might even be 'scattered' across the device, with 64 byte struct 
page size we can pack 64 descriptors into a single page, so every 65 
pages we could have a page-struct page.

Finding a pmem page's struct page would thus involve rounding it 
modulo 65 and reading that page.

The problem with that is fourfold:

 - that we now turn a very kernel internal API and data structure into 
   an ABI. If struct page grows beyond 64 bytes it's a problem.

 - on bootup (or device discovery time) we'd have to initialize all 
   the page structs. We could probably do this in a hierarchical way, 
   by dividing continuous pmem ranges into power-of-two groups of 
   blocks, and organizing them like the buddy allocator does.

 - 1.5% of storage space lost.

 - will wear-leveling properly migrate these 'hot' pages around?

The alternative would be some global interval-rbtree of struct page 
backed pmem ranges.

Beyond the synchronization problems of such a data structure (which 
looks like a nightmare) I don't think it's even feasible: especially 
if there's a filesystem on the pmem device then the block allocations 
could be physically fragmented (and there's no fundamental reason why 
they couldn't be fragmented), so a continuous mmap() of a file on it 
will yield wildly fragmented device-pfn ranges, exploding the rbtree. 
Think 1 million node interval-rbtree with an average depth of 20: 
cachemiss country for even simple lookups - not to mention the 
freeing/recycling complexity of unused struct pages to not allow it to 
grow too large.

I might be wrong though about all this :)

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

Al,

I was wondering about the struct page rules of 
iov_iter_get_pages_alloc(), used in various places. There's no 
documentation whatsoever in lib/iov_iter.c, nor in 
include/linux/uio.h, and the changelog that introduced it only says:

 commit 91f79c43d1b54d7154b118860d81b39bad07dfff
 Author: Al Viro v...@zeniv.linux.org.uk
 Date:   Fri Mar 21 04:58:33 2014 -0400

new helper: iov_iter_get_pages_alloc()

same as iov_iter_get_pages(), except that pages array is allocated
(kmalloc if possible, vmalloc if that fails) and left for caller to
free.  Lustre and NFS -direct_IO() switched to it.

Signed-off-by: Al Viro v...@zeniv.linux.org.uk

So if code does iov_iter_get_pages_alloc() on a user address that has 
a real struct page behind it - and some other code does a regular 
get_user_pages() on it, we'll have two sets of struct page 
descriptors, the 'real' one, and a fake allocated one, right?

How does that work? Nobody else can ever discover these fake page 
structs, so they don't really serve any 'real' synchronization purpose 
other than the limited role of IO completion.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Ingo Molnar mi...@kernel.org wrote:

 [...]

 For anything more complex, that maps any of this storage to 
 user-space, or exposes it to higher level struct page based APIs, 
 etc., where references matter and it's more of a cache with 
 potentially multiple users, not an IO space, the natural API is 
 struct page.

Let me walk back on this:

 I'd say that this particular series mostly addresses the 'pfn as 
 sector_t' side of the equation, where persistent memory is IO space, 
 not memory space, and as such it is the more natural and thus also 
 the cheaper/faster approach.

... but that does not appear to be the case: this series replaces a 
'struct page' interface with a pure pfn interface for the express 
purpose of being able to DMA to/from 'memory areas' that are not 
struct page backed.

 Linus probably disagrees? :-)

[ and he'd disagree rightfully ;-) ]

So what this patch set tries to achieve is (sector_t - sector_t) IO 
between storage devices (i.e. a rare and somewhat weird usecase), and 
does it by squeezing one device's storage address into our formerly 
struct page backed descriptor, via a pfn.

That looks like a layering violation and a mistake to me. If we want 
to do direct (sector_t - sector_t) IO, with no serialization worries, 
it should have its own (simple) API - which things like hierarchical 
RAID or RDMA APIs could use.

If what we want to do is to support say an mmap() of a file on 
persistent storage, and then read() into that file from another device 
via DMA, then I think we should have allocated struct page backing at 
mmap() time already, and all regular syscall APIs would 'just work' 
from that point on - far above what page-less, pfn-based APIs can do.

The temporary struct page backing can then be freed at munmap() time.

And if the usage is pure fd based, we don't really have fd-to-fd APIs 
beyond the rarely used splice variants (and even those don't do pure 
cross-IO, they use a pipe as an intermediary), so there's no problem 
to solve I suspect.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 7:42 AM, Ingo Molnar mi...@kernel.org wrote:

 * Ingo Molnar mi...@kernel.org wrote:

 [...]

 For anything more complex, that maps any of this storage to
 user-space, or exposes it to higher level struct page based APIs,
 etc., where references matter and it's more of a cache with
 potentially multiple users, not an IO space, the natural API is
 struct page.

 Let me walk back on this:

 I'd say that this particular series mostly addresses the 'pfn as
 sector_t' side of the equation, where persistent memory is IO space,
 not memory space, and as such it is the more natural and thus also
 the cheaper/faster approach.

 ... but that does not appear to be the case: this series replaces a
 'struct page' interface with a pure pfn interface for the express
 purpose of being able to DMA to/from 'memory areas' that are not
 struct page backed.

 Linus probably disagrees? :-)

 [ and he'd disagree rightfully ;-) ]

 So what this patch set tries to achieve is (sector_t - sector_t) IO
 between storage devices (i.e. a rare and somewhat weird usecase), and
 does it by squeezing one device's storage address into our formerly
 struct page backed descriptor, via a pfn.

 That looks like a layering violation and a mistake to me. If we want
 to do direct (sector_t - sector_t) IO, with no serialization worries,
 it should have its own (simple) API - which things like hierarchical
 RAID or RDMA APIs could use.

I'm wrapped around the idea that __pfn_t *is* that simple api for the
tiered storage driver use case.  For RDMA I think we need struct page
because I assume that would be coordinated through a filesystem an
truncate() is back in play.

What does an alternative API look like?

 If what we want to do is to support say an mmap() of a file on
 persistent storage, and then read() into that file from another device
 via DMA, then I think we should have allocated struct page backing at
 mmap() time already, and all regular syscall APIs would 'just work'
 from that point on - far above what page-less, pfn-based APIs can do.

 The temporary struct page backing can then be freed at munmap() time.

Yes, passing around mmap()'d (DAX) persistent memory will need more
than a __pfn_t.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Christoph Hellwig
On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
 What is the primary thing that is driving this need? Do we have a very
 concrete example?

FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
ue pages for that.  The code just merge for 4.1 can easily support page
backing, and I plan to use that for now.  This still leaves support
for the gigantic intel nvdimms discovered over EFI out, but given that
I don't have access to them, and I dont know of any publically available
there's little I can do for now.  But adding on demand allocate struct
pages for the seems like the easiest way forward.  Boaz already has
code to allocate pages for them, although not on demand but at boot / plug in
time.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 8:00 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, May 6, 2015 at 7:36 PM, Dan Williams dan.j.willi...@intel.com wrote:

 My pet concrete example is covered by __pfn_t.  Referencing persistent
 memory in an md/dm hierarchical storage configuration.  Setting aside
 the thrash to get existing block users to do bvec_set_page(page)
 instead of bvec-page = page the onus is on that md/dm
 implementation and backing storage device driver to operate on
 __pfn_t.  That use case is simple because there is no use of page
 locking or refcounting in that path, just dma_map_page() and
 kmap_atomic().

 So clarify for me: are you trying to make the IO stack in general be
 able to use the persistent memory as a source (or destination) for IO
 to _other_ devices, or are you talking about just internally shuffling
 things around for something like RAID on top of persistent memory?

 Because I think those are two very different things.

Yes, they are, and I am referring to the former, persistent memory as
a source/destination to other devices.

 For example, one of the things I worry about is for people doing IO
 from persistent memory directly to some slow stable storage (aka
 disk). That was what I thought you were aiming for: infrastructure so
 that you can make a bio for a *disk* device contain a page list that
 is the persistent memory.

 And I think that is a very dangerous operation to do, because the
 persistent memory itself is going to have some filesystem on it, so
 anything that looks up the persistent memory pages is *not* going to
 have a stable pfn: the pfn will point to a fixed part of the
 persistent memory, but the file that was there may be deleted and the
 memory reassigned to something else.

Indeed, truncate() in the absence of struct page has been a major
hurdle for persistent memory enabling.  But it does not impact this
specific md/dm use case.  md/dm will have taken an exclusive claim on
an entire pmem block device (or partition), so there will be no
competing with a filesystem.

 That's the kind of thing that struct page helps with for normal IO
 devices. It's both a source of serialization and indirection, so that
 when somebody does a truncate() on a file, we don't end up doing IO
 to random stale locations on the disk that got reassigned to another
 file.

 So struct page is very fundamental. It's *not* just a this is the
 physical source/drain of the data you are doing IO on.

 So if you are looking at some kind of zero-copy IO, where you can do
 IO from a filesystem on persistent storage to *another* filesystem on
 (say, a big rotational disk used for long-term storage) by just doing
 a bo that targets the disk, but has the persistent memory as the
 source memory, I really want to understand how you are going to
 serialize this.

 So *that* is what I meant by What is the primary thing that is
 driving this need? Do we have a very concrete example?

 I abvsolutely do *not* want to teach the bio subsystem to just
 randomly be able to take the source/destination of the IO as being
 some random pfn without knowing what the actual uses are and how these
 IO's are generated in the first place.

blkdev_get(FMODE_EXCL) is the protection in this case.

 I was assuming that you wanted to do something where you mmap() the
 persistent memory, and then write it out to another device (possibly
 using aio_write()). But that really does require some kind of
 serialization at a higher level, because you can't just look up the
 pfn's in the page table and assume they are stable: they are *not*
 stable.

We want to get there eventually, but this patchset does not address that case.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Linus Torvalds
On Thu, May 7, 2015 at 8:40 AM, Dan Williams dan.j.willi...@intel.com wrote:

 blkdev_get(FMODE_EXCL) is the protection in this case.

Ugh. That looks like a horrible nasty big hammer that will bite us
badly some day. Since you'd have to hold it for the whole IO. But I
guess it at least works.

Anyway, I did want to say that while I may not be convinced about the
approach, I think the patches themselves don't look horrible. I
actually like your __pfn_t. So while I (very obviously) have some
doubts about this approach, it may be that the most convincing
argument is just in the code.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 8:58 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Thu, May 7, 2015 at 8:40 AM, Dan Williams dan.j.willi...@intel.com wrote:

 blkdev_get(FMODE_EXCL) is the protection in this case.

 Ugh. That looks like a horrible nasty big hammer that will bite us
 badly some day. Since you'd have to hold it for the whole IO. But I
 guess it at least works.

Oh no, that wouldn't be per-I/O that would be permanent at
configuration set up time just like a raid member device.

Something like:
mdadm --create /dev/md0 --cache=/dev/pmem0p1 --storage=/dev/sda

 Anyway, I did want to say that while I may not be convinced about the
 approach, I think the patches themselves don't look horrible. I
 actually like your __pfn_t. So while I (very obviously) have some
 doubts about this approach, it may be that the most convincing
 argument is just in the code.

Ok, I'll keep thinking about this and come back when we have a better
story about passing mmap'd persistent memory around in userspace.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Linus Torvalds
On Wed, May 6, 2015 at 7:36 PM, Dan Williams dan.j.willi...@intel.com wrote:

 My pet concrete example is covered by __pfn_t.  Referencing persistent
 memory in an md/dm hierarchical storage configuration.  Setting aside
 the thrash to get existing block users to do bvec_set_page(page)
 instead of bvec-page = page the onus is on that md/dm
 implementation and backing storage device driver to operate on
 __pfn_t.  That use case is simple because there is no use of page
 locking or refcounting in that path, just dma_map_page() and
 kmap_atomic().

So clarify for me: are you trying to make the IO stack in general be
able to use the persistent memory as a source (or destination) for IO
to _other_ devices, or are you talking about just internally shuffling
things around for something like RAID on top of persistent memory?

Because I think those are two very different things.

For example, one of the things I worry about is for people doing IO
from persistent memory directly to some slow stable storage (aka
disk). That was what I thought you were aiming for: infrastructure so
that you can make a bio for a *disk* device contain a page list that
is the persistent memory.

And I think that is a very dangerous operation to do, because the
persistent memory itself is going to have some filesystem on it, so
anything that looks up the persistent memory pages is *not* going to
have a stable pfn: the pfn will point to a fixed part of the
persistent memory, but the file that was there may be deleted and the
memory reassigned to something else.

That's the kind of thing that struct page helps with for normal IO
devices. It's both a source of serialization and indirection, so that
when somebody does a truncate() on a file, we don't end up doing IO
to random stale locations on the disk that got reassigned to another
file.

So struct page is very fundamental. It's *not* just a this is the
physical source/drain of the data you are doing IO on.

So if you are looking at some kind of zero-copy IO, where you can do
IO from a filesystem on persistent storage to *another* filesystem on
(say, a big rotational disk used for long-term storage) by just doing
a bo that targets the disk, but has the persistent memory as the
source memory, I really want to understand how you are going to
serialize this.

So *that* is what I meant by What is the primary thing that is
driving this need? Do we have a very concrete example?

I abvsolutely do *not* want to teach the bio subsystem to just
randomly be able to take the source/destination of the IO as being
some random pfn without knowing what the actual uses are and how these
IO's are generated in the first place.

I was assuming that you wanted to do something where you mmap() the
persistent memory, and then write it out to another device (possibly
using aio_write()). But that really does require some kind of
serialization at a higher level, because you can't just look up the
pfn's in the page table and assume they are stable: they are *not*
stable.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Dan Williams
On Thu, May 7, 2015 at 9:18 AM, Christoph Hellwig h...@lst.de wrote:
 On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
 What is the primary thing that is driving this need? Do we have a very
 concrete example?

 FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
 ue pages for that.  The code just merge for 4.1 can easily support page
 backing, and I plan to use that for now.  This still leaves support
 for the gigantic intel nvdimms discovered over EFI out, but given that
 I don't have access to them, and I dont know of any publically available
 there's little I can do for now.  But adding on demand allocate struct
 pages for the seems like the easiest way forward.  Boaz already has
 code to allocate pages for them, although not on demand but at boot / plug in
 time.

Hmmm, the capacities of persistent memory that would be assigned for a
raid accelerator would be limited by diminishing returns.  I.e. there
seems to be no point to assign more than 8GB or so to the cache?  If
that's the case the capacity argument loses some teeth, just
blk_get(FMODE_EXCL) + memory_hotplug a small capacity and be done.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Jerome Glisse
On Thu, May 07, 2015 at 06:18:07PM +0200, Christoph Hellwig wrote:
 On Wed, May 06, 2015 at 05:19:48PM -0700, Linus Torvalds wrote:
  What is the primary thing that is driving this need? Do we have a very
  concrete example?
 
 FYI, I plan to to implement RAID acceleration using nvdimms, and I plan to
 ue pages for that.  The code just merge for 4.1 can easily support page
 backing, and I plan to use that for now.  This still leaves support
 for the gigantic intel nvdimms discovered over EFI out, but given that
 I don't have access to them, and I dont know of any publically available
 there's little I can do for now.  But adding on demand allocate struct
 pages for the seems like the easiest way forward.  Boaz already has
 code to allocate pages for them, although not on demand but at boot / plug in
 time.

I think here other folks might be interested, i am ccing Paul. But for GPU
we are facing similar issue of trying to present the GPU memory to the kernel
in a coherent way (coherent from the design and linux kernel concept POV).

For this dynamicaly allocated struct page might effectively be a solution that
could be share btw persistent memory and GPU folks. We can even enforce thing
like VMEMMAP and have special region carveout where we can dynamicly map/unmap
backing page for range of device pfn. This would also allow to catch people
trying to access such page, we could add a set of new helper like :
get_page_dev()/put_page_dev() ... and only the _dev version would works on
this new kind of memory, regular get_page()/put_page() would throw error.
This should allow to make sure only legitimate users are referencing such
page.

Issue might be that we can run out of kernel address space with 48bits but if
such monstruous computer ever see the light of day they might consider using
CPU with more bits.

Another issue is that we might care for the 32bits platform too, but that's
solvable at a small cost.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-07 Thread Ingo Molnar

* Dan Williams dan.j.willi...@intel.com wrote:

  Anyway, I did want to say that while I may not be convinced about 
  the approach, I think the patches themselves don't look horrible. 
  I actually like your __pfn_t. So while I (very obviously) have 
  some doubts about this approach, it may be that the most 
  convincing argument is just in the code.
 
 Ok, I'll keep thinking about this and come back when we have a 
 better story about passing mmap'd persistent memory around in 
 userspace.

So is there anything fundamentally wrong about creating struct page 
backing at mmap() time (and making sure aliased mmaps share struct 
page arrays)?

Because if that is done, then the DMA agent won't even know about the 
memory being persistent RAM. It's just a regular struct page, that 
happens to point to persistent RAM. Same goes for all the high level 
VM APIs, futexes, etc. Everything will Just Work.

It will also be relatively fast: mmap() is a relative slowpath, 
comparatively.

As far as RAID is concerned: that's a relatively easy situation, as 
there's only a single user of the devices, the RAID context that 
manages all component devices exclusively. Device to device DMA can 
use the block layer directly, i.e. most of the patches you've got here 
in this series, except:

74287   C May 06 Dan Williams( 232) ├─[PATCH v2 09/10] dax: convert to 
__pfn_t

I think DAX mmap()s need struct page backing.

I think there's a simple rule: if a page is visible to user-space via 
the MMU then it needs struct page backing. If it's hidden, like 
behind a RAID abstraction, it probably doesn't.

With the remaining patches a high level RAID driver ought to be able 
to send pfn-to-sector and sector-to-pfn requests to other block 
drivers, without any unnecessary struct page allocation overhead, 
right?

As long as the pfn concept remains a clever way to reuse our 
ram-sector interfaces to implement sector-sector IO, in the cases 
where the IO has no serialization or MMU concerns, not using struct 
page and using pfn_t looks natural.

The moment it starts reaching user space APIs, like in the DAX case, 
and especially if it becomes user-MMU visible, it's a mistake to not 
have struct page backing, I think.

(In that sense the current DAX mmap() code is already a partial 
mistake.)

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Dan Williams
On Wed, May 6, 2015 at 5:19 PM, Linus Torvalds
 wrote:
> On Wed, May 6, 2015 at 4:47 PM, Dan Williams  wrote:
>>
>> Conceptually better, but certainly more difficult to audit if the fake
>> struct page is initialized in a subtle way that breaks when/if it
>> leaks to some unwitting context.
>
> Maybe. It could go either way, though. In particular, with the
> "dynamically allocated struct page" approach, if somebody uses it past
> the supposed lifetime of the use, things like poisoning the temporary
> "struct page" could be fairly effective. You can't really poison the
> pfn - it's just a number, and if somebody uses it later than you think
> (and you have re-used that physical memory for something else), you'll
> never ever know.

True, but there's little need to poison a _pfn_t because it's
permanent once discovered via ->direct_access() on the hosting struct
block_device.  Sure, kmap_atomic_pfn_t() may fail when the pmem driver
unbinds from a device, but the __pfn_t is still valid.  Obviously, we
can only support atomic kmap(s) with this property, and it would be
nice to fault if someone continued to use the __pfn_t after the
hosting device was disabled.  To be clear, DAX has this same problem
today.  Nothing stops whomever called ->direct_access() to continue
using the pfn after the backing device has been disabled.

> I'd *assume* that most users of the dynamic "struct page" allocation
> have very clear lifetime rules. Those things would presumably normally
> get looked-up by some extended version of "get_user_pages()", and
> there's a clear use of the result, with no longer lifetime. Also, you
> do need to have some higher-level locking when you  do this, to make
> sure that the persistent pages don't magically get re-assigned. We're
> presumably talking about having a filesystem in that persistent
> memory, so we cannot be doing IO to the pages (from some other source
> - whether RDMA or some special zero-copy model) while the underlying
> filesystem is reassigning the storage because somebody deleted the
> file.
>
> IOW, there had better be other external rules about when - and how
> long - you can use a particular persistent page. No? So the whole
> "when/how to allocate the temporary 'struct page'" is just another
> detail in that whole thing.
>
> And yes, some uses may not ever actually see that. If the whole of
> persistent memory is just assigned to a database or something, and the
> DB just wants to do a "flush this range of persistent memory to
> long-term disk storage", then there may not be much of a "lifetime"
> issue for the persistent memory. But even then you're going to have IO
> completion callbacks etc to let the DB know that it has hit the disk,
> so..
>
> What is the primary thing that is driving this need? Do we have a very
> concrete example?

My pet concrete example is covered by __pfn_t.  Referencing persistent
memory in an md/dm hierarchical storage configuration.  Setting aside
the thrash to get existing block users to do "bvec_set_page(page)"
instead of "bvec->page = page" the onus is on that md/dm
implementation and backing storage device driver to operate on
__pfn_t.  That use case is simple because there is no use of page
locking or refcounting in that path, just dma_map_page() and
kmap_atomic().  The more difficult use case is precisely what Al
picked up on, O_DIRECT and RDMA.  This patchset does nothing to
address those use cases outside of not needing a struct page when they
eventually craft a bio.

I know Matthew Wilcox has explored the idea of "get_user_sg()" and let
the scatterlist hold the reference count and locks, but I'll let him
speak to that.

I still see __pfn_t as generally useful for the simple in-kernel
stacked-block-i/o use case.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Linus Torvalds
On Wed, May 6, 2015 at 4:47 PM, Dan Williams  wrote:
>
> Conceptually better, but certainly more difficult to audit if the fake
> struct page is initialized in a subtle way that breaks when/if it
> leaks to some unwitting context.

Maybe. It could go either way, though. In particular, with the
"dynamically allocated struct page" approach, if somebody uses it past
the supposed lifetime of the use, things like poisoning the temporary
"struct page" could be fairly effective. You can't really poison the
pfn - it's just a number, and if somebody uses it later than you think
(and you have re-used that physical memory for something else), you'll
never ever know.

I'd *assume* that most users of the dynamic "struct page" allocation
have very clear lifetime rules. Those things would presumably normally
get looked-up by some extended version of "get_user_pages()", and
there's a clear use of the result, with no longer lifetime. Also, you
do need to have some higher-level locking when you  do this, to make
sure that the persistent pages don't magically get re-assigned. We're
presumably talking about having a filesystem in that persistent
memory, so we cannot be doing IO to the pages (from some other source
- whether RDMA or some special zero-copy model) while the underlying
filesystem is reassigning the storage because somebody deleted the
file.

IOW, there had better be other external rules about when - and how
long - you can use a particular persistent page. No? So the whole
"when/how to allocate the temporary 'struct page'" is just another
detail in that whole thing.

And yes, some uses may not ever actually see that. If the whole of
persistent memory is just assigned to a database or something, and the
DB just wants to do a "flush this range of persistent memory to
long-term disk storage", then there may not be much of a "lifetime"
issue for the persistent memory. But even then you're going to have IO
completion callbacks etc to let the DB know that it has hit the disk,
so..

What is the primary thing that is driving this need? Do we have a very
concrete example?

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Dan Williams
On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds
 wrote:
> On Wed, May 6, 2015 at 1:04 PM, Dan Williams  wrote:
>>
>> The motivation for this change is persistent memory and the desire to
>> use it not only via the pmem driver, but also as a memory target for I/O
>> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.
>
> I detest this approach.
>

Hmm, yes, I can't argue against "put the onus on odd behavior where it
belongs."...

> I'd much rather go exactly the other way around, and do the dynamic
> "struct page" instead.
>
> Add a flag to "struct page"

Ok, given I had already precluded 32-bit systems in this __pfn_t
approach we should have flag space for this on 64-bit.

> to mark it as a fake entry and teach
> "page_to_pfn()" to look up the actual pfn some way (that union tha
> contains "index" looks like a good target to also contain 'pfn', for
> example).
>
> Especially if this is mainly for persistent storage, we'll never have
> issues with worrying about writing it back under memory pressure, so
> allocating a "struct page" for these things shouldn't be a problem.
> There's likely only a few paths that actually generate IO for those
> things.
>
> In other words, I'd really like our basic infrastructure to be for the
> *normal* case, and the "struct page" is about so much more than just
> "what's the target for IO". For normal IO, "struct page" is also what
> serializes the IO so that you have a consistent view of the end
> result, and there's obviously the reference count there too. So I
> really *really* think that "struct page" is the better entity for
> describing the actual IO, because it's the common and the generic
> thing, while a "pfn" is not actually *enough* for IO in general, and
> you now end up having to look up the "struct page" for the locking and
> refcounting etc.
>
> If you go the other way, and instead generate a "struct page" from the
> pfn for the few cases that need it, you put the onus on odd behavior
> where it belongs.
>
> Yes, it might not be any simpler in the end, but I think it would be
> conceptually much better.

Conceptually better, but certainly more difficult to audit if the fake
struct page is initialized in a subtle way that breaks when/if it
leaks to some unwitting context.  The one benefit I may need to
concede is a mechanism to opt-in to handle these fake pages to the few
paths that know what they are doing.  That was easy with __pfn_t, but
a struct page can go silently almost anywhere.  Certainly nothing is
prepared a for a given struct page pointer to change the pfn it points
to on the fly, which I think is what we would end up doing for
something like a raid cache.  Keep a pool of struct pages around and
point them at persistent memory pfns while I/O is in flight.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Linus Torvalds
On Wed, May 6, 2015 at 1:04 PM, Dan Williams  wrote:
>
> The motivation for this change is persistent memory and the desire to
> use it not only via the pmem driver, but also as a memory target for I/O
> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.

I detest this approach.

I'd much rather go exactly the other way around, and do the dynamic
"struct page" instead.

Add a flag to "struct page" to mark it as a fake entry and teach
"page_to_pfn()" to look up the actual pfn some way (that union tha
contains "index" looks like a good target to also contain 'pfn', for
example).

Especially if this is mainly for persistent storage, we'll never have
issues with worrying about writing it back under memory pressure, so
allocating a "struct page" for these things shouldn't be a problem.
There's likely only a few paths that actually generate IO for those
things.

In other words, I'd really like our basic infrastructure to be for the
*normal* case, and the "struct page" is about so much more than just
"what's the target for IO". For normal IO, "struct page" is also what
serializes the IO so that you have a consistent view of the end
result, and there's obviously the reference count there too. So I
really *really* think that "struct page" is the better entity for
describing the actual IO, because it's the common and the generic
thing, while a "pfn" is not actually *enough* for IO in general, and
you now end up having to look up the "struct page" for the locking and
refcounting etc.

If you go the other way, and instead generate a "struct page" from the
pfn for the few cases that need it, you put the onus on odd behavior
where it belongs.

Yes, it might not be any simpler in the end, but I think it would be
conceptually much better.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Al Viro
On Wed, May 06, 2015 at 04:04:53PM -0400, Dan Williams wrote:
> Changes since v1 [1]:
> 
> 1/ added include/asm-generic/pfn.h for the __pfn_t definition and helpers.
> 
> 2/ added kmap_atomic_pfn_t()
> 
> 3/ rebased on v4.1-rc2
> 
> [1]: http://marc.info/?l=linux-kernel=142653770511970=2
> 
> ---
> 
> A lead in note, this looks scarier than it is.  Most of the code thrash
> is automated via Coccinelle.  Also the subtle differences behind an
> 'unsigned long pfn' and a '__pfn_t' are mitigated by type-safety and a
> Kconfig option (default disabled CONFIG_PMEM_IO) that globally controls
> whether a pfn and a __pfn_t are equivalent.
> 
> The motivation for this change is persistent memory and the desire to
> use it not only via the pmem driver, but also as a memory target for I/O
> (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.  Aside
> from the pmem driver and DAX, persistent memory is not able to be used
> in these I/O scenarios due to the lack of a backing struct page, i.e.
> persistent memory is not part of the memmap.  This patchset takes the
> position that the solution is to teach I/O paths that want to operate on
> persistent memory to do so by referencing a __pfn_t.  The alternatives
> are discussed in the changelog for "[PATCH v2 01/10] arch: introduce
> __pfn_t for persistent memory i/o", copied here:
> 
> Alternatives:
> 
> 1/ Provide struct page coverage for persistent memory in
>DRAM.  The expectation is that persistent memory capacities make
>this untenable in the long term.
> 
> 2/ Provide struct page coverage for persistent memory with
>persistent memory.  While persistent memory may have near DRAM
>performance characteristics it may not have the same
>write-endurance of DRAM.  Given the update frequency of struct
>page objects it may not be suitable for persistent memory.
> 
> 3/ Dynamically allocate struct page.  This appears to be on
>the order of the complexity of converting code paths to use
>__pfn_t references instead of struct page, and the amount of
>setup required to establish a valid struct page reference is
>mostly wasted when the only usage in the block stack is to
>perform a page_to_pfn() conversion for dma-mapping.  Instances
>of kmap() / kmap_atomic() usage appear to be the only occasions
>in the block stack where struct page is non-trivially used.  A
>new kmap_atomic_pfn_t() is proposed to handle those cases.

*grumble*

What are you going to do with things like iov_iter_get_pages()?  Long-term,
that is, after you go for "this pfn has no struct page for it"...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Linus Torvalds
On Wed, May 6, 2015 at 1:04 PM, Dan Williams dan.j.willi...@intel.com wrote:

 The motivation for this change is persistent memory and the desire to
 use it not only via the pmem driver, but also as a memory target for I/O
 (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.

I detest this approach.

I'd much rather go exactly the other way around, and do the dynamic
struct page instead.

Add a flag to struct page to mark it as a fake entry and teach
page_to_pfn() to look up the actual pfn some way (that union tha
contains index looks like a good target to also contain 'pfn', for
example).

Especially if this is mainly for persistent storage, we'll never have
issues with worrying about writing it back under memory pressure, so
allocating a struct page for these things shouldn't be a problem.
There's likely only a few paths that actually generate IO for those
things.

In other words, I'd really like our basic infrastructure to be for the
*normal* case, and the struct page is about so much more than just
what's the target for IO. For normal IO, struct page is also what
serializes the IO so that you have a consistent view of the end
result, and there's obviously the reference count there too. So I
really *really* think that struct page is the better entity for
describing the actual IO, because it's the common and the generic
thing, while a pfn is not actually *enough* for IO in general, and
you now end up having to look up the struct page for the locking and
refcounting etc.

If you go the other way, and instead generate a struct page from the
pfn for the few cases that need it, you put the onus on odd behavior
where it belongs.

Yes, it might not be any simpler in the end, but I think it would be
conceptually much better.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Dan Williams
On Wed, May 6, 2015 at 3:10 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, May 6, 2015 at 1:04 PM, Dan Williams dan.j.willi...@intel.com wrote:

 The motivation for this change is persistent memory and the desire to
 use it not only via the pmem driver, but also as a memory target for I/O
 (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.

 I detest this approach.


Hmm, yes, I can't argue against put the onus on odd behavior where it
belongs

 I'd much rather go exactly the other way around, and do the dynamic
 struct page instead.

 Add a flag to struct page

Ok, given I had already precluded 32-bit systems in this __pfn_t
approach we should have flag space for this on 64-bit.

 to mark it as a fake entry and teach
 page_to_pfn() to look up the actual pfn some way (that union tha
 contains index looks like a good target to also contain 'pfn', for
 example).

 Especially if this is mainly for persistent storage, we'll never have
 issues with worrying about writing it back under memory pressure, so
 allocating a struct page for these things shouldn't be a problem.
 There's likely only a few paths that actually generate IO for those
 things.

 In other words, I'd really like our basic infrastructure to be for the
 *normal* case, and the struct page is about so much more than just
 what's the target for IO. For normal IO, struct page is also what
 serializes the IO so that you have a consistent view of the end
 result, and there's obviously the reference count there too. So I
 really *really* think that struct page is the better entity for
 describing the actual IO, because it's the common and the generic
 thing, while a pfn is not actually *enough* for IO in general, and
 you now end up having to look up the struct page for the locking and
 refcounting etc.

 If you go the other way, and instead generate a struct page from the
 pfn for the few cases that need it, you put the onus on odd behavior
 where it belongs.

 Yes, it might not be any simpler in the end, but I think it would be
 conceptually much better.

Conceptually better, but certainly more difficult to audit if the fake
struct page is initialized in a subtle way that breaks when/if it
leaks to some unwitting context.  The one benefit I may need to
concede is a mechanism to opt-in to handle these fake pages to the few
paths that know what they are doing.  That was easy with __pfn_t, but
a struct page can go silently almost anywhere.  Certainly nothing is
prepared a for a given struct page pointer to change the pfn it points
to on the fly, which I think is what we would end up doing for
something like a raid cache.  Keep a pool of struct pages around and
point them at persistent memory pfns while I/O is in flight.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Al Viro
On Wed, May 06, 2015 at 04:04:53PM -0400, Dan Williams wrote:
 Changes since v1 [1]:
 
 1/ added include/asm-generic/pfn.h for the __pfn_t definition and helpers.
 
 2/ added kmap_atomic_pfn_t()
 
 3/ rebased on v4.1-rc2
 
 [1]: http://marc.info/?l=linux-kernelm=142653770511970w=2
 
 ---
 
 A lead in note, this looks scarier than it is.  Most of the code thrash
 is automated via Coccinelle.  Also the subtle differences behind an
 'unsigned long pfn' and a '__pfn_t' are mitigated by type-safety and a
 Kconfig option (default disabled CONFIG_PMEM_IO) that globally controls
 whether a pfn and a __pfn_t are equivalent.
 
 The motivation for this change is persistent memory and the desire to
 use it not only via the pmem driver, but also as a memory target for I/O
 (DAX, O_DIRECT, DMA, RDMA, etc) in other parts of the kernel.  Aside
 from the pmem driver and DAX, persistent memory is not able to be used
 in these I/O scenarios due to the lack of a backing struct page, i.e.
 persistent memory is not part of the memmap.  This patchset takes the
 position that the solution is to teach I/O paths that want to operate on
 persistent memory to do so by referencing a __pfn_t.  The alternatives
 are discussed in the changelog for [PATCH v2 01/10] arch: introduce
 __pfn_t for persistent memory i/o, copied here:
 
 Alternatives:
 
 1/ Provide struct page coverage for persistent memory in
DRAM.  The expectation is that persistent memory capacities make
this untenable in the long term.
 
 2/ Provide struct page coverage for persistent memory with
persistent memory.  While persistent memory may have near DRAM
performance characteristics it may not have the same
write-endurance of DRAM.  Given the update frequency of struct
page objects it may not be suitable for persistent memory.
 
 3/ Dynamically allocate struct page.  This appears to be on
the order of the complexity of converting code paths to use
__pfn_t references instead of struct page, and the amount of
setup required to establish a valid struct page reference is
mostly wasted when the only usage in the block stack is to
perform a page_to_pfn() conversion for dma-mapping.  Instances
of kmap() / kmap_atomic() usage appear to be the only occasions
in the block stack where struct page is non-trivially used.  A
new kmap_atomic_pfn_t() is proposed to handle those cases.

*grumble*

What are you going to do with things like iov_iter_get_pages()?  Long-term,
that is, after you go for this pfn has no struct page for it...
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Dan Williams
On Wed, May 6, 2015 at 5:19 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, May 6, 2015 at 4:47 PM, Dan Williams dan.j.willi...@intel.com wrote:

 Conceptually better, but certainly more difficult to audit if the fake
 struct page is initialized in a subtle way that breaks when/if it
 leaks to some unwitting context.

 Maybe. It could go either way, though. In particular, with the
 dynamically allocated struct page approach, if somebody uses it past
 the supposed lifetime of the use, things like poisoning the temporary
 struct page could be fairly effective. You can't really poison the
 pfn - it's just a number, and if somebody uses it later than you think
 (and you have re-used that physical memory for something else), you'll
 never ever know.

True, but there's little need to poison a _pfn_t because it's
permanent once discovered via -direct_access() on the hosting struct
block_device.  Sure, kmap_atomic_pfn_t() may fail when the pmem driver
unbinds from a device, but the __pfn_t is still valid.  Obviously, we
can only support atomic kmap(s) with this property, and it would be
nice to fault if someone continued to use the __pfn_t after the
hosting device was disabled.  To be clear, DAX has this same problem
today.  Nothing stops whomever called -direct_access() to continue
using the pfn after the backing device has been disabled.

 I'd *assume* that most users of the dynamic struct page allocation
 have very clear lifetime rules. Those things would presumably normally
 get looked-up by some extended version of get_user_pages(), and
 there's a clear use of the result, with no longer lifetime. Also, you
 do need to have some higher-level locking when you  do this, to make
 sure that the persistent pages don't magically get re-assigned. We're
 presumably talking about having a filesystem in that persistent
 memory, so we cannot be doing IO to the pages (from some other source
 - whether RDMA or some special zero-copy model) while the underlying
 filesystem is reassigning the storage because somebody deleted the
 file.

 IOW, there had better be other external rules about when - and how
 long - you can use a particular persistent page. No? So the whole
 when/how to allocate the temporary 'struct page' is just another
 detail in that whole thing.

 And yes, some uses may not ever actually see that. If the whole of
 persistent memory is just assigned to a database or something, and the
 DB just wants to do a flush this range of persistent memory to
 long-term disk storage, then there may not be much of a lifetime
 issue for the persistent memory. But even then you're going to have IO
 completion callbacks etc to let the DB know that it has hit the disk,
 so..

 What is the primary thing that is driving this need? Do we have a very
 concrete example?

My pet concrete example is covered by __pfn_t.  Referencing persistent
memory in an md/dm hierarchical storage configuration.  Setting aside
the thrash to get existing block users to do bvec_set_page(page)
instead of bvec-page = page the onus is on that md/dm
implementation and backing storage device driver to operate on
__pfn_t.  That use case is simple because there is no use of page
locking or refcounting in that path, just dma_map_page() and
kmap_atomic().  The more difficult use case is precisely what Al
picked up on, O_DIRECT and RDMA.  This patchset does nothing to
address those use cases outside of not needing a struct page when they
eventually craft a bio.

I know Matthew Wilcox has explored the idea of get_user_sg() and let
the scatterlist hold the reference count and locks, but I'll let him
speak to that.

I still see __pfn_t as generally useful for the simple in-kernel
stacked-block-i/o use case.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 00/10] evacuate struct page from the block layer, introduce __pfn_t

2015-05-06 Thread Linus Torvalds
On Wed, May 6, 2015 at 4:47 PM, Dan Williams dan.j.willi...@intel.com wrote:

 Conceptually better, but certainly more difficult to audit if the fake
 struct page is initialized in a subtle way that breaks when/if it
 leaks to some unwitting context.

Maybe. It could go either way, though. In particular, with the
dynamically allocated struct page approach, if somebody uses it past
the supposed lifetime of the use, things like poisoning the temporary
struct page could be fairly effective. You can't really poison the
pfn - it's just a number, and if somebody uses it later than you think
(and you have re-used that physical memory for something else), you'll
never ever know.

I'd *assume* that most users of the dynamic struct page allocation
have very clear lifetime rules. Those things would presumably normally
get looked-up by some extended version of get_user_pages(), and
there's a clear use of the result, with no longer lifetime. Also, you
do need to have some higher-level locking when you  do this, to make
sure that the persistent pages don't magically get re-assigned. We're
presumably talking about having a filesystem in that persistent
memory, so we cannot be doing IO to the pages (from some other source
- whether RDMA or some special zero-copy model) while the underlying
filesystem is reassigning the storage because somebody deleted the
file.

IOW, there had better be other external rules about when - and how
long - you can use a particular persistent page. No? So the whole
when/how to allocate the temporary 'struct page' is just another
detail in that whole thing.

And yes, some uses may not ever actually see that. If the whole of
persistent memory is just assigned to a database or something, and the
DB just wants to do a flush this range of persistent memory to
long-term disk storage, then there may not be much of a lifetime
issue for the persistent memory. But even then you're going to have IO
completion callbacks etc to let the DB know that it has hit the disk,
so..

What is the primary thing that is driving this need? Do we have a very
concrete example?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/