Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory

2015-05-29 Thread Dan Williams
On Fri, May 29, 2015 at 5:07 AM, Ross Zwisler <zwis...@gmail.com> wrote:
> On Thu, 2015-05-28 at 16:20 -0700, H. Peter Anvin wrote:
>> On 05/28/2015 03:35 PM, Ross Zwisler wrote:
>> > Add a new PMEM API to x86, and allow for architectures that do not
>> > implement this API.  Architectures that implement the PMEM API
>> > should
>> > define ARCH_HAS_PMEM_API in their kernel configuration and must
>> > provide
>> > implementations for persistent_copy(), persistent_flush() and
>> > persistent_sync().
>>
>> >
>> >  void clflush_cache_range(void *addr, unsigned int size);
>> >
>>
>> No, no, no, no, no.  Include the proper header file.
>
> I'm confused - I did include <asm/cacheflush.h> in pmem.h?  The line
> you're quoting above was an unmodified line from asm/cacheflush.h - I
> didn't redefine the prototype for clflush_cache_range() anywhere.
>
> Or does this comment mean that you think we shouldn't have an
> architecture agnostic PMEM API, and that you think the PMEM and ND_BLK
> drivers should just directly include asm/cacheflush.h and use the x86
> APIs directly?
>
>> > +static inline void arch_persistent_flush(void *vaddr, size_t size)
>> > +{
>> > +   clflush_cache_range(vaddr, size);
>> > +}
>>
>> Shouldn't this really be using clwb() -- we really need a
>> clwb_cache_range() I guess?
>
> I think we will need a clwb_cache_range() for DAX, for when it responds
> to a msync() or fsync() call and needs to rip through a bunch of
> memory, writing it back to the DIMMs.  I just didn't add it yet because
> I didn't have a consumer.
>
> It turns out that for the block aperture I/O case we really do need a
> flush instead of a writeback, though, so clflush_cache_range() is
> perfect.  Here's the flow, which is a read from a block window
> aperture:
>
> 1) The nd_blk driver gets a read request, and selects a block window to
> use.  It's entirely possible that this block window's aperture has
> clean cache lines associated with it in the processor cache hierarchy.
>  It shouldn't be possible that it has dirty cache lines - we either
> just did a read, or we did a write and would have used NT stores.
>
> 2) Write a new address into the block window control register.  The
> memory backing the aperture moves to the new address.  Any clean lines
> held in the processor cache are now out of sync.
>
> 3) Flush the cache lines associated with the aperture.  The lines are
> guaranteed to be clean, so the flush will just discard them and no
> writebacks will occur.
>
> 4) Read the contents of the block aperture, servicing the read.
>
> This is the read flow outlined in the "driver writer's guide":
>
> http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
>
>> Incidentally, clflush_cache_range() seems to have the same flaw as
>> the
>> proposed use case for clwb() had... if the buffer is aligned it will
>> needlessly flush the last line twice.  It should really look
>> something
>> like this (which would be a good standalone patch):
>>
>> void clflush_cache_range(void *vaddr, unsigned int size)
>> {
>> void *vend = vaddr + size - 1;
>>
>> mb();
>>
>>   vaddr = (void *)
>>   ((unsigned long)vaddr
>>& ~(boot_cpu_data.x86_clflush_size - 1));
>>
>> for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
>> clflushopt(vaddr);
>>
>> mb();
>> }
>> EXPORT_SYMBOL_GPL(clflush_cache_range);
>
> Ah, yep, I saw the same thing and already submitted patches to fix.  I
> think this change should be in the TIP tree:
>
> https://lkml.org/lkml/2015/5/11/336
>
>
>> I also note that with your implementation we have a wmb() in
>> arch_persistent_sync() and an mb() in arch_persistent_flush()...
>> surely one is redundant?
>
> Actually, the way that we need to use arch_persistent_flush() for our
> block window read case, the fencing works out so that nothing is
> redundant.  We never actually use both a persistent_sync() call and a
> persistent_flush() call during the same I/O.  Reads use
> persistent_flush() to invalidate obsolete lines in the cache before
> reading real data from the aperture of the DIMM, and writes use a bunch
> of NT stores followed by a persistent_sync().
>
> The PMEM driver doesn't use persistent_flush() at all - this API is
> only needed for the block window read case.

Then that's not a "persistence flush", that's a shootdown of the
previous mmio block window setting.  If it's only for BLK reads I
think we need to use clflush_cache_range() directly.  Given that BLK
mode already depends on ACPI I think it's fine for now to make BLK
mode depend on x86.  Otherwise, we need a new cross-arch generic cache
flush primitive like io_flush_cache_range() and have BLK mode depend
on ARCH_HAS_IO_FLUSH_CACHE_RANGE.
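
For concreteness, that second option might look something like this (a
sketch only; the helper name and Kconfig symbol come from the suggestion
above, everything else is illustrative, not a real patch):

/* arch/x86/Kconfig */
config ARCH_HAS_IO_FLUSH_CACHE_RANGE
	def_bool y

/* arch/x86/include/asm/cacheflush.h */
static inline void io_flush_cache_range(void *vaddr, size_t size)
{
	clflush_cache_range(vaddr, size);
}

/* drivers/block/nd/Kconfig: BLK mode then depends on the new symbol */
config ND_BLK
	depends on ARCH_HAS_IO_FLUSH_CACHE_RANGE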


Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory

2015-05-29 Thread Ross Zwisler
On Thu, 2015-05-28 at 21:19 -0700, H. Peter Anvin wrote:
> On 05/28/2015 05:02 PM, Dan Williams wrote:
> > 
> > Hmm, yes, but I believe Ross (on vacation now) was following the
> > precedent set by commit cd8ddf1a2800 "x86: clflush_page_range needs
> > mfence" whereby the api handles all necessary fencing internally.
> > Shall we introduce something like __unordered_clflush_cache_range()
> > for arch_persistent_flush() to use with the understanding it will 
> > be
> > following up with the wmb() in arch_persistent_sync()?
> > 
> 
> Are we ever going to have arch_persistent_sync() without
> arch_persistent_flush()?
> 
> However, thinking about it, it would be more efficient to do all 
> flushes
> first and then have a single barrier.

Yep, we have arch_persistent_sync() without arch_persistent_flush() in
both our PMEM and ND_BLK write paths.  These use arch_persistent_copy() to get 
NT stores, so they don't need to manually flush/write-back 
before doing a persistent_sync().
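
Schematically, both write paths boil down to this (pmem_do_write() is an
illustrative name, not the actual driver code):

static void pmem_do_write(void *pmem_dst, const void *src, size_t len)
{
	/* NT stores bypass the cache, so no flush is needed... */
	persistent_copy(pmem_dst, src, len);
	/* ...and one sync (wmb() + pcommit_sfence() on x86) orders it all */
	persistent_sync();
}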


Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory

2015-05-29 Thread Ross Zwisler
On Thu, 2015-05-28 at 16:20 -0700, H. Peter Anvin wrote:
> On 05/28/2015 03:35 PM, Ross Zwisler wrote:
> > Add a new PMEM API to x86, and allow for architectures that do not
> > implement this API.  Architectures that implement the PMEM API 
> > should
> > define ARCH_HAS_PMEM_API in their kernel configuration and must 
> > provide
> > implementations for persistent_copy(), persistent_flush() and
> > persistent_sync().
> 
> >  
> >  void clflush_cache_range(void *addr, unsigned int size);
> >  
> 
> No, no, no, no, no.  Include the proper header file.

I'm confused - I did include <asm/cacheflush.h> in pmem.h?  The line
you're quoting above was an unmodified line from asm/cacheflush.h - I
didn't redefine the prototype for clflush_cache_range() anywhere.

Or does this comment mean that you think we shouldn't have an
architecture agnostic PMEM API, and that you think the PMEM and ND_BLK
drivers should just directly include asm/cacheflush.h and use the x86
APIs directly?

> > +static inline void arch_persistent_flush(void *vaddr, size_t size)
> > +{
> > +   clflush_cache_range(vaddr, size);
> > +}
> 
> Shouldn't this really be using clwb() -- we really need a
> clwb_cache_range() I guess?

I think we will need a clwb_cache_range() for DAX, for when it responds
to a msync() or fsync() call and needs to rip through a bunch of
memory, writing it back to the DIMMs.  I just didn't add it yet because
I didn't have a consumer.
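
When that consumer shows up, I'd expect clwb_cache_range() to be the same
loop with clwb() substituted in - a sketch, assuming a clwb() wrapper
analogous to clflushopt():

void clwb_cache_range(void *vaddr, unsigned int size)
{
	void *vend = vaddr + size - 1;

	mb();

	vaddr = (void *)
		((unsigned long)vaddr
		 & ~(boot_cpu_data.x86_clflush_size - 1));

	/* clwb writes back dirty lines but can leave them cached */
	for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
		clwb(vaddr);

	mb();
}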

It turns out that for the block aperture I/O case we really do need a
flush instead of a writeback, though, so clflush_cache_range() is
perfect.  Here's the flow, which is a read from a block window
aperture:

1) The nd_blk driver gets a read request, and selects a block window to
use.  It's entirely possible that this block window's aperture has
clean cache lines associated with it in the processor cache hierarchy. 
 It shouldn't be possible that it has dirty cache lines - we either
just did a read, or we did a write and would have used NT stores.

2) Write a new address into the block window control register.  The
memory backing the aperture moves to the new address.  Any clean lines
held in the processor cache are now out of sync.

3) Flush the cache lines associated with the aperture.  The lines are
guaranteed to be clean, so the flush will just discard them and no
writebacks will occur.

4) Read the contents of the block aperture, servicing the read.

This is the read flow outlined in the "driver writer's guide":

http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf  
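
Condensed into pseudocode, the read flow is roughly this (all of the names
here are illustrative, not the actual nd_blk interface):

static void nd_blk_read(struct nd_blk_window *bw, void *dst,
			resource_size_t dev_addr, size_t len)
{
	/* 2) retarget the aperture; clean lines for it are now stale */
	writeq(dev_addr, bw->ctl_reg);

	/* 3) flush discards the stale lines - they are guaranteed
	 * clean, so no writebacks occur */
	clflush_cache_range(bw->aperture, len);

	/* 4) service the read out of the freshly mapped aperture */
	memcpy(dst, bw->aperture, len);
}

Step 1 (selecting a block window) happens before this is called.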

> Incidentally, clflush_cache_range() seems to have the same flaw as 
> the
> proposed use case for clwb() had... if the buffer is aligned it will
> needlessly flush the last line twice.  It should really look 
> something
> like this (which would be a good standalone patch):
> 
> void clflush_cache_range(void *vaddr, unsigned int size)
> {
> void *vend = vaddr + size - 1;
> 
> mb();
> 
>   vaddr = (void *)
>   ((unsigned long)vaddr
>& ~(boot_cpu_data.x86_clflush_size - 1));
> 
> for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
> clflushopt(vaddr);
> 
> mb();
> }
> EXPORT_SYMBOL_GPL(clflush_cache_range);

Ah, yep, I saw the same thing and already submitted patches to fix.  I
think this change should be in the TIP tree:

https://lkml.org/lkml/2015/5/11/336


> I also note that with your implementation we have a wmb() in
> arch_persistent_sync() and an mb() in arch_persistent_flush()... 
> surely one is redundant?

Actually, the way that we need to use arch_persistent_flush() for our
block window read case, the fencing works out so that nothing is
redundant.  We never actually use both a persistent_sync() call and a
persistent_flush() call during the same I/O.  Reads use
persistent_flush() to invalidate obsolete lines in the cache before
reading real data from the aperture of the DIMM, and writes use a bunch
of NT stores followed by a persistent_sync().

The PMEM driver doesn't use persistent_flush() at all - this API is
only needed for the block window read case.


Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory

2015-05-28 Thread H. Peter Anvin
On 05/28/2015 05:02 PM, Dan Williams wrote:
> 
> Hmm, yes, but I believe Ross (on vacation now) was following the
> precedent set by commit cd8ddf1a2800 "x86: clflush_page_range needs
> mfence" whereby the api handles all necessary fencing internally.
> Shall we introduce something like __unordered_clflush_cache_range()
> for arch_persistent_flush() to use with the understanding it will be
> following up with the wmb() in arch_persistent_sync()?
> 

Are we ever going to have arch_persistent_sync() without
arch_persistent_flush()?

However, thinking about it, it would be more efficient to do all flushes
first and then have a single barrier.
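
I.e., roughly this shape (a sketch, assuming an unordered flush primitive
like the __unordered_clflush_cache_range() quoted above):

__unordered_clflush_cache_range(buf1, len1);
__unordered_clflush_cache_range(buf2, len2);
__unordered_clflush_cache_range(buf3, len3);
wmb();	/* one barrier orders all of the flushes */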

-hpa



Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory

2015-05-28 Thread Dan Williams
On Thu, May 28, 2015 at 4:20 PM, H. Peter Anvin <h...@zytor.com> wrote:
> On 05/28/2015 03:35 PM, Ross Zwisler wrote:
>> Add a new PMEM API to x86, and allow for architectures that do not
>> implement this API.  Architectures that implement the PMEM API should
>> define ARCH_HAS_PMEM_API in their kernel configuration and must provide
>> implementations for persistent_copy(), persistent_flush() and
>> persistent_sync().
>
>>
>>  void clflush_cache_range(void *addr, unsigned int size);
>>
>
> No, no, no, no, no.  Include the proper header file.
>
>> +static inline void arch_persistent_flush(void *vaddr, size_t size)
>> +{
>> + clflush_cache_range(vaddr, size);
>> +}
>
> Shouldn't this really be using clwb() -- we really need a
> clwb_cache_range() I guess?
>
> Incidentally, clflush_cache_range() seems to have the same flaw as the
> proposed use case for clwb() had... if the buffer is aligned it will
> needlessly flush the last line twice.  It should really look something
> like this (which would be a good standalone patch):
>
> void clflush_cache_range(void *vaddr, unsigned int size)
> {
> void *vend = vaddr + size - 1;
>
> mb();
>
> vaddr = (void *)
> ((unsigned long)vaddr
>  & ~(boot_cpu_data.x86_clflush_size - 1));
>
> for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
> clflushopt(vaddr);
>
> mb();
> }
> EXPORT_SYMBOL_GPL(clflush_cache_range);
>
> I also note that with your implementation we have a wmb() in
> arch_persistent_sync() and an mb() in arch_persistent_flush()... surely
> one is redundant?

Hmm, yes, but I believe Ross (on vacation now) was following the
precedent set by commit cd8ddf1a2800 "x86: clflush_page_range needs
mfence" whereby the api handles all necessary fencing internally.
Shall we introduce something like __unordered_clflush_cache_range()
for arch_persistent_flush() to use with the understanding it will be
following up with the wmb() in arch_persistent_sync()?
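
Something like this, say (a sketch only - just the flush loop with the
fences stripped, for the fenced helpers to wrap):

static void __unordered_clflush_cache_range(void *vaddr, unsigned int size)
{
	void *vend = vaddr + size - 1;

	vaddr = (void *)((unsigned long)vaddr
			 & ~(boot_cpu_data.x86_clflush_size - 1));

	/* no fences here; callers provide their own ordering */
	for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
		clflushopt(vaddr);
}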


Re: [PATCH 3/6] x86, pmem: add PMEM API for persistent memory

2015-05-28 Thread H. Peter Anvin
On 05/28/2015 03:35 PM, Ross Zwisler wrote:
> Add a new PMEM API to x86, and allow for architectures that do not
> implement this API.  Architectures that implement the PMEM API should
> define ARCH_HAS_PMEM_API in their kernel configuration and must provide
> implementations for persistent_copy(), persistent_flush() and
> persistent_sync().

>  
>  void clflush_cache_range(void *addr, unsigned int size);
>  

No, no, no, no, no.  Include the proper header file.

> +static inline void arch_persistent_flush(void *vaddr, size_t size)
> +{
> + clflush_cache_range(vaddr, size);
> +}

Shouldn't this really be using clwb() -- we really need a
clwb_cache_range() I guess?

Incidentally, clflush_cache_range() seems to have the same flaw as the
proposed use case for clwb() had... if the buffer is aligned it will
needlessly flush the last line twice.  It should really look something
like this (which would be a good standalone patch):

void clflush_cache_range(void *vaddr, unsigned int size)
{
void *vend = vaddr + size - 1;

mb();

vaddr = (void *)
((unsigned long)vaddr
 & ~(boot_cpu_data.x86_clflush_size - 1));

for (; vaddr < vend; vaddr += boot_cpu_data.x86_clflush_size)
clflushopt(vaddr);

mb();
}
EXPORT_SYMBOL_GPL(clflush_cache_range);

I also note that with your implementation we have a wmb() in
arch_persistent_sync() and an mb() in arch_persistent_flush()... surely
one is redundant?

-hpa




[PATCH 3/6] x86, pmem: add PMEM API for persistent memory

2015-05-28 Thread Ross Zwisler
Add a new PMEM API to x86, and allow for architectures that do not
implement this API.  Architectures that implement the PMEM API should
define ARCH_HAS_PMEM_API in their kernel configuration and must provide
implementations for persistent_copy(), persistent_flush() and
persistent_sync().

Signed-off-by: Ross Zwisler <ross.zwis...@linux.intel.com>
Cc: Dan Williams <dan.j.willi...@intel.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: linux-nvd...@lists.01.org
---
 MAINTAINERS   |  1 +
 arch/x86/Kconfig  |  3 ++
 arch/x86/include/asm/cacheflush.h | 23 
 include/linux/pmem.h  | 79 +++
 4 files changed, 106 insertions(+)
 create mode 100644 include/linux/pmem.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 0448fec8e44a..ca1f3d99618d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5944,6 +5944,7 @@ L:linux-nvd...@lists.01.org
 Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
 S: Supported
 F: drivers/block/nd/pmem.c
+F: include/linux/pmem.h
 
 LINUX FOR IBM pSERIES (RS/6000)
 M: Paul Mackerras 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 23c587938804..eb8f12e715af 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -215,6 +215,9 @@ config ARCH_HAS_CPU_RELAX
 config ARCH_HAS_CACHE_LINE_SIZE
def_bool y
 
+config ARCH_HAS_PMEM_API
+   def_bool y
+
 config HAVE_SETUP_PER_CPU_AREA
def_bool y
 
diff --git a/arch/x86/include/asm/cacheflush.h 
b/arch/x86/include/asm/cacheflush.h
index 47c8e32f621a..ffd5ccdc86f0 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -4,6 +4,7 @@
 /* Caches aren't brain-dead on the intel. */
 #include <asm-generic/cacheflush.h>
 #include <asm/special_insns.h>
+#include <asm/uaccess.h>
 
 /*
  * The set_memory_* API can be used to change various attributes of a virtual
@@ -84,6 +85,28 @@ int set_pages_rw(struct page *page, int numpages);
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+static inline void arch_persistent_copy(void *dst, const void *src, size_t n)
+{
+   /*
+* We are copying between two kernel buffers, so it should be
+* impossible for us to hit this BUG_ON() because we should never need
+* to take a page fault.
+*/
+   BUG_ON(__copy_from_user_inatomic_nocache(dst,
+   (__user const void *)src, n));
+}
+
+static inline void arch_persistent_flush(void *vaddr, size_t size)
+{
+   clflush_cache_range(vaddr, size);
+}
+
+static inline void arch_persistent_sync(void)
+{
+   wmb();
+   pcommit_sfence();
+}
+
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
 extern const int rodata_test_data;
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
new file mode 100644
index ..88ade7376632
--- /dev/null
+++ b/include/linux/pmem.h
@@ -0,0 +1,79 @@
+/*
+ * Copyright(c) 2015 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#ifndef __PMEM_H__
+#define __PMEM_H__
+
+#include <asm/cacheflush.h>
+
+/*
+ * Architectures that define ARCH_HAS_PMEM_API must provide implementations
+ * for persistent_copy(), persistent_flush() and persistent_sync().
+ */
+
+#ifdef CONFIG_ARCH_HAS_PMEM_API
+/**
+ * persistent_copy - copy data to persistent memory
+ * @dst: destination buffer for the copy
+ * @src: source buffer for the copy
+ * @n: length of the copy in bytes
+ *
+ * Perform a memory copy that results in the destination of the copy being
+ * evicted from the processor cache hierarchy ("accepted to memory").  This
+ * can typically be accomplished with non-temporal stores or by regular stores
+ * followed by cache flush commands via persistent_flush().
+ */
+static inline void persistent_copy(void *dst, const void *src, size_t n)
+{
+   arch_persistent_copy(dst, src, n);
+}
+
+/**
+ * persistent_flush - flush a memory range from the processor cache
+ * @vaddr: virtual address to begin flushing
+ * @size: number of bytes to flush
+ *
+ * This call needs to include fencing so that the flushing will be ordered
+ * with respect to both reads and writes.
+ */
+static inline void persistent_flush(void *vaddr, size_t size)
+{
+   arch_persistent_flush(vaddr, size);
+}
+
+/**
+ * persistent_sync - synchronize writes to persistent memory
+ *
+ * To be used after a series of copies and/or flushes, this should perform any
+ * necessary fencing to order writes/flushes and then ensure that writes that
+ * have been "accepted to memory", i.e. previously flushed from the cache
+ * hierarchy, are actually written to the appropriate DIMMs.  This 
