Re: [dm-devel] [PATCH 2/2] dm-writecache

2018-02-13 Thread Dan Williams
On Tue, Feb 13, 2018 at 5:24 PM, Mikulas Patocka  wrote:
>
>
> On Tue, 13 Feb 2018, Dan Williams wrote:
>
>> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka  wrote:
>> >
>> >
>> > On Fri, 8 Dec 2017, Dan Williams wrote:
>> >
>> >> > > > when we write to
>> >> > > > persistent memory using cached write instructions and use dax_flush
>> >> > > > afterwards to flush cache for the affected range, the performance 
>> >> > > > is about
>> >> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
>> >> > > >
>> >> > > > On the other hand, the movnti instruction can sustain performance 
>> >> > > > of one
>> >> > > > 8-byte write per clock cycle. We don't have to flush cache 
>> >> > > > afterwards, the
>> >> > > > only thing that must be done is to flush the write-combining buffer 
>> >> > > > with
>> >> > > > the sfence instruction. Movnti has much better throughput than 
>> >> > > > dax_flush.
>> >> > >
>> >> > > What about memcpy_flushcache?
>> >> >
>> >> > but
>> >> >
>> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> >> > writes to the metadata area. Why not use movnti directly?
>> >> >
>> >>
>> >> The driver performs so many 8-byte moves that the cost of the
>> >> memcpy_flushcache() function call significantly eats into your
>> >> performance?
>> >
>> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
>> > lower throughput when it uses memcpy_flushcache() to update its metadata
>> > instead of explicitly coded "movnti" instructions.
>> >
>> > I've created this patch - it doesn't change API in any way, but it
>> > optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my
>> > driver mostly uses). With this patch, I can remove the explicit "asm"
>> > statements from my driver. Would you consider committing this patch to the
>> > kernel?
>> >
>> > Mikulas
>> >
>> >
>>
>> Yes, this looks good to me. You can send it to the x86 folks with my:
>>
>> Reviewed-by: Dan Williams 
>>
>> ...or let me know and I can chase it through the -tip tree. Either way
>> works for me.
>
> If you have access to some tree that will be merged, you can commit the
> patch there.

Sure, no worries, I'll take it from here.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2018-02-13 Thread Mikulas Patocka


On Tue, 13 Feb 2018, Dan Williams wrote:

> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka  wrote:
> >
> >
> > On Fri, 8 Dec 2017, Dan Williams wrote:
> >
> >> > > > when we write to
> >> > > > persistent memory using cached write instructions and use dax_flush
> >> > > > afterwards to flush cache for the affected range, the performance is 
> >> > > > about
> >> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
> >> > > >
> >> > > > On the other hand, the movnti instruction can sustain performance of 
> >> > > > one
> >> > > > 8-byte write per clock cycle. We don't have to flush cache 
> >> > > > afterwards, the
> >> > > > only thing that must be done is to flush the write-combining buffer 
> >> > > > with
> >> > > > the sfence instruction. Movnti has much better throughput than 
> >> > > > dax_flush.
> >> > >
> >> > > What about memcpy_flushcache?
> >> >
> >> > but
> >> >
> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> >> > writes to the metadata area. Why not use movnti directly?
> >> >
> >>
> >> The driver performs so many 8-byte moves that the cost of the
> >> memcpy_flushcache() function call significantly eats into your
> >> performance?
> >
> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
> > lower throughput when it uses memcpy_flushcache() to update its metadata
> > instead of explicitly coded "movnti" instructions.
> >
> > I've created this patch - it doesn't change API in any way, but it
> > optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my
> > driver mostly uses). With this patch, I can remove the explicit "asm"
> > statements from my driver. Would you consider committing this patch to the
> > kernel?
> >
> > Mikulas
> >
> >
> 
> Yes, this looks good to me. You can send it to the x86 folks with my:
> 
> Reviewed-by: Dan Williams 
> 
> ...or let me know and I can chase it through the -tip tree. Either way
> works for me.

If you have access to some tree that will be merged, you can commit the 
patch there.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2018-02-13 Thread Dan Williams
On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka  wrote:
>
>
> On Fri, 8 Dec 2017, Dan Williams wrote:
>
>> > > > when we write to
>> > > > persistent memory using cached write instructions and use dax_flush
>> > > > afterwards to flush cache for the affected range, the performance is 
>> > > > about
>> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
>> > > >
>> > > > On the other hand, the movnti instruction can sustain performance of 
>> > > > one
>> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, 
>> > > > the
>> > > > only thing that must be done is to flush the write-combining buffer 
>> > > > with
>> > > > the sfence instruction. Movnti has much better throughput than 
>> > > > dax_flush.
>> > >
>> > > What about memcpy_flushcache?
>> >
>> > but
>> >
>> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> > writes to the metadata area. Why not use movnti directly?
>> >
>>
>> The driver performs so many 8-byte moves that the cost of the
>> memcpy_flushcache() function call significantly eats into your
>> performance?
>
> I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2%
> lower throughput when it uses memcpy_flushcache() to update its metadata
> instead of explicitly coded "movnti" instructions.
>
> I've created this patch - it doesn't change API in any way, but it
> optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my
> driver mostly uses). With this patch, I can remove the explicit "asm"
> statements from my driver. Would you consider committing this patch to the
> kernel?
>
> Mikulas
>
>

Yes, this looks good to me. You can send it to the x86 folks with my:

Reviewed-by: Dan Williams 

...or let me know and I can chase it through the -tip tree. Either way
works for me.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2018-02-13 Thread Mikulas Patocka


On Fri, 8 Dec 2017, Dan Williams wrote:

> > > > when we write to
> > > > persistent memory using cached write instructions and use dax_flush
> > > > afterwards to flush cache for the affected range, the performance is 
> > > > about
> > > > 350MB/s. It is practically unusable - worse than low-end SSDs.
> > > >
> > > > On the other hand, the movnti instruction can sustain performance of one
> > > > 8-byte write per clock cycle. We don't have to flush cache afterwards, 
> > > > the
> > > > only thing that must be done is to flush the write-combining buffer with
> > > > the sfence instruction. Movnti has much better throughput than 
> > > > dax_flush.
> > >
> > > What about memcpy_flushcache?
> >
> > but
> >
> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> > writes to the metadata area. Why not use movnti directly?
> >
> 
> The driver performs so many 8-byte moves that the cost of the
> memcpy_flushcache() function call significantly eats into your
> performance?

I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2% 
lower throughput when it uses memcpy_flushcache() to update its metadata 
instead of explicitly coded "movnti" instructions.

I've created this patch - it doesn't change API in any way, but it 
optimizes memcpy_flushcache for 4, 8 and 16-byte writes (that is what my 
driver mostly uses). With this patch, I can remove the explicit "asm" 
statements from my driver. Would you consider committing this patch to the 
kernel?

Mikulas




x86: optimize memcpy_flushcache

I use memcpy_flushcache in my persistent memory driver for metadata
updates and it turns out that the overhead of memcpy_flushcache causes 2%
performance degradation compared to the "movnti" instruction explicitly coded
using inline assembler.

This patch recognizes memcpy_flushcache calls with constant short length
and turns them into inline assembler - so that I don't have to use inline
assembler in the driver.

Signed-off-by: Mikulas Patocka 

---
 arch/x86/include/asm/string_64.h |   20 +++++++++++++++++++-
 arch/x86/lib/usercopy_64.c       |    6 +++---
 2 files changed, 22 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h	2018-01-31 11:06:19.953577699 -0500
+++ linux-2.6/arch/x86/include/asm/string_64.h	2018-02-13 12:31:06.506810497 -0500
@@ -147,7 +147,25 @@ memcpy_mcsafe(void *dst, const void *src
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+		case 4:
+			asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+			return;
+		case 8:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			return;
+		case 16:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+			return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2018-01-31 11:06:19.988577678 -0500
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2018-02-13 11:56:40.249154414 -0500
@@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, c
return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
unsigned long dest = (unsigned long) _dst;
unsigned long source = (unsigned long) _src;
@@ -196,14 +196,14 @@ void memcpy_flushcache(void *_dst, const
clean_cache_range((void *) dest, size);
}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
size_t len)
 {
char *from = kmap_atomic(page);
 
-   memcpy_flushcache(to, from + offset, len);
+   __memcpy_flushcache(to, from + offset, len);
kunmap_atomic(from);
 }
 #endif
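
For illustration, here is a minimal sketch (not part of the patch) of the
kind of call site this optimization targets. The struct and field names are
hypothetical; the point is that the length is a compile-time constant, so
memcpy_flushcache() collapses to a single movntiq store with no out-of-line
function call:

#include <linux/string.h>
#include <linux/types.h>

/* Hypothetical on-pmem metadata entry, for illustration only. */
struct wc_meta_entry {
	u64 original_sector;
	u64 seq_count;
};

static void wc_update_seq_count(struct wc_meta_entry *e, u64 seq)
{
	/*
	 * sizeof(seq) == 8 is known at compile time, so with the patch
	 * above this call expands to one movntiq store.
	 */
	memcpy_flushcache(&e->seq_count, &seq, sizeof(seq));

	/* Ordering/persistence still requires a later wmb()/sfence. */
}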

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-12-22 Thread Dan Williams
On Fri, Dec 22, 2017 at 12:56 PM, Mikulas Patocka  wrote:
>
>
> On Fri, 8 Dec 2017, Dan Williams wrote:
>
>> > > What about memcpy_flushcache?
>> >
>> > but
>> >
>> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
>> > writes to the metadata area. Why not use movnti directly?
>> >
>>
>> The driver performs so many 8-byte moves that the cost of the
>> memcpy_flushcache() function call significantly eats into your
>> performance?
>>
>> > - on some architectures, memcpy_flushcache is just an alias for memcpy, so
>> > there will still be some arch-specific ifdefs
>>
>> ...but those should all be hidden by arch code, not in drivers.
>>
>> > - there is no architecture-neutral way to guarantee ordering between
>> > multiple memcpy_flushcache calls. On x86, we need wmb(), on Power we
>> > don't, on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh),
>> > memcpy_flushcache doesn't - I don't know what the implications of this
>> > difference are); on other architectures, wmb() is insufficient to guarantee
>> > ordering between multiple memcpy_flushcache calls.
>>
>> wmb() should always be sufficient to order stores on all architectures.
>
> No, it isn't. See this example:
>
> uint64_t var_a, var_b;
>
> void fn(void)
> {
> 	uint64_t val = 3;
> 	memcpy_flushcache(&var_a, &val, 8);
> 	wmb();
> 	val = 5;
> 	memcpy_flushcache(&var_b, &val, 8);
> }
>
> On x86-64, memcpy_flushcache is implemented using the movnti instruction
> (that writes the value to the write-combining buffer) and wmb() is
> compiled into the sfence instruction (that flushes the write-combining
> buffer) - so it's OK - it is guaranteed that the variable var_a is written
> to persistent memory before the variable var_b;
>
> However, on i686 (and most other architectures), memcpy_flushcache is just
> an alias for memcpy - see this code in include/linux/string.h:
> #ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
> static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
> {
> 	memcpy(dst, src, cnt);
> }
> #endif
>
> So, memcpy_flushcache writes the variable var_a to the cache, wmb()
> flushes the pipeline, but otherwise does nothing (because on i686 all
> writes to the cache are already ordered) and then memcpy_flushcache writes
> the variable var_b to the cache - however, both writes end up in cache and
> wmb() doesn't flush the cache. So wmb() doesn't provide any sort of
> guarantee that var_a is written to persistent memory before var_b. If the
> cache sector containing the variable var_b is more contended than the
> cache sector containing the variable var_a, the CPU will flush cache line
> containing var_b before var_a.

You're overlooking the fact that there is no persistent memory support
for i686. The pmem driver will warn you of this fact:

dev_warn(dev, "unable to guarantee persistence of writes\n");

Persistent memory is not supported on i686 because older cpus do not
guarantee that non-temporal stores always bypass the cache.

> We have the dax_flush() function that (unlike wmb) guarantees that the
> specified range is flushed from the cache and written to persistent
> memory. But dax_flush is very slow on x86-64.

dax_flush() does not make that guarantee on i686, and it's only slow
on architectures that were not built for persistent memory support
i.e. architectures prior to the arrival of the clwb and clflushopt
instructions.

> So, we have a situation where wmb() is fast but only guarantees ordering
> on x86-64, and dax_flush() guarantees ordering on all architectures but
> is slow on x86-64.
>
> So, my driver uses wmb() on x86-64 and dax_flush() on all the others.

...and I'm saying you're hard coding for a cpu that was not built to
support persistent memory efficiently in the first place.

> You argue that the driver shouldn't use any per-architecture #ifdefs, but
> there is no other way to do it - using dax_flush() on x86-64 kills
> performance and using wmb() on all architectures is unreliable because
> wmb() doesn't guarantee cache flushing at all.
>
>> > At least for Broadwell, the write-back memory type on persistent memory
>> > has such horrible performance that it is not really usable.
>>
>> ...and my concern is that you're designing a pmem driver mechanism for
>> Broadwell that predates the clwb instruction for efficient pmem
>> access.
>
> That's what we have for testing. If I get access to some Skylake server, I
> can test if using write combining is faster than cache flushing.

I'm still missing where memcpy_flushcache is unsuitable for your use
case given that i686 does not support reliable persistent memory
access.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-12-22 Thread Mikulas Patocka


On Fri, 8 Dec 2017, Dan Williams wrote:

> > > What about memcpy_flushcache?
> >
> > but
> >
> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> > writes to the metadata area. Why not use movnti directly?
> >
> 
> The driver performs so many 8-byte moves that the cost of the
> memcpy_flushcache() function call significantly eats into your
> performance?
>
> > - on some architectures, memcpy_flushcache is just an alias for memcpy, so
> > there will still be some arch-specific ifdefs
> 
> ...but those should all be hidden by arch code, not in drivers.
> 
> > - there is no architecture-neutral way to guarantee ordering between
> > multiple memcpy_flushcache calls. On x86, we need wmb(), on Power we
> > don't, on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh),
> > memcpy_flushcache doesn't - I don't know what the implications of this
> > difference are); on other architectures, wmb() is insufficient to guarantee
> > ordering between multiple memcpy_flushcache calls.
> 
> wmb() should always be sufficient to order stores on all architectures.

No, it isn't. See this example:

uint64_t var_a, var_b;

void fn(void)
{
	uint64_t val = 3;
	memcpy_flushcache(&var_a, &val, 8);
	wmb();
	val = 5;
	memcpy_flushcache(&var_b, &val, 8);
}

On x86-64, memcpy_flushcache is implemented using the movnti instruction 
(that writes the value to the write-combining buffer) and wmb() is 
compiled into the sfence instruction (that flushes the write-combining 
buffer) - so it's OK - it is guaranteed that the variable var_a is written 
to persistent memory before the variable var_b;

However, on i686 (and most other architectures), memcpy_flushcache is just 
an alias for memcpy - see this code in include/linux/string.h:
#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
}
#endif

So, memcpy_flushcache writes the variable var_a to the cache, wmb() 
flushes the pipeline, but otherwise does nothing (because on i686 all 
writes to the cache are already ordered) and then memcpy_flushcache writes 
the variable var_b to the cache - however, both writes end up in cache and 
wmb() doesn't flush the cache. So wmb() doesn't provide any sort of 
guarantee that var_a is written to persistent memory before var_b. If the 
cache sector containing the variable var_b is more contended than the 
cache sector containing the variable var_a, the CPU will flush cache line 
containing var_b before var_a.

We have the dax_flush() function that (unlike wmb) guarantees that the 
specified range is flushed from the cache and written to persistent 
memory. But dax_flush is very slow on x86-64.

So, we have a situation where wmb() is fast but only guarantees ordering 
on x86-64, and dax_flush() guarantees ordering on all architectures but 
is slow on x86-64.

So, my driver uses wmb() on x86-64 and dax_flush() on all the others.

You argue that the driver shouldn't use any per-architecture #ifdefs, but 
there is no other way to do it - using dax_flush() on x86-64 kills 
performance and using wmb() on all architectures is unreliable because 
wmb() doesn't guarantee cache flushing at all.

> > At least for Broadwell, the write-back memory type on persistent memory
> > has such horrible performance that it is not really usable.
> 
> ...and my concern is that you're designing a pmem driver mechanism for
> Broadwell that predates the clwb instruction for efficient pmem
> access.

That's what we have for testing. If I get access to some Skylake server, I 
can test if using write combining is faster than cache flushing.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-12-08 Thread Dan Williams
On Mon, Dec 4, 2017 at 9:40 PM, Mikulas Patocka  wrote:
> On Mon, 27 Nov 2017, Dan Williams wrote:
[..]
> > > So, according to the document, flushing the cache should be enough for
> > > writes to reach persistent memory.
> >
> > The document assumes ADR is present.
>
> Could there be a case where persistent memory is present, but ADR is not?

No, that's why it was safe to deprecate the pcommit instruction:
software can assume that once writes reach the memory controller they
are persistent, and there is no requirement to flush them through the
memory controller to media, which is what the pcommit instruction
performed.

> > > > > +#ifdef CONFIG_X86_64
> > > >
> > > > In general something is broken if we end up with per-arch ifdefs like
> > > > this in drivers to handle persistent memory. This should be using the
> > > > pmem api or extensions of it, and we need to settle on a mechanism for
> > > > upper-level drivers to ask if pmem is driving platform protected
> > > > persistent memory.
> > > >
> > > > > +#define NT_STORE(dest, src)  asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > > > > +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
> > > > > +#define COMMIT_FLUSHED()   wmb()
> > > > > +#else
> > > > > +#define NT_STORE(dest, src)  ACCESS_ONCE(dest) = (src)
> > > > > +#define FLUSH_RANGE  dax_flush
> > > > > +#define COMMIT_FLUSHED()   do { } while (0)
> > > >
> > > > Is this just for test purposes? How does the user discover that they
> > > > are running in a degraded mode as far as persistence guarantees? I
> > > > think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're
> > > > not on a pmem platform.
> > >
> > > What degraded mode do you mean?
> >
> > Fall back to treating pmem like an SSD / block-device.
>
> If the dm-writecache driver cannot reliably flush cache - then the
> /dev/pmem block device driver couldn't reliably flush cache either.

Right, but the pmem driver warns when it can't detect that it is
running on a persistent memory enabled platform. I'm saying we should
turn that warning into an api so that in-kernel consumers of pmem can
make the same warning or determination to disable functionality on
non-pmem capable platforms.

> > > According to that document, flushing cache
> > > should be enough. If it is not enough, what else do I need to do to flush
> > > cache?
> >
> > ...ensure that you are on a platform where flushing the cpu cache is enough.
> >
> > > The above code is not for test purposes. It is for performance purposes.
> > >
> > > On dual-socket Broadwell server with persistent memory
> >
> > What platform is this... does it have ADR, does the BIOS produce an NFIT?
>
> I don't know. I already released the machine to someone else. If you
> want, I can try to re-acquire access to it.

As long as your driver consumes the new/TBD in-kernel api to determine
if the platform is pmem-enabled or not then I think we're good. My
concern is that we're optimizing the kernel interfaces for a platform
that does not support pmem, or requires a pmem support approach that
we would abandon when the clwb instruction can replace clflush usage.

>
> > > when we write to
> > > persistent memory using cached write instructions and use dax_flush
> > > afterwards to flush cache for the affected range, the performance is about
> > > 350MB/s. It is practically unusable - worse than low-end SSDs.
> > >
> > > On the other hand, the movnti instruction can sustain performance of one
> > > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
> > > only thing that must be done is to flush the write-combining buffer with
> > > the sfence instruction. Movnti has much better throughput than dax_flush.
> >
> > What about memcpy_flushcache?
>
> but
>
> - using memcpy_flushcache is overkill if we need just one or two 8-byte
> writes to the metadata area. Why not use movnti directly?
>

The driver performs so many 8-byte moves that the cost of the
memcpy_flushcache() function call significantly eats into your
performance?

> - on some architectures, memcpy_flushcache is just an alias for memcpy, so
> there will still be some arch-specific ifdefs

...but those should all be hidden by arch code, not in drivers.

> - there is no architecture-neutral way to guarantee ordering between
> multiple memcpy_flushcache calls. On x86, we need wmb(), on Power we
> don't, on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh),
> memcpy_flushcache doesn't - I don't know what the implications of this
> difference are); on other architectures, wmb() is insufficient to guarantee
> ordering between multiple memcpy_flushcache calls.

wmb() should always be sufficient to order stores on all architectures.

>
> - on Power and ARM64, memcpy_flushcache just does memcpy and flushes the
> cache afterwards. What is better for performance? Is it better to do
> multiple memcpy calls and later multiple dax_flush calls? Or is it better
> to 

Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-12-04 Thread Mikulas Patocka


On Mon, 27 Nov 2017, Dan Williams wrote:

> On Mon, Nov 27, 2017 at 8:01 PM, Mikulas Patocka  wrote:
> >
> >
> >
> > On Fri, 24 Nov 2017, Dan Williams wrote:
> >
> > > On Wed, Nov 22, 2017 at 6:36 PM, Mikulas Patocka  
> > > wrote:
> > > > Hi
> > > >
> > > > Here I'm sending a slightly updated dm-writecache target.
> > > >
> > >
> > > I would expect some theory of operation documentation in the changelog
> > > or in Documentation/ for a new driver.
> >
> > It caches writes for a block device in persistent memory. It doesn't cache
> > reads because reads are supposed to be cached in normal ram.
> 
> Is it a cache or more of a write-buffer? How aggressively does it
> write back data? Are there any recommendations about cache size vs
> backing device size? Does it also buffer sequential writes? For
> example bcache tries to avoid caching sequential traffic.

It caches all writes. When the size of the cache reaches a predefined 
threshold (50% by default), it starts writing cache content to the backing 
device.

The recommended cache size depends on the workload - if the user 
repeatedly overwrites only a small part of the disk, then a small cache is 
sufficient.

> > > > +/*
> > > > + * The clflushopt instruction is very slow on Broadwell, so we use 
> > > > non-temporal
> > > > + * stores instead.
> > > > + */
> > >
> > > The clflushopt instruction's first instantiation in a cpu was after
> > > Broadwell. Also, the performance of clflushopt does not much matter
> > > when running on a system without ADR (asynchronous DRAM refresh) where
> > > flushing the cpu cache can just leave the writes stranded on their way
> > > to the DIMM. ADR is not available on general purpose servers until
> > > ACPI 6.x NFIT-defined persistent memory platforms.
> >
> > There used to be a pcommit instruction, but Intel abandoned it - see
> > https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
> 
> Yes, I am familiar with that document, I wrote the pcommit removal
> patch and referenced ADR in the commit message:
> 
> fd1d961dd681 x86/insn: remove pcommit
> 
> > So, according to the document, flushing the cache should be enough for
> > writes to reach persistent memory.
> 
> The document assumes ADR is present.

Could there be a case where persistent memory is present, but ADR is not? 

> > > > +#ifdef CONFIG_X86_64
> > >
> > > In general something is broken if we end up with per-arch ifdefs like
> > > this in drivers to handle persistent memory. This should be using the
> > > pmem api or extensions of it, and we need to settle on a mechanism for
> > > upper-level drivers to ask if pmem is driving platform protected
> > > persistent memory.
> > >
> > > > +#define NT_STORE(dest, src)  asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > > > +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
> > > > +#define COMMIT_FLUSHED()   wmb()
> > > > +#else
> > > > +#define NT_STORE(dest, src)  ACCESS_ONCE(dest) = (src)
> > > > +#define FLUSH_RANGE  dax_flush
> > > > +#define COMMIT_FLUSHED()   do { } while (0)
> > >
> > > Is this just for test purposes? How does the user discover that they
> > > are running in a degraded mode as far as persistence guarantees? I
> > > think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're
> > > not on a pmem platform.
> >
> > What degraded mode do you mean?
> 
> Fall back to treating pmem like an SSD / block-device.

If the dm-writecache driver cannot reliably flush cache - then the 
/dev/pmem block device driver couldn't reliably flush cache either.

> > According to that document, flushing cache
> > should be enough. If it is not enough, what else do I need to do to flush
> > cache?
> 
> ...ensure that you are on a platform where flushing the cpu cache is enough.
> 
> > The above code is not for test purposes. It is for performance purposes.
> >
> > On dual-socket Broadwell server with persistent memory
> 
> What platform is this... does it have ADR, does the BIOS produce an NFIT?

I don't know. I already released the machine to someone else. If you 
want, I can try to re-acquire access to it.

> > when we write to
> > persistent memory using cached write instructions and use dax_flush
> > afterwards to flush cache for the affected range, the performance is about
> > 350MB/s. It is practically unusable - worse than low-end SSDs.
> >
> > On the other hand, the movnti instruction can sustain performance of one
> > 8-byte write per clock cycle. We don't have to flush cache afterwards, the
> > only thing that must be done is to flush the write-combining buffer with
> > the sfence instruction. Movnti has much better throughput than dax_flush.
> 
> What about memcpy_flushcache?

but

- using memcpy_flushcache is overkill if we need just one or two 8-byte 
writes to the metadata area. Why not use movnti directly?

- on some architectures, memcpy_flushcache is just an alias for memcpy, so 
there 

Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-11-27 Thread Dan Williams
On Mon, Nov 27, 2017 at 8:01 PM, Mikulas Patocka  wrote:
>
>
>
> On Fri, 24 Nov 2017, Dan Williams wrote:
>
> > On Wed, Nov 22, 2017 at 6:36 PM, Mikulas Patocka  
> > wrote:
> > > Hi
> > >
> > > Here I'm sending a slightly updated dm-writecache target.
> > >
> >
> > I would expect some theory of operation documentation in the changelog
> > or in Documentation/ for a new driver.
>
> It caches writes for a block device in persistent memory. It doesn't cache
> reads because reads are supposed to be cached in normal ram.

Is it a cache or more of a write-buffer? How aggressively does it
write back data? Are there any recommendations about cache size vs
backing device size? Does it also buffer sequential writes? For
example bcache tries to avoid caching sequential traffic.

>
> > > Signed-off-by: Mikulas Patocka 
> > >
> > [..]
> > > +/*
> > > + * The clflushopt instruction is very slow on Broadwell, so we use 
> > > non-temporal
> > > + * stores instead.
> > > + */
> >
> > The clflushopt instruction's first instantiation in a cpu was after
> > Broadwell. Also, the performance of clflushopt does not much matter
> > when running on a system without ADR (asynchronous DRAM refresh) where
> > flushing the cpu cache can just leave the writes stranded on their way
> > to the DIMM. ADR is not available on general purpose servers until
> > ACPI 6.x NFIT-defined persistent memory platforms.
>
> There used to be a pcommit instruction, but Intel abandoned it - see
> https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

Yes, I am familiar with that document, I wrote the pcommit removal
patch and referenced ADR in the commit message:

fd1d961dd681 x86/insn: remove pcommit

> So, according to the document, flushing the cache should be enough for
> writes to reach persistent memory.

The document assumes ADR is present.

> > > +#ifdef CONFIG_X86_64
> >
> > In general something is broken if we end up with per-arch ifdefs like
> > this in drivers to handle persistent memory. This should be using the
> > pmem api or extensions of it, and we need to settle on a mechanism for
> > upper-level drivers to ask if pmem is driving platform protected
> > persistent memory.
> >
> > > +#define NT_STORE(dest, src)  asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > > +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
> > > +#define COMMIT_FLUSHED()   wmb()
> > > +#else
> > > +#define NT_STORE(dest, src)  ACCESS_ONCE(dest) = (src)
> > > +#define FLUSH_RANGE  dax_flush
> > > +#define COMMIT_FLUSHED()   do { } while (0)
> >
> > Is this just for test purposes? How does the user discover that they
> > are running in a degraded mode as far as persistence guarantees? I
> > think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're
> > not on a pmem platform.
>
> What degraded mode do you mean?

Fall back to treating pmem like an SSD / block-device.

> According to that document, flushing cache
> should be enough. If it is not enough, what else do I need to do to flush
> cache?

...ensure that you are on a platform where flushing the cpu cache is enough.

> The above code is not for test purposes. It is for performance purposes.
>
> On dual-socket Broadwell server with persistent memory

What platform is this... does it have ADR, does the BIOS produce an NFIT?

> when we write to
> persistent memory using cached write instructions and use dax_flush
> afterwards to flush cache for the affected range, the performance is about
> 350MB/s. It is practically unusable - worse than low-end SSDs.
>
> On the other hand, the movnti instruction can sustain performance of one
> 8-byte write per clock cycle. We don't have to flush cache afterwards, the
> only thing that must be done is to flush the write-combining buffer with
> the sfence instruction. Movnti has much better throughput than dax_flush.

What about memcpy_flushcache?

> You argue that I should use the pmem api, but there is no pmem api
> providing the movnti instruction. I have no choice but to code the
> instruction directly in assembler.

You have not convinced me that memcpy_flushcache is insufficient, or
that the driver is mistakenly assuming guarantees about persistence
when it is not running on a persistent memory enabled platform.

> If you want to create an API, take that "NT_STORE" definition and move it to
> some x86-specific include file.

I'm asking you to not solve global problems in your local driver. If
the pmem api is deficient for your use case then let's work on
improving it, but as of now I'm still trying to baseline your
assumptions.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-11-27 Thread Mikulas Patocka


On Fri, 24 Nov 2017, Dan Williams wrote:

> On Wed, Nov 22, 2017 at 6:36 PM, Mikulas Patocka  wrote:
> > Hi
> >
> > Here I'm sending a slightly updated dm-writecache target.
> >
> 
> I would expect some theory of operation documentation in the changelog
> or in Documentation/ for a new driver.

It caches writes for a block device in persistent memory. It doesn't cache 
reads because reads are supposed to be cached in normal ram.

> > Signed-off-by: Mikulas Patocka 
> >
> [..]
> > +/*
> > + * The clflushopt instruction is very slow on Broadwell, so we use 
> > non-temporal
> > + * stores instead.
> > + */
> 
> The clflushopt instruction's first instantiation in a cpu was after
> Broadwell. Also, the performance of clflushopt does not much matter
> when running on a system without ADR (asynchronous DRAM refresh) where
> flushing the cpu cache can just leave the writes stranded on their way
> to the DIMM. ADR is not available on general purpose servers until
> ACPI 6.x NFIT-defined persistent memory platforms.

There used to be a pcommit instruction, but Intel abandoned it - see 
https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

So, according to the document, flushing the cache should be enough for 
writes to reach persistent memory.

> > +#ifdef CONFIG_X86_64
> 
> In general something is broken if we end up with per-arch ifdefs like
> this in drivers to handle persistent memory. This should be using the
> pmem api or extensions of it, and we need to settle on a mechanism for
> upper-level drivers to ask if pmem is driving platform protected
> persistent memory.
> 
> > +#define NT_STORE(dest, src)  asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
> > +#define COMMIT_FLUSHED()   wmb()
> > +#else
> > +#define NT_STORE(dest, src)  ACCESS_ONCE(dest) = (src)
> > +#define FLUSH_RANGE  dax_flush
> > +#define COMMIT_FLUSHED()   do { } while (0)
> 
> Is this just for test purposes? How does the user discover that they
> are running in a degraded mode as far as persistence guarantees? I
> think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're
> not on a pmem platform.

What degraded mode do you mean? According to that document, flushing cache 
should be enough. If it is not enough, what else do I need to do to flush 
cache?


The above code is not for test purposes. It is for performance purposes.

On dual-socket Broadwell server with persistent memory, when we write to 
persistent memory using cached write instructions and use dax_flush 
afterwards to flush cache for the affected range, the performance is about 
350MB/s. It is practically unusable - worse than low-end SSDs.

On the other hand, the movnti instruction can sustain performance of one 
8-byte write per clock cycle. We don't have to flush cache afterwards, the 
only thing that must be done is to flush the write-combining buffer with 
the sfence instruction. Movnti has much better throughput than dax_flush.
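
In code, the pattern described above is roughly the following (an
illustrative sketch only, not the driver's actual code): a batch of
non-temporal 8-byte stores, committed by a single sfence.

#include <stdint.h>
#include <stddef.h>

/* One non-temporal 8-byte store (x86-64): bypasses the cache and goes
 * through a write-combining buffer. */
static inline void nt_store_u64(volatile uint64_t *dst, uint64_t val)
{
	asm volatile("movnti %1, %0" : "=m" (*dst) : "r" (val));
}

static void nt_write_block(volatile uint64_t *dst, const uint64_t *src, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		nt_store_u64(&dst[i], src[i]);

	/* One sfence drains the write-combining buffers for all stores above. */
	asm volatile("sfence" ::: "memory");
}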

You argue that I should use the pmem api, but there is no pmem api 
providing the movnti instruction. I have no choice but to code the 
instruction directly in assembler.

If you want to create an API, take that "NT_STORE" definition and move it to 
some x86-specific include file.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-11-23 Thread Mikulas Patocka


On Thu, 23 Nov 2017, Dan Williams wrote:

> On Thu, Nov 23, 2017 at 12:29 AM, Christoph Hellwig  
> wrote:
> >> +/*
> >> + * The clflushopt instruction is very slow on Broadwell, so we use 
> >> non-temporal
> >> + * stores instead.
> >> + */
> >> +#ifdef CONFIG_X86_64
> >> +#define NT_STORE(dest, src)  asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> >> +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
> >> +#define COMMIT_FLUSHED() wmb()
> >> +#else
> >> +#define NT_STORE(dest, src)  ACCESS_ONCE(dest) = (src)
> >> +#define FLUSH_RANGE  dax_flush
> >> +#define COMMIT_FLUSHED() do { } while (0)
> >> +#endif
> >
> > Please go through the proper pmem APIs.
> 
> Yes, see _copy_from_iter_flushcache() and all its related apis.
> Drivers shouldn't be doing anything different than what fs/dax.c is
> using.

There is no arch-independent API that abstracts over a single movnti 
instruction.

Do you want to add it?

memcpy_flushcache is overkill for one or two 8-byte stores. And on 
non-x86 architectures, memcpy_flushcache doesn't flush the cache at all.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-11-23 Thread Dan Williams
On Thu, Nov 23, 2017 at 12:29 AM, Christoph Hellwig  wrote:
>> +/*
>> + * The clflushopt instruction is very slow on Broadwell, so we use 
>> non-temporal
>> + * stores instead.
>> + */
>> +#ifdef CONFIG_X86_64
>> +#define NT_STORE(dest, src)  asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
>> +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
>> +#define COMMIT_FLUSHED() wmb()
>> +#else
>> +#define NT_STORE(dest, src)  ACCESS_ONCE(dest) = (src)
>> +#define FLUSH_RANGE  dax_flush
>> +#define COMMIT_FLUSHED() do { } while (0)
>> +#endif
>
> Please go through the proper pmem APIs.

Yes, see _copy_from_iter_flushcache() and all its related apis.
Drivers shouldn't be doing anything different than what fs/dax.c is
using.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 2/2] dm-writecache

2017-11-23 Thread Christoph Hellwig
> +/*
> + * The clflushopt instruction is very slow on Broadwell, so we use 
> non-temporal
> + * stores instead.
> + */
> +#ifdef CONFIG_X86_64
> +#define NT_STORE(dest, src)  asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> +#define FLUSH_RANGE(dax, ptr, size)  do { } while (0)
> +#define COMMIT_FLUSHED() wmb()
> +#else
> +#define NT_STORE(dest, src)  ACCESS_ONCE(dest) = (src)
> +#define FLUSH_RANGE  dax_flush
> +#define COMMIT_FLUSHED() do { } while (0)
> +#endif

Please go through the proper pmem APIs.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel