Re: [dm-devel] [PATCH 2/2] dm-writecache
On Tue, Feb 13, 2018 at 5:24 PM, Mikulas Patocka wrote:
>
> On Tue, 13 Feb 2018, Dan Williams wrote:
>
>> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka wrote:
>> >
>> > On Fri, 8 Dec 2017, Dan Williams wrote:
>> >
>> >> > > > when we write to persistent memory using cached write instructions and use dax_flush afterwards to flush cache for the affected range, the performance is about 350MB/s. It is practically unusable - worse than low-end SSDs.
>> >> > > >
>> >> > > > On the other hand, the movnti instruction can sustain performance of one 8-byte write per clock cycle. We don't have to flush cache afterwards; the only thing that must be done is to flush the write-combining buffer with the sfence instruction. Movnti has much better throughput than dax_flush.
>> >> > >
>> >> > > What about memcpy_flushcache?
>> >> >
>> >> > but
>> >> >
>> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?
>> >>
>> >> The driver performs so many 8-byte moves that the cost of the memcpy_flushcache() function call significantly eats into your performance?
>> >
>> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2% lower throughput when it uses memcpy_flushcache() to update its metadata instead of explicitly coded "movnti" instructions.
>> >
>> > I've created this patch - it doesn't change the API in any way, but it optimizes memcpy_flushcache for 4-, 8- and 16-byte writes (that is what my driver mostly uses). With this patch, I can remove the explicit "asm" statements from my driver. Would you consider committing this patch to the kernel?
>> >
>> > Mikulas
>>
>> Yes, this looks good to me.
>> You can send it to the x86 folks with my:
>>
>> Reviewed-by: Dan Williams
>>
>> ...or let me know and I can chase it through the -tip tree. Either way works for me.
>
> If you have access to some tree that will be merged, you can commit the patch there.

Sure, no worries, I'll take it from here.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Tue, 13 Feb 2018, Dan Williams wrote:

> On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka wrote:
> >
> > On Fri, 8 Dec 2017, Dan Williams wrote:
> >
> >> > > > when we write to persistent memory using cached write instructions and use dax_flush afterwards to flush cache for the affected range, the performance is about 350MB/s. It is practically unusable - worse than low-end SSDs.
> >> > > >
> >> > > > On the other hand, the movnti instruction can sustain performance of one 8-byte write per clock cycle. We don't have to flush cache afterwards; the only thing that must be done is to flush the write-combining buffer with the sfence instruction. Movnti has much better throughput than dax_flush.
> >> > >
> >> > > What about memcpy_flushcache?
> >> >
> >> > but
> >> >
> >> > - using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?
> >>
> >> The driver performs so many 8-byte moves that the cost of the memcpy_flushcache() function call significantly eats into your performance?
> >
> > I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2% lower throughput when it uses memcpy_flushcache() to update its metadata instead of explicitly coded "movnti" instructions.
> >
> > I've created this patch - it doesn't change the API in any way, but it optimizes memcpy_flushcache for 4-, 8- and 16-byte writes (that is what my driver mostly uses). With this patch, I can remove the explicit "asm" statements from my driver. Would you consider committing this patch to the kernel?
> >
> > Mikulas
>
> Yes, this looks good to me. You can send it to the x86 folks with my:
>
> Reviewed-by: Dan Williams
>
> ...or let me know and I can chase it through the -tip tree. Either way works for me.
If you have access to some tree that will be merged, you can commit the patch there.

Mikulas
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Tue, Feb 13, 2018 at 2:00 PM, Mikulas Patocka wrote:
>
> On Fri, 8 Dec 2017, Dan Williams wrote:
>
>> > > > when we write to persistent memory using cached write instructions and use dax_flush afterwards to flush cache for the affected range, the performance is about 350MB/s. It is practically unusable - worse than low-end SSDs.
>> > > >
>> > > > On the other hand, the movnti instruction can sustain performance of one 8-byte write per clock cycle. We don't have to flush cache afterwards; the only thing that must be done is to flush the write-combining buffer with the sfence instruction. Movnti has much better throughput than dax_flush.
>> > >
>> > > What about memcpy_flushcache?
>> >
>> > but
>> >
>> > - using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?
>>
>> The driver performs so many 8-byte moves that the cost of the memcpy_flushcache() function call significantly eats into your performance?
>
> I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2% lower throughput when it uses memcpy_flushcache() to update its metadata instead of explicitly coded "movnti" instructions.
>
> I've created this patch - it doesn't change the API in any way, but it optimizes memcpy_flushcache for 4-, 8- and 16-byte writes (that is what my driver mostly uses). With this patch, I can remove the explicit "asm" statements from my driver. Would you consider committing this patch to the kernel?
>
> Mikulas

Yes, this looks good to me. You can send it to the x86 folks with my:

Reviewed-by: Dan Williams

...or let me know and I can chase it through the -tip tree. Either way works for me.
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Fri, 8 Dec 2017, Dan Williams wrote:

> > > > when we write to persistent memory using cached write instructions and use dax_flush afterwards to flush cache for the affected range, the performance is about 350MB/s. It is practically unusable - worse than low-end SSDs.
> > > >
> > > > On the other hand, the movnti instruction can sustain performance of one 8-byte write per clock cycle. We don't have to flush cache afterwards; the only thing that must be done is to flush the write-combining buffer with the sfence instruction. Movnti has much better throughput than dax_flush.
> > >
> > > What about memcpy_flushcache?
> >
> > but
> >
> > - using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?
>
> The driver performs so many 8-byte moves that the cost of the memcpy_flushcache() function call significantly eats into your performance?

I've measured it on Skylake i7-6700 - and the dm-writecache driver has 2% lower throughput when it uses memcpy_flushcache() to update its metadata instead of explicitly coded "movnti" instructions.

I've created this patch - it doesn't change the API in any way, but it optimizes memcpy_flushcache for 4-, 8- and 16-byte writes (that is what my driver mostly uses). With this patch, I can remove the explicit "asm" statements from my driver. Would you consider committing this patch to the kernel?

Mikulas


x86: optimize memcpy_flushcache

I use memcpy_flushcache in my persistent memory driver for metadata updates and it turns out that the overhead of memcpy_flushcache causes 2% performance degradation compared to the "movnti" instruction explicitly coded using inline assembler. This patch recognizes memcpy_flushcache calls with constant short length and turns them into inline assembler - so that I don't have to use inline assembler in the driver.
Signed-off-by: Mikulas Patocka

---
 arch/x86/include/asm/string_64.h | 20 +++-
 arch/x86/lib/usercopy_64.c       |  6 +++---
 2 files changed, 22 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/include/asm/string_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/string_64.h	2018-01-31 11:06:19.953577699 -0500
+++ linux-2.6/arch/x86/include/asm/string_64.h	2018-02-13 12:31:06.506810497 -0500
@@ -147,7 +147,25 @@ memcpy_mcsafe(void *dst, const void *src
 
 #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
 #define __HAVE_ARCH_MEMCPY_FLUSHCACHE 1
-void memcpy_flushcache(void *dst, const void *src, size_t cnt);
+void __memcpy_flushcache(void *dst, const void *src, size_t cnt);
+static __always_inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
+{
+	if (__builtin_constant_p(cnt)) {
+		switch (cnt) {
+		case 4:
+			asm ("movntil %1, %0" : "=m"(*(u32 *)dst) : "r"(*(u32 *)src));
+			return;
+		case 8:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			return;
+		case 16:
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)dst) : "r"(*(u64 *)src));
+			asm ("movntiq %1, %0" : "=m"(*(u64 *)(dst + 8)) : "r"(*(u64 *)(src + 8)));
+			return;
+		}
+	}
+	__memcpy_flushcache(dst, src, cnt);
+}
 #endif
 
 #endif /* __KERNEL__ */

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2018-01-31 11:06:19.988577678 -0500
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2018-02-13 11:56:40.249154414 -0500
@@ -133,7 +133,7 @@ long __copy_user_flushcache(void *dst, c
 	return rc;
 }
 
-void memcpy_flushcache(void *_dst, const void *_src, size_t size)
+void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
 	unsigned long dest = (unsigned long) _dst;
 	unsigned long source = (unsigned long) _src;
@@ -196,14 +196,14 @@ void memcpy_flushcache(void *_dst, const
 		clean_cache_range((void *) dest, size);
 	}
 }
-EXPORT_SYMBOL_GPL(memcpy_flushcache);
+EXPORT_SYMBOL_GPL(__memcpy_flushcache);
 
 void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 		size_t len)
 {
 	char *from = kmap_atomic(page);
-	memcpy_flushcache(to, from + offset, len);
+	__memcpy_flushcache(to, from + offset, len);
 	kunmap_atomic(from);
 }
 #endif
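For readers without a kernel tree handy, the effect the patch relies on can be sketched in userspace with the SSE2 intrinsics that compile to movnti and sfence. This is an illustrative mock, not the kernel code: `nt_store64` and `ordered_pair_store` are hypothetical names, and `_mm_stream_si64` requires an x86-64 compiler.

```c
#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence */
#include <stdint.h>

/* Non-temporal 8-byte store: compiles to movnti, which bypasses the
 * cache and goes through the CPU's write-combining buffer. */
static inline void nt_store64(uint64_t *dst, uint64_t val)
{
	_mm_stream_si64((long long *)dst, (long long)val);
}

/* Two ordered 8-byte writes, shaped like a metadata update: sfence
 * drains the write-combining buffer, so the store to *a is globally
 * visible before the store to *b is issued. */
static void ordered_pair_store(uint64_t *a, uint64_t *b)
{
	nt_store64(a, 3);
	_mm_sfence();
	nt_store64(b, 5);
	_mm_sfence();
}
```

The fast path in the patch above is exactly this: for a known 4-, 8- or 16-byte length the compiler can inline one or two non-temporal stores instead of paying for a call into `__memcpy_flushcache`.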
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Fri, Dec 22, 2017 at 12:56 PM, Mikulas Patocka wrote:
>
> On Fri, 8 Dec 2017, Dan Williams wrote:
>
>> > > What about memcpy_flushcache?
>> >
>> > but
>> >
>> > - using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?
>>
>> The driver performs so many 8-byte moves that the cost of the memcpy_flushcache() function call significantly eats into your performance?
>>
>> > - on some architectures, memcpy_flushcache is just an alias for memcpy, so there will still be some arch-specific ifdefs
>>
>> ...but those should all be hidden by arch code, not in drivers.
>>
>> > - there is no architecture-neutral way to guarantee ordering between multiple memcpy_flushcache calls. On x86, we need wmb(), on Power we don't, on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh), memcpy_flushcache doesn't - I don't know what the implications of this difference are); on other architectures, wmb() is insufficient to guarantee ordering between multiple memcpy_flushcache calls.
>>
>> wmb() should always be sufficient to order stores on all architectures.
>
> No, it isn't.
> See this example:
>
> uint64_t var_a, var_b;
>
> void fn(void)
> {
>         uint64_t val = 3;
>         memcpy_flushcache(&var_a, &val, 8);
>         wmb();
>         val = 5;
>         memcpy_flushcache(&var_b, &val, 8);
> }
>
> On x86-64, memcpy_flushcache is implemented using the movnti instruction (that writes the value to the write-combining buffer) and wmb() is compiled into the sfence instruction (that flushes the write-combining buffer) - so it's OK - it is guaranteed that the variable var_a is written to persistent memory before the variable var_b.
>
> However, on i686 (and most other architectures), memcpy_flushcache is just an alias for memcpy - see this code in include/linux/string.h:
>
> #ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
> static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
> {
>         memcpy(dst, src, cnt);
> }
> #endif
>
> So, memcpy_flushcache writes the variable var_a to the cache, wmb() flushes the pipeline, but otherwise does nothing (because on i686 all writes to the cache are already ordered) and then memcpy_flushcache writes the variable var_b to the cache - however, both writes end up in cache and wmb() doesn't flush the cache. So wmb() doesn't provide any sort of guarantee that var_a is written to persistent memory before var_b. If the cache sector containing the variable var_b is more contended than the cache sector containing the variable var_a, the CPU will flush the cache line containing var_b before var_a.

You're overlooking the fact that there is no persistent memory support for i686. The pmem driver will warn you of this fact:

    dev_warn(dev, "unable to guarantee persistence of writes\n");

The reason persistent memory is not supported on i686 is that older cpus do not guarantee that non-temporal stores always bypass the cache.

> We have the dax_flush() function that (unlike wmb) guarantees that the specified range is flushed from the cache and written to persistent memory. But dax_flush is very slow on x86-64.
dax_flush() does not make that guarantee on i686, and it's only slow on architectures that were not built for persistent memory support, i.e. architectures prior to the arrival of the clwb and clflushopt instructions.

> So, we have a situation where wmb() is fast but only guarantees ordering on x86-64, and dax_flush(), which guarantees ordering on all architectures but is slow on x86-64.
>
> So, my driver uses wmb() on x86-64 and dax_flush() on all the others.

...and I'm saying you're hard-coding for a cpu that was not built to support persistent memory efficiently in the first place.

> You argue that the driver shouldn't use any per-architecture #ifdefs, but there is no other way to do it - using dax_flush() on x86-64 kills performance, and using wmb() on all architectures is unreliable because wmb() doesn't guarantee cache flushing at all.
>
>> > At least for Broadwell, the write-back memory type on persistent memory has such horrible performance that it is not really usable.
>>
>> ...and my concern is that you're designing a pmem driver mechanism for Broadwell that predates the clwb instruction for efficient pmem access.
>
> That's what we have for testing. If I get access to some Skylake server, I can test whether using write combining is faster than cache flushing.

I'm still missing where memcpy_flushcache is unsuitable for your use case given that i686 does not support reliable persistent memory access.
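Whether a given cpu has the newer flush instructions Dan refers to can be checked from CPUID leaf 7. A minimal userspace probe (assuming gcc/clang on x86; bit positions are CPUID.(EAX=07H,ECX=0):EBX bit 23 for CLFLUSHOPT and bit 24 for CLWB):

```c
#include <cpuid.h>   /* __get_cpuid_count (gcc/clang, x86 only) */

/* Returns 1 if the cpu advertises CLFLUSHOPT, 0 otherwise. */
static int has_clflushopt(void)
{
	unsigned int a, b, c, d;
	if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
		return 0;
	return (b >> 23) & 1;
}

/* Returns 1 if the cpu advertises CLWB, 0 otherwise. */
static int has_clwb(void)
{
	unsigned int a, b, c, d;
	if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
		return 0;
	return (b >> 24) & 1;
}
```

On a Broadwell machine such as the one discussed in the thread, `has_clwb()` would return 0, which is the heart of Dan's objection to benchmarking cache-flush strategies there.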
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Fri, 8 Dec 2017, Dan Williams wrote:

> > > What about memcpy_flushcache?
> >
> > but
> >
> > - using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?
>
> The driver performs so many 8-byte moves that the cost of the memcpy_flushcache() function call significantly eats into your performance?
>
> > - on some architectures, memcpy_flushcache is just an alias for memcpy, so there will still be some arch-specific ifdefs
>
> ...but those should all be hidden by arch code, not in drivers.
>
> > - there is no architecture-neutral way to guarantee ordering between multiple memcpy_flushcache calls. On x86, we need wmb(), on Power we don't, on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh), memcpy_flushcache doesn't - I don't know what the implications of this difference are); on other architectures, wmb() is insufficient to guarantee ordering between multiple memcpy_flushcache calls.
>
> wmb() should always be sufficient to order stores on all architectures.

No, it isn't.
See this example:

uint64_t var_a, var_b;

void fn(void)
{
        uint64_t val = 3;
        memcpy_flushcache(&var_a, &val, 8);
        wmb();
        val = 5;
        memcpy_flushcache(&var_b, &val, 8);
}

On x86-64, memcpy_flushcache is implemented using the movnti instruction (that writes the value to the write-combining buffer) and wmb() is compiled into the sfence instruction (that flushes the write-combining buffer) - so it's OK - it is guaranteed that the variable var_a is written to persistent memory before the variable var_b.

However, on i686 (and most other architectures), memcpy_flushcache is just an alias for memcpy - see this code in include/linux/string.h:

#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
{
        memcpy(dst, src, cnt);
}
#endif

So, memcpy_flushcache writes the variable var_a to the cache, wmb() flushes the pipeline, but otherwise does nothing (because on i686 all writes to the cache are already ordered) and then memcpy_flushcache writes the variable var_b to the cache - however, both writes end up in cache and wmb() doesn't flush the cache. So wmb() doesn't provide any sort of guarantee that var_a is written to persistent memory before var_b. If the cache sector containing the variable var_b is more contended than the cache sector containing the variable var_a, the CPU will flush the cache line containing var_b before var_a.

We have the dax_flush() function that (unlike wmb) guarantees that the specified range is flushed from the cache and written to persistent memory. But dax_flush is very slow on x86-64.

So, we have a situation where wmb() is fast but only guarantees ordering on x86-64, and dax_flush(), which guarantees ordering on all architectures but is slow on x86-64.

So, my driver uses wmb() on x86-64 and dax_flush() on all the others.
You argue that the driver shouldn't use any per-architecture #ifdefs, but there is no other way to do it - using dax_flush() on x86-64 kills performance, and using wmb() on all architectures is unreliable because wmb() doesn't guarantee cache flushing at all.

> > At least for Broadwell, the write-back memory type on persistent memory has such horrible performance that it is not really usable.
>
> ...and my concern is that you're designing a pmem driver mechanism for Broadwell that predates the clwb instruction for efficient pmem access.

That's what we have for testing. If I get access to some Skylake server, I can test whether using write combining is faster than cache flushing.

Mikulas
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Mon, Dec 4, 2017 at 9:40 PM, Mikulas Patocka wrote:
> On Mon, 27 Nov 2017, Dan Williams wrote:
[..]
> > > So, according to the document, flushing the cache should be enough for writes to reach persistent memory.
> >
> > The document assumes ADR is present.
>
> Could there be a case where persistent memory is present, but ADR is not?

No, that's why it was safe to deprecate the pcommit instruction; software can assume that once writes reach the memory controller they are persistent, and there is no requirement to flush them through the memory controller to media, which is what the pcommit instruction performed.

> > > > > +#ifdef CONFIG_X86_64
> > > >
> > > > In general something is broken if we end up with per-arch ifdefs like this in drivers to handle persistent memory. This should be using the pmem api or extensions of it, and we need to settle on a mechanism for upper-level drivers to ask if pmem is driving platform protected persistent memory.
> > > >
> > > > > +#define NT_STORE(dest, src) asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > > > > +#define FLUSH_RANGE(dax, ptr, size) do { } while (0)
> > > > > +#define COMMIT_FLUSHED() wmb()
> > > > > +#else
> > > > > +#define NT_STORE(dest, src) ACCESS_ONCE(dest) = (src)
> > > > > +#define FLUSH_RANGE dax_flush
> > > > > +#define COMMIT_FLUSHED() do { } while (0)
> > > >
> > > > Is this just for test purposes? How does the user discover that they are running in a degraded mode as far as persistence guarantees? I think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're not on a pmem platform.
> > >
> > > What degraded mode do you mean?
> >
> > Fall back to treating pmem like an SSD / block-device.
>
> If the dm-writecache driver cannot reliably flush cache - then the /dev/pmem block device driver couldn't reliably flush cache either.
Right, but the pmem driver warns when it can't detect that it is running on a persistent-memory-enabled platform. I'm saying we should turn that warning into an api so that in-kernel consumers of pmem can make the same warning or determination to disable functionality on non-pmem-capable platforms.

> > > According to that document, flushing cache should be enough. If it is not enough, what else do I need to do to flush cache?
> >
> > ...ensure that you are on a platform where flushing the cpu cache is enough.
> >
> > > The above code is not for test purposes. It is for performance purposes.
> > >
> > > On dual-socket Broadwell server with persistent memory
> >
> > What platform is this... does it have ADR, does the BIOS produce an NFIT?
>
> I don't know. I already released the machine for someone else. If you want, I can try to re-acquire access to it.

As long as your driver consumes the new/TBD in-kernel api to determine if the platform is pmem-enabled or not then I think we're good. My concern is that we're optimizing the kernel interfaces for a platform that does not support pmem, or requires a pmem support approach that we would abandon when the clwb instruction can replace clflush usage.

> > > when we write to persistent memory using cached write instructions and use dax_flush afterwards to flush cache for the affected range, the performance is about 350MB/s. It is practically unusable - worse than low-end SSDs.
> > >
> > > On the other hand, the movnti instruction can sustain performance of one 8-byte write per clock cycle. We don't have to flush cache afterwards; the only thing that must be done is to flush the write-combining buffer with the sfence instruction. Movnti has much better throughput than dax_flush.
> >
> > What about memcpy_flushcache?
>
> but
>
> - using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?
>

The driver performs so many 8-byte moves that the cost of the memcpy_flushcache() function call significantly eats into your performance?

> - on some architectures, memcpy_flushcache is just an alias for memcpy, so there will still be some arch-specific ifdefs

...but those should all be hidden by arch code, not in drivers.

> - there is no architecture-neutral way to guarantee ordering between multiple memcpy_flushcache calls. On x86, we need wmb(), on Power we don't, on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh), memcpy_flushcache doesn't - I don't know what the implications of this difference are); on other architectures, wmb() is insufficient to guarantee ordering between multiple memcpy_flushcache calls.

wmb() should always be sufficient to order stores on all architectures.

> - on Power and ARM64, memcpy_flushcache just does memcpy and flushes the cache afterwards. What is better for performance? Is it better to do multiple memcpy calls and later multiple dax_flush calls? Or is it better to
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Mon, 27 Nov 2017, Dan Williams wrote:

> On Mon, Nov 27, 2017 at 8:01 PM, Mikulas Patocka wrote:
> >
> > On Fri, 24 Nov 2017, Dan Williams wrote:
> >
> > > On Wed, Nov 22, 2017 at 6:36 PM, Mikulas Patocka wrote:
> > > > Hi
> > > >
> > > > Here I'm sending a slightly updated dm-writecache target.
> > >
> > > I would expect some theory of operation documentation in the changelog or in Documentation/ for a new driver.
> >
> > It caches writes for a block device in persistent memory. It doesn't cache reads because reads are supposed to be cached in normal ram.
>
> Is it a cache or more of a write-buffer? How aggressively does it write back data? Are there any recommendations about cache size vs backing device size? Does it also buffer sequential writes? For example bcache tries to avoid caching sequential traffic.

It caches all writes. When the size of the cache reaches a predefined threshold (50% by default), it starts writing cache content to the backing device. The recommended cache size depends on the workload - if the user repeatedly overwrites only a small part of the disk, then a small cache is sufficient.

> > > > +/*
> > > > + * The clflushopt instruction is very slow on Broadwell, so we use
> > > > + * non-temporal stores instead.
> > > > + */
> > >
> > > The clflushopt instruction's first instantiation in a cpu was after Broadwell. Also, the performance of clflushopt does not much matter when running on a system without ADR (asynchronous DRAM refresh) where flushing the cpu cache can just leave the writes stranded on their way to the DIMM. ADR is not available on general purpose servers until ACPI 6.x NFIT-defined persistent memory platforms.
> > There used to be a pcommit instruction, but intel abandoned it - see
> > https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
>
> Yes, I am familiar with that document, I wrote the pcommit removal patch and referenced ADR in the commit message:
>
> fd1d961dd681 x86/insn: remove pcommit
>
> > So, according to the document, flushing the cache should be enough for writes to reach persistent memory.
>
> The document assumes ADR is present.

Could there be a case where persistent memory is present, but ADR is not?

> > > > +#ifdef CONFIG_X86_64
> > >
> > > In general something is broken if we end up with per-arch ifdefs like this in drivers to handle persistent memory. This should be using the pmem api or extensions of it, and we need to settle on a mechanism for upper-level drivers to ask if pmem is driving platform protected persistent memory.
> > >
> > > > +#define NT_STORE(dest, src) asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > > > +#define FLUSH_RANGE(dax, ptr, size) do { } while (0)
> > > > +#define COMMIT_FLUSHED() wmb()
> > > > +#else
> > > > +#define NT_STORE(dest, src) ACCESS_ONCE(dest) = (src)
> > > > +#define FLUSH_RANGE dax_flush
> > > > +#define COMMIT_FLUSHED() do { } while (0)
> > >
> > > Is this just for test purposes? How does the user discover that they are running in a degraded mode as far as persistence guarantees? I think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're not on a pmem platform.
> >
> > What degraded mode do you mean?
>
> Fall back to treating pmem like an SSD / block-device.

If the dm-writecache driver cannot reliably flush cache - then the /dev/pmem block device driver couldn't reliably flush cache either.

> > According to that document, flushing cache should be enough. If it is not enough, what else do I need to do to flush cache?
>
> ...ensure that you are on a platform where flushing the cpu cache is enough.
> > The above code is not for test purposes. It is for performance purposes.
> >
> > On dual-socket Broadwell server with persistent memory
>
> What platform is this... does it have ADR, does the BIOS produce an NFIT?

I don't know. I already released the machine for someone else. If you want, I can try to re-acquire access to it.

> > when we write to persistent memory using cached write instructions and use dax_flush afterwards to flush cache for the affected range, the performance is about 350MB/s. It is practically unusable - worse than low-end SSDs.
> >
> > On the other hand, the movnti instruction can sustain performance of one 8-byte write per clock cycle. We don't have to flush cache afterwards; the only thing that must be done is to flush the write-combining buffer with the sfence instruction. Movnti has much better throughput than dax_flush.
>
> What about memcpy_flushcache?

but

- using memcpy_flushcache is overkill if we need just one or two 8-byte writes to the metadata area. Why not use movnti directly?

- on some architectures, memcpy_flushcache is just an alias for memcpy, so there
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Mon, Nov 27, 2017 at 8:01 PM, Mikulas Patocka wrote:
>
> On Fri, 24 Nov 2017, Dan Williams wrote:
>
> > On Wed, Nov 22, 2017 at 6:36 PM, Mikulas Patocka wrote:
> > > Hi
> > >
> > > Here I'm sending a slightly updated dm-writecache target.
> >
> > I would expect some theory of operation documentation in the changelog or in Documentation/ for a new driver.
>
> It caches writes for a block device in persistent memory. It doesn't cache reads because reads are supposed to be cached in normal ram.

Is it a cache or more of a write-buffer? How aggressively does it write back data? Are there any recommendations about cache size vs backing device size? Does it also buffer sequential writes? For example bcache tries to avoid caching sequential traffic.

> > > Signed-off-by: Mikulas Patocka
> >
> > [..]
> > > +/*
> > > + * The clflushopt instruction is very slow on Broadwell, so we use
> > > + * non-temporal stores instead.
> > > + */
> >
> > The clflushopt instruction's first instantiation in a cpu was after Broadwell. Also, the performance of clflushopt does not much matter when running on a system without ADR (asynchronous DRAM refresh) where flushing the cpu cache can just leave the writes stranded on their way to the DIMM. ADR is not available on general purpose servers until ACPI 6.x NFIT-defined persistent memory platforms.
>
> There used to be a pcommit instruction, but intel abandoned it - see
> https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

Yes, I am familiar with that document, I wrote the pcommit removal patch and referenced ADR in the commit message:

fd1d961dd681 x86/insn: remove pcommit

> So, according to the document, flushing the cache should be enough for writes to reach persistent memory.

The document assumes ADR is present.

> > > +#ifdef CONFIG_X86_64
> >
> > In general something is broken if we end up with per-arch ifdefs like this in drivers to handle persistent memory.
> > This should be using the pmem api or extensions of it, and we need to settle on a mechanism for upper-level drivers to ask if pmem is driving platform protected persistent memory.
> >
> > > +#define NT_STORE(dest, src) asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > > +#define FLUSH_RANGE(dax, ptr, size) do { } while (0)
> > > +#define COMMIT_FLUSHED() wmb()
> > > +#else
> > > +#define NT_STORE(dest, src) ACCESS_ONCE(dest) = (src)
> > > +#define FLUSH_RANGE dax_flush
> > > +#define COMMIT_FLUSHED() do { } while (0)
> >
> > Is this just for test purposes? How does the user discover that they are running in a degraded mode as far as persistence guarantees? I think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're not on a pmem platform.
>
> What degraded mode do you mean?

Fall back to treating pmem like an SSD / block-device.

> According to that document, flushing cache should be enough. If it is not enough, what else do I need to do to flush cache?

...ensure that you are on a platform where flushing the cpu cache is enough.

> The above code is not for test purposes. It is for performance purposes.
>
> On dual-socket Broadwell server with persistent memory

What platform is this... does it have ADR, does the BIOS produce an NFIT?

> when we write to persistent memory using cached write instructions and use dax_flush afterwards to flush cache for the affected range, the performance is about 350MB/s. It is practically unusable - worse than low-end SSDs.
>
> On the other hand, the movnti instruction can sustain performance of one 8-byte write per clock cycle. We don't have to flush cache afterwards; the only thing that must be done is to flush the write-combining buffer with the sfence instruction. Movnti has much better throughput than dax_flush.

What about memcpy_flushcache?

> You argue that I should use the pmem api, but there is no pmem api providing the movnti instruction.
I have no choice, but coding the > instruction directly in assembler. You have not convinced me that memcpy_flushcache is insufficient, or that the driver is mistakenly assuming guarantees about persistence when it is not running on a persistent memory enabled platform. > If you want to create API, take that "NT_STORE" definition and move it to > some x86-specific include file. I'm asking you to not solve global problems in your local driver. If the pmem api is deficient for your use case then let's work on improving it, but as of now I'm still trying to baseline your assumptions. -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel
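[For reference: the memcpy_flushcache() pattern being suggested amounts to "copy the data, then flush the affected cache lines". A minimal userspace sketch follows; the names flush_range_stub, memcpy_flushcache_stub, and update_seq_count are hypothetical, and the flush is a no-op placeholder standing in for the kernel's clwb/clflushopt loop.]

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the kernel's memcpy_flushcache():
 * copy the data, then flush the cache lines covering it so the stores
 * reach the pmem media. The flush here is a placeholder; in the kernel
 * it is clwb/clflushopt or other arch-specific code.
 */
static void flush_range_stub(const void *addr, size_t len)
{
    (void)addr;
    (void)len; /* real code would clwb each line in [addr, addr + len) */
}

static void memcpy_flushcache_stub(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    flush_range_stub(dst, len);
}

/* Example metadata update: persist one 8-byte sequence counter. */
static void update_seq_count(uint64_t *pmem_field, uint64_t seq)
{
    memcpy_flushcache_stub(pmem_field, &seq, sizeof(seq));
}
```

The point of contention in the thread is the cost of this generic path when len is a known small constant, not its correctness.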
Re: [dm-devel] [PATCH 2/2] dm-writecache
On Fri, 24 Nov 2017, Dan Williams wrote:
> On Wed, Nov 22, 2017 at 6:36 PM, Mikulas Patocka wrote:
> > Hi
> >
> > Here I'm sending a slightly updated dm-writecache target.
> >
> I would expect some theory of operation documentation in the changelog
> or in Documentation/ for a new driver.

It caches writes for a block device in persistent memory. It doesn't cache
reads because reads are supposed to be cached in normal RAM.

> > Signed-off-by: Mikulas Patocka
>
> [..]
> > +/*
> > + * The clflushopt instruction is very slow on Broadwell, so we use non-temporal
> > + * stores instead.
> > + */
>
> The clflushopt instruction's first instantiation in a cpu was after
> Broadwell. Also, the performance of clflushopt does not much matter
> when running on a system without ADR (asynchronous DRAM refresh) where
> flushing the cpu cache can just leave the writes stranded on their way
> to the DIMM. ADR is not available on general purpose servers until
> ACPI 6.x NFIT-defined persistent memory platforms.

There used to be a pcommit instruction, but Intel abandoned it - see
https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction

So, according to the document, flushing the cache should be enough for
writes to reach persistent memory.

> > +#ifdef CONFIG_X86_64
>
> In general something is broken if we end up with per-arch ifdefs like
> this in drivers to handle persistent memory. This should be using the
> pmem api or extensions of it, and we need to settle on a mechanism for
> upper-level drivers to ask if pmem is driving platform protected
> persistent memory.
>
> > +#define NT_STORE(dest, src) asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> > +#define FLUSH_RANGE(dax, ptr, size) do { } while (0)
> > +#define COMMIT_FLUSHED() wmb()
> > +#else
> > +#define NT_STORE(dest, src) ACCESS_ONCE(dest) = (src)
> > +#define FLUSH_RANGE dax_flush
> > +#define COMMIT_FLUSHED() do { } while (0)
>
> Is this just for test purposes? How does the user discover that they
> are running in a degraded mode as far as persistence guarantees? I
> think we should be falling back to DM_WRITECACHE_ONLY_SSD mode if we're
> not on a pmem platform.

What degraded mode do you mean? According to that document, flushing cache
should be enough. If it is not enough, what else do I need to do to flush
cache?

The above code is not for test purposes. It is for performance purposes.

On a dual-socket Broadwell server with persistent memory, when we write to
persistent memory using cached write instructions and use dax_flush
afterwards to flush cache for the affected range, the performance is about
350MB/s. It is practically unusable - worse than low-end SSDs.

On the other hand, the movnti instruction can sustain performance of one
8-byte write per clock cycle. We don't have to flush cache afterwards; the
only thing that must be done is to flush the write-combining buffer with
the sfence instruction. Movnti has much better throughput than dax_flush.

You argue that I should use the pmem api, but there is no pmem api
providing the movnti instruction. I have no choice but to code the
instruction directly in assembler.

If you want to create an API, take that "NT_STORE" definition and move it
to some x86-specific include file.

Mikulas
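[The non-temporal store path described above can be sketched in userspace roughly as follows. The names nt_store64 and commit_flushed are hypothetical stand-ins for the driver's NT_STORE and COMMIT_FLUSHED macros; the non-x86 fallback is a plain store and carries none of the persistence semantics.]

```c
#include <stdint.h>

/*
 * Sketch of the non-temporal store path: a single movnti bypasses the
 * cache, and one sfence afterwards drains the write-combining buffers.
 */
static inline void nt_store64(uint64_t *dst, uint64_t val)
{
#if defined(__x86_64__)
    __asm__ volatile("movnti %1, %0" : "=m"(*dst) : "r"(val));
#else
    *dst = val; /* placeholder only; no non-temporal semantics */
#endif
}

static inline void commit_flushed(void)
{
#if defined(__x86_64__)
    __asm__ volatile("sfence" ::: "memory");
#endif
}
```

Several nt_store64() calls can be batched under a single commit_flushed(), which is where the claimed throughput advantage over flush-after-write comes from.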
On Thu, 23 Nov 2017, Dan Williams wrote:
> On Thu, Nov 23, 2017 at 12:29 AM, Christoph Hellwig wrote:
> >> +/*
> >> + * The clflushopt instruction is very slow on Broadwell, so we use non-temporal
> >> + * stores instead.
> >> + */
> >> +#ifdef CONFIG_X86_64
> >> +#define NT_STORE(dest, src) asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> >> +#define FLUSH_RANGE(dax, ptr, size) do { } while (0)
> >> +#define COMMIT_FLUSHED() wmb()
> >> +#else
> >> +#define NT_STORE(dest, src) ACCESS_ONCE(dest) = (src)
> >> +#define FLUSH_RANGE dax_flush
> >> +#define COMMIT_FLUSHED() do { } while (0)
> >> +#endif
> >
> > Please go through the proper pmem APIs.
>
> Yes, see _copy_from_iter_flushcache() and all its related apis.
> Drivers shouldn't be doing anything different than what fs/dax.c is
> using.

There is not any arch-independent API that abstracts over a single movnti
instruction. Do you want to add it?

memcpy_flushcache is overkill for one or two 8-byte stores. And - on
non-x86 architectures - memcpy_flushcache doesn't flush the cache at all.

Mikulas
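[One way to resolve this tension - and, per the head of the thread, roughly what the eventual patch does for constant 4-, 8-, and 16-byte writes - is to specialize the small constant-size cases. This is a hypothetical sketch: memcpy_flushcache_inline and memcpy_flushcache_generic are invented names, __builtin_constant_p is GCC/Clang-specific, and the plain volatile stores stand in for the movnti instructions the real x86-64 code would emit.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Generic slow-path stand-in: copy, then flush (flush omitted here). */
static void memcpy_flushcache_generic(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
    /* real kernel code also flushes the cache lines covering dst..dst+n */
}

/*
 * Hypothetical front end: when the size is a compile-time constant 4 or 8
 * bytes, emit a direct store (a non-temporal one on x86-64 in the real
 * patch) instead of paying the call and loop overhead of the generic path.
 */
#define memcpy_flushcache_inline(dst, src, n)                          \
    do {                                                               \
        if (__builtin_constant_p(n) && (n) == 4) {                     \
            uint32_t __v;                                              \
            memcpy(&__v, (src), 4);                                    \
            *(volatile uint32_t *)(dst) = __v;                         \
        } else if (__builtin_constant_p(n) && (n) == 8) {              \
            uint64_t __v;                                              \
            memcpy(&__v, (src), 8);                                    \
            *(volatile uint64_t *)(dst) = __v;                         \
        } else {                                                       \
            memcpy_flushcache_generic((dst), (src), (n));              \
        }                                                              \
    } while (0)
```

Callers keep the memcpy_flushcache-style interface, so no per-driver asm is needed, while constant 8-byte metadata updates compile down to a single store.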
On Thu, Nov 23, 2017 at 12:29 AM, Christoph Hellwig wrote:
>> +/*
>> + * The clflushopt instruction is very slow on Broadwell, so we use non-temporal
>> + * stores instead.
>> + */
>> +#ifdef CONFIG_X86_64
>> +#define NT_STORE(dest, src) asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
>> +#define FLUSH_RANGE(dax, ptr, size) do { } while (0)
>> +#define COMMIT_FLUSHED() wmb()
>> +#else
>> +#define NT_STORE(dest, src) ACCESS_ONCE(dest) = (src)
>> +#define FLUSH_RANGE dax_flush
>> +#define COMMIT_FLUSHED() do { } while (0)
>> +#endif
>
> Please go through the proper pmem APIs.

Yes, see _copy_from_iter_flushcache() and all its related apis.
Drivers shouldn't be doing anything different than what fs/dax.c is
using.
> +/*
> + * The clflushopt instruction is very slow on Broadwell, so we use non-temporal
> + * stores instead.
> + */
> +#ifdef CONFIG_X86_64
> +#define NT_STORE(dest, src) asm ("movnti %1, %0" : "=m"(dest) : "r"(src))
> +#define FLUSH_RANGE(dax, ptr, size) do { } while (0)
> +#define COMMIT_FLUSHED() wmb()
> +#else
> +#define NT_STORE(dest, src) ACCESS_ONCE(dest) = (src)
> +#define FLUSH_RANGE dax_flush
> +#define COMMIT_FLUSHED() do { } while (0)
> +#endif

Please go through the proper pmem APIs.