Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-03 Thread Dan Williams
On Tue, Jan 3, 2017 at 5:59 PM, Al Viro  wrote:
> On Tue, Jan 03, 2017 at 05:38:54PM -0800, Dan Williams wrote:
>> > 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
>> > having only used movnt; it does not attempt clwb at all.
>>
>> Yes, and there was a fix a while back to make sure it always used
>> movnt so clwb after the fact is not required:
>>
>> a82eee742452 x86/uaccess/64: Handle the caching of 4-byte nocache
>> copies properly in __copy_user_nocache()
>>
>> > 2) __copy_from_user_nocache() for short copies does not use movnt at all.
>> > In that case neither sfence nor clwb is issued.
>>
>> For the 32bit case, yes, but the pmem driver should warn about this
>> when it checks platform persistent memory capabilities (i.e. x86 32bit
>> not supported). Ugh, we may have lost that warning for this specific
>> case recently, I'll go double check and fix it up.
>>
>> > 3) it uses movnt only for part of copying in case of misaligned copy;
>> > No clwb is issued, but sfence *is* - at the very end in 64bit case,
>> > between movnt and copying the tail - in 32bit one.  Incidentally,
>> > while 64bit case takes care to align the destination for movnt part,
>> > 32bit one does not.
>> >
>> > How much of the above is broken and what do the callers rely upon?
>>
>> 32bit issues are known, but 64bit path is ok since that fix above.
>
> Bollocks.  That fix above does *NOT* eliminate all cached stores.  Just look
> at the damn function - it still does cached stores until the target is
> aligned and it does the same for tail when end of destination is not aligned.
> Right there in arch/x86/lib/copy_user_64.S.

No, it does not eliminate all cached stores, but the cases where we use
it have naturally aligned targets.

Yes, it is terrible to then wrap it in a memcpy_to_pmem() wrapper
which does not document these alignment constraints.
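
The alignment constraint argued over here can be made concrete with a
small model. This is only a sketch, not the code in
arch/x86/lib/copy_user_64.S: it assumes non-temporal stores are issued
solely in 8-byte units to an 8-byte-aligned destination, and the names
nt_split, nt_split_copy() and nt_copy_is_fully_uncached() are
hypothetical. It shows which parts of a copy would fall back to ordinary
cached stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model, not a kernel API. */
struct nt_split {
    size_t head;  /* bytes written with cached stores until dst is aligned */
    size_t body;  /* bytes written with 8-byte non-temporal stores */
    size_t tail;  /* leftover bytes written with cached stores */
};

/* Split a copy of len bytes to address dst into head/body/tail. */
struct nt_split nt_split_copy(uintptr_t dst, size_t len)
{
    struct nt_split s;
    size_t misalign = dst & 7;

    s.head = misalign ? 8 - misalign : 0;
    if (s.head > len)
        s.head = len;               /* short copy never reaches alignment */
    s.body = (len - s.head) & ~(size_t)7;
    s.tail = len - s.head - s.body;
    return s;
}

/* Under this model the cache is bypassed entirely only when both the
 * head and the tail are empty. */
int nt_copy_is_fully_uncached(uintptr_t dst, size_t len)
{
    struct nt_split s = nt_split_copy(dst, len);

    return s.head == 0 && s.tail == 0;
}
```

Under these assumptions a page-aligned, page-sized copy never touches
the cache, while a 100-byte copy to an address ending in 3 leaves 5
leading and 7 trailing bytes in cached stores, which is the case a
caller without a documented alignment rule can hit.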


Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-03 Thread Al Viro
On Tue, Jan 03, 2017 at 05:38:54PM -0800, Dan Williams wrote:
> > 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
> > having only used movnt; it does not attempt clwb at all.
> 
> Yes, and there was a fix a while back to make sure it always used
> movnt so clwb after the fact is not required:
> 
> a82eee742452 x86/uaccess/64: Handle the caching of 4-byte nocache
> copies properly in __copy_user_nocache()
> 
> > 2) __copy_from_user_nocache() for short copies does not use movnt at all.
> > In that case neither sfence nor clwb is issued.
> 
> For the 32bit case, yes, but the pmem driver should warn about this
> when it checks platform persistent memory capabilities (i.e. x86 32bit
> not supported). Ugh, we may have lost that warning for this specific
> case recently, I'll go double check and fix it up.
> 
> > 3) it uses movnt only for part of copying in case of misaligned copy;
> > No clwb is issued, but sfence *is* - at the very end in 64bit case,
> > between movnt and copying the tail - in 32bit one.  Incidentally,
> > while 64bit case takes care to align the destination for movnt part,
> > 32bit one does not.
> >
> > How much of the above is broken and what do the callers rely upon?
> 
> 32bit issues are known, but 64bit path is ok since that fix above.

Bollocks.  That fix above does *NOT* eliminate all cached stores.  Just look
at the damn function - it still does cached stores until the target is
aligned and it does the same for tail when end of destination is not aligned.
Right there in arch/x86/lib/copy_user_64.S.

> > In particular, is that sfence the right thing for pmem usecases?
> 
> That sfence is not there for pmem purposes. The dax / pmem usage does
> not expect memcpy_to_pmem() to fence as it may have more writes to
> queue up and amortize all the writes with a later fence. This seems to
> be even more evidence for moving this functionality away from the
> uaccess routines to somewhere more pmem specific.


Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-03 Thread Dan Williams
On Tue, Jan 3, 2017 at 3:22 PM, Al Viro  wrote:
> On Tue, Jan 03, 2017 at 01:14:11PM -0800, Dan Williams wrote:
>
>> Robert was describing the overall flow / mechanics, but I think it is
>> easier to visualize the sfence as a flush command sent to a disk
>> device with a volatile cache. In fact, that's how we implemented it in
>> the pmem block device driver. The pmem block device registers itself
>> as requiring REQ_FLUSH to be sent to persist writes. The driver issues
>> sfence on the assumption that all writes to pmem have either bypassed
>> the cache with movnt, or are scheduled for write-back via one of the
>> flush instructions (clflush, clwb, or clflushopt).
>
> *blink*
>
> 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
> having only used movnt; it does not attempt clwb at all.

Yes, and there was a fix a while back to make sure it always used
movnt so clwb after the fact is not required:

a82eee742452 x86/uaccess/64: Handle the caching of 4-byte nocache
copies properly in __copy_user_nocache()

> 2) __copy_from_user_nocache() for short copies does not use movnt at all.
> In that case neither sfence nor clwb is issued.

For the 32bit case, yes, but the pmem driver should warn about this
when it checks platform persistent memory capabilities (i.e. x86 32bit
not supported). Ugh, we may have lost that warning for this specific
case recently, I'll go double check and fix it up.

> 3) it uses movnt only for part of copying in case of misaligned copy;
> No clwb is issued, but sfence *is* - at the very end in 64bit case,
> between movnt and copying the tail - in 32bit one.  Incidentally,
> while 64bit case takes care to align the destination for movnt part,
> 32bit one does not.
>
> How much of the above is broken and what do the callers rely upon?

32bit issues are known, but 64bit path is ok since that fix above.

> In particular, is that sfence the right thing for pmem usecases?

That sfence is not there for pmem purposes. The dax / pmem usage does
not expect memcpy_to_pmem() to fence as it may have more writes to
queue up and amortize all the writes with a later fence. This seems to
be even more evidence for moving this functionality away from the
uaccess routines to somewhere more pmem specific.
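
The "queue up writes, fence later" usage described above can be
sketched in user space roughly as follows. This is an illustration
under stated assumptions, not the pmem driver's code: nt_copy_nofence()
and nt_fence() are hypothetical names, the copy refuses anything that is
not 8-byte aligned and a multiple of 8 bytes long rather than falling
back to cached stores, and on non-x86-64 builds it degrades to a plain
memcpy() with no fence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>          /* _mm_stream_si64(), _mm_sfence() */
#define HAVE_MOVNT 1
#else
#define HAVE_MOVNT 0
#endif

/* Copy with non-temporal stores only; deliberately issues no fence.
 * Returns -1 if dst is not 8-byte aligned or len is not a multiple of 8,
 * so no cached head or tail stores can sneak in. */
int nt_copy_nofence(void *dst, const void *src, size_t len)
{
    if (((uintptr_t)dst & 7) || (len & 7))
        return -1;
#if HAVE_MOVNT
    long long *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < len / 8; i++) {
        long long v;

        memcpy(&v, s + 8 * i, sizeof(v));   /* src may be unaligned */
        _mm_stream_si64(&d[i], v);          /* movnti */
    }
#else
    memcpy(dst, src, len);                  /* fallback: ordinary stores */
#endif
    return 0;
}

/* One fence for everything queued so far. */
void nt_fence(void)
{
#if HAVE_MOVNT
    _mm_sfence();
#endif
}
```

A caller would issue any number of nt_copy_nofence() calls and then a
single nt_fence(), which mirrors the amortization argument: the cost of
the fence is paid once per batch rather than once per copy.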


Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-03 Thread Dan Williams
On Tue, Jan 3, 2017 at 3:46 PM, Linus Torvalds
 wrote:
> On Tue, Jan 3, 2017 at 3:22 PM, Al Viro  wrote:
>>
>> 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
>> having only used movnt; it does not attempt clwb at all.
>>
>> 2) __copy_from_user_nocache() for short copies does not use movnt at all.
>> In that case neither sfence nor clwb is issued.
>
> Quite frankly, the whole "memcpy_nocache()" idea or (ab-)using
> copy_user_nocache() just needs to die. It's idiotic.
>
> As you point out, it's also fundamentally buggy crap.
>
> Throw it away. There is no possible way this is ever valid or
> portable. We're not going to lie and claim that it is.
>
> If some driver ends up using "movnt" by hand, that is up to that
> *driver*. But no way in hell should we care about this one whit in the
> sense of . Get rid of that shit.
>
> So Al - just ignore this whole issue. It's not your headache. Any code
> that tries to depend on some non-caching memcpy is terminally buggy,
> and those code paths need to fix themselves, not ask others to fix
> their braindamage for them.

It's not Al's headache and our usage of __copy_from_user_nocache is a
blatant abuse, but the discussion is worth having because this is not
the first time we've struggled with the pmem api and the balance
between what functionality should be in fs/dax.c vs
drivers/nvdimm/pmem.c.

The stumbling block in the past to relegating all pmem accesses to the
driver is not wanting to further expand block_device_operations with
more dax specifics beyond the ->direct_access() operation we already
have.

I can think of gross ways of moving dax_iomap_actor() into the driver,
but perhaps less gross than burdening the uaccess.h maintainer with
pmem abuses.

This would also allow us to drop the needless cache maintenance for
dax capable drivers like brd that are fronting volatile memory.


Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-03 Thread Linus Torvalds
On Tue, Jan 3, 2017 at 3:22 PM, Al Viro  wrote:
>
> 1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
> having only used movnt; it does not attempt clwb at all.
>
> 2) __copy_from_user_nocache() for short copies does not use movnt at all.
> In that case neither sfence nor clwb is issued.

Quite frankly, the whole "memcpy_nocache()" idea or (ab-)using
copy_user_nocache() just needs to die. It's idiotic.

As you point out, it's also fundamentally buggy crap.

Throw it away. There is no possible way this is ever valid or
portable. We're not going to lie and claim that it is.

If some driver ends up using "movnt" by hand, that is up to that
*driver*. But no way in hell should we care about this one whit in the
sense of . Get rid of that shit.

So Al - just ignore this whole issue. It's not your headache. Any code
that tries to depend on some non-caching memcpy is terminally buggy,
and those code paths need to fix themselves, not ask others to fix
their braindamage for them.

 Linus


Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-03 Thread Al Viro
On Tue, Jan 03, 2017 at 01:14:11PM -0800, Dan Williams wrote:

> Robert was describing the overall flow / mechanics, but I think it is
> easier to visualize the sfence as a flush command sent to a disk
> device with a volatile cache. In fact, that's how we implemented it in
> the pmem block device driver. The pmem block device registers itself
> as requiring REQ_FLUSH to be sent to persist writes. The driver issues
> sfence on the assumption that all writes to pmem have either bypassed
> the cache with movnt, or are scheduled for write-back via one of the
> flush instructions (clflush, clwb, or clflushopt).

*blink*

1) memcpy_to_pmem() seems to rely upon the __copy_from_user_nocache()
having only used movnt; it does not attempt clwb at all.

2) __copy_from_user_nocache() for short copies does not use movnt at all.
In that case neither sfence nor clwb is issued.

3) it uses movnt only for part of copying in case of misaligned copy;
No clwb is issued, but sfence *is* - at the very end in 64bit case,
between movnt and copying the tail - in 32bit one.  Incidentally,
while 64bit case takes care to align the destination for movnt part,
32bit one does not.

How much of the above is broken and what do the callers rely upon?  In
particular, is that sfence the right thing for pmem usecases?


Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-03 Thread Dan Williams
On Sun, Jan 1, 2017 at 9:09 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> On Mon, Jan 02, 2017 at 02:35:36AM +, Elliott, Robert (Persistent Memory) 
> wrote:
>> > -Original Message-
>> > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
>> > ow...@vger.kernel.org] On Behalf Of Al Viro
>> > Sent: Friday, December 30, 2016 8:26 PM
>> > Subject: [RFC] memcpy_nocache() and memcpy_writethrough()
>> >
>> ...
>> > Why does pmem need writethrough warranties, anyway?
>>
>> Using either
>> * nontemporal store instructions; or
>> * following regular store instructions with a sequence of cache flush
>> and store fence instructions (e.g., clflushopt or clwb + sfence)
>>
>> ensures that write data has reached an "ADR-safe zone" that the system
>> promises will be persistent even if there is a surprise power loss or
>> a CPU suffers from an error that isn't totally catastrophic (e.g., the
>> CPU getting disconnected from the SDRAM will always lose data on an
>> NVDIMM-N).
>
> Wait a sec...  In which places do you need sfence in all that?  movnt*
> itself can be reordered, right?  So using that for copying and storing
> the pointer afterwards would still need sfence in between, unless I'm
> seriously misunderstanding the situation...

Robert was describing the overall flow / mechanics, but I think it is
easier to visualize the sfence as a flush command sent to a disk
device with a volatile cache. In fact, that's how we implemented it in
the pmem block device driver. The pmem block device registers itself
as requiring REQ_FLUSH to be sent to persist writes. The driver issues
sfence on the assumption that all writes to pmem have either bypassed
the cache with movnt, or are scheduled for write-back via one of the
flush instructions (clflush, clwb, or clflushopt).

>> Newly written data becomes globally visible before it becomes ADR-safe.
>> This means software could act on the new data before a power loss, then
>> see the old data reappear after the power loss - not good.  Software
>> needs to understand that any data in the process of being written is
>> indeterminate until the persistence guarantee is met.  The BTT shows
>> one way that software can avoid that problem.
>
> Joy.  What happens in terms of latency?  I.e. how much of a stall does
> clwb inflict?

Unlike clflush, clwb is unordered, so it has lower overhead. It
schedules writeback, but does not wait for it to complete. The
clflushopt instruction is also unordered, but in addition to writeback
it also invalidates the line.
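
For illustration, the flush-then-fence pattern over a buffer might look
like the sketch below. It is not kernel code: flush_range() is a
hypothetical name, a 64-byte cache line is assumed, and it uses clflush
(via _mm_clflush()) only because that intrinsic is available on every
x86-64 build; clwb or clflushopt would take its place where the CPU and
compiler flags allow. On other architectures the flush compiles away and
only the line count remains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>          /* _mm_clflush(), _mm_sfence() */
#endif

#define CACHE_LINE_SIZE 64      /* assumption: 64-byte cache lines */

/* Flush every cache line overlapping [addr, addr + len), then fence.
 * Returns the number of lines visited. */
size_t flush_range(const void *addr, size_t len)
{
    uintptr_t p, end;
    size_t lines = 0;

    if (len == 0)
        return 0;

    p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE_SIZE) {
#if defined(__x86_64__)
        _mm_clflush((const void *)p);
#endif
        lines++;
    }
#if defined(__x86_64__)
    /* Required after the unordered clwb/clflushopt; harmless here. */
    _mm_sfence();
#endif
    return lines;
}
```

Note that an 8-byte range straddling a line boundary costs two flushes,
which is one reason the per-line overhead of the chosen instruction
matters.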


Re: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-01 Thread Al Viro
On Mon, Jan 02, 2017 at 02:35:36AM +, Elliott, Robert (Persistent Memory) 
wrote:
> > -Original Message-
> > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> > ow...@vger.kernel.org] On Behalf Of Al Viro
> > Sent: Friday, December 30, 2016 8:26 PM
> > Subject: [RFC] memcpy_nocache() and memcpy_writethrough()
> > 
> ...
> > Why does pmem need writethrough warranties, anyway?  
> 
> Using either 
> * nontemporal store instructions; or
> * following regular store instructions with a sequence of cache flush
> and store fence instructions (e.g., clflushopt or clwb + sfence)
> 
> ensures that write data has reached an "ADR-safe zone" that the system
> promises will be persistent even if there is a surprise power loss or
> a CPU suffers from an error that isn't totally catastrophic (e.g., the
> CPU getting disconnected from the SDRAM will always lose data on an
> NVDIMM-N).

Wait a sec...  In which places do you need sfence in all that?  movnt*
itself can be reordered, right?  So using that for copying and storing
the pointer afterwards would still need sfence in between, unless I'm
seriously misunderstanding the situation...

> Newly written data becomes globally visible before it becomes ADR-safe.
> This means software could act on the new data before a power loss, then
> see the old data reappear after the power loss - not good.  Software
> needs to understand that any data in the process of being written is
> indeterminate until the persistence guarantee is met.  The BTT shows
> one way that software can avoid that problem.

Joy.  What happens in terms of latency?  I.e. how much of a stall does
clwb inflict?


RE: [RFC] memcpy_nocache() and memcpy_writethrough()

2017-01-01 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Al Viro
> Sent: Friday, December 30, 2016 8:26 PM
> Subject: [RFC] memcpy_nocache() and memcpy_writethrough()
> 
...
> Why does pmem need writethrough warranties, anyway?  

Using either 
* nontemporal store instructions; or
* following regular store instructions with a sequence of cache flush
and store fence instructions (e.g., clflushopt or clwb + sfence)

ensures that write data has reached an "ADR-safe zone" that the system
promises will be persistent even if there is a surprise power loss or
a CPU suffers from an error that isn't totally catastrophic (e.g., the
CPU getting disconnected from the SDRAM will always lose data on an
NVDIMM-N).

The ACPI NFIT Flush Hints provide a guarantee that data is safe even
in the case of a CPU error, but that feature is not present in all
systems for all types of persistent memory.

> All explanations I've found on the net had been along the lines of
> "we should not store a pointer to pmem data structure until the
> structure itself had been committed to pmem itself" and it looks
> like something that ought to be a job for barriers - after all,
> we don't want the pointer store to be observed by _anything_
> in the system until the earlier stores are visible, so what makes
> pmem different from e.g. another CPU or a PCI busmaster, or...

Newly written data becomes globally visible before it becomes ADR-safe.
This means software could act on the new data before a power loss, then
see the old data reappear after the power loss - not good.  Software
needs to understand that any data in the process of being written is
indeterminate until the persistence guarantee is met.  The BTT shows
one way that software can avoid that problem.
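
The gap between "visible" and "persistent" can be sketched with a toy
model in C.  Everything here is hypothetical and exists only for the
example: struct toy_pmem keeps one copy of the data as other agents
see it and one copy as it would survive power loss, toy_persist()
stands in for a flush plus fence reaching the ADR-safe zone, and
toy_power_loss() throws away whatever was never persisted.  This is
not how the BTT works; it only shows why ordering of visibility alone
is not enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PMEM_WORDS 4

enum { SLOT_PTR = 0, SLOT_DATA = 1 };

/* Toy model: "visible" is what CPUs and devices observe,
 * "durable" is what would survive a power loss. */
struct toy_pmem {
    long visible[PMEM_WORDS];
    long durable[PMEM_WORDS];
};

/* An ordinary store: immediately visible, not yet durable. */
static void toy_store(struct toy_pmem *p, size_t i, long v)
{
    p->visible[i] = v;
}

/* Stand-in for flush + fence (or an arbitrary cache eviction). */
static void toy_persist(struct toy_pmem *p, size_t i)
{
    p->durable[i] = p->visible[i];
}

/* Power loss: only the durable contents come back. */
static void toy_power_loss(struct toy_pmem *p)
{
    memcpy(p->visible, p->durable, sizeof(p->visible));
}

/* Stores are correctly ordered for visibility, nothing is persisted. */
static void publish_barrier_only(struct toy_pmem *p, long data)
{
    toy_store(p, SLOT_DATA, data);
    /* a memory barrier would go here; it orders visibility only */
    toy_store(p, SLOT_PTR, SLOT_DATA);
}

/* Data is made durable before the pointer to it is even stored. */
static void publish_persist_first(struct toy_pmem *p, long data)
{
    toy_store(p, SLOT_DATA, data);
    toy_persist(p, SLOT_DATA);
    toy_store(p, SLOT_PTR, SLOT_DATA);
    toy_persist(p, SLOT_PTR);
}
```

With publish_barrier_only(), a reader sees the new data right away,
but if the pointer's cache line happens to be written back first and
power is then lost, the pointer comes back referring to stale data.
With publish_persist_first(), a power loss at any step leaves either
the old pointer or the new pointer together with the new data.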

---
Robert Elliott, HPE Persistent Memory




[RFC] memcpy_nocache() and memcpy_writethrough()

2016-12-30 Thread Al Viro
On Thu, Dec 29, 2016 at 08:56:13PM -0800, Dan Williams wrote:

> > Um...  Then we do have a problem - nocache variant of uaccess primitives
> > does *not* guarantee that clwb is redundant.
> >
> > What about the requirements of e.g. tcp_sendmsg() with its use of
> > skb_add_data_nocache()?  What warranties do we need there?
> 
> Yes, we need to distinguish the existing "nocache" that tries to avoid
> unnecessary cache pollution and this new "must write through" semantic
> for writing to persistent memory. I suspect usages of
> skb_add_data_nocache() are ok since they are in the transmit path.
> Receiving directly into a buffer that is expected to be persisted
> immediately is where we would need to be careful, but that is already
> backstopped by dirty cacheline tracking. So as far as I can see, we
> should only need a new memcpy_writethrough() (?) for the pmem
> direct-i/o path at present.

OK...  Right now we have several places playing with nocache:
* dax_iomap_actor().  Writethrough warranties needed, nocache
side serves to reduce the cache impact *and* avoid the need for clwb
for writethrough.
* several memcpy_to_pmem() users - acpi_nfit_blk_single_io(),
nsio_rw_bytes(), write_pmem().  No clwb attempted; is it needed there?
* hfi1_copy_sge().  Cache pollution avoidance?  The source is
in the kernel, looks like memcpy_nocache() candidate.
* ntb_memcpy_tx().  Really fishy one - it's from kernel to iomem,
with nocache userland->kernel copying primitive abused on x86.  As soon
as e.g. powerpc or sparc grows ARCH_HAS_NOCACHE_UACCESS, we are in trouble
there.  What is it actually trying to achieve?  memcpy_toio() with
cache pollution avoidance?
* networking copy_from_iter_full_nocache() users - cache pollution
avoidance, AFAICS; no writethrough warranties sought.
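
To make the difference between "nocache" and "writethrough" concrete,
here is a small C sketch of how a copy might be split when only the
aligned middle is done with non-temporal stores, as discussed earlier
in the thread for misaligned destinations.  It is a simplified model
under stated assumptions, not a transcription of
arch/x86/lib/copy_user_64.S: struct copy_split, split_copy() and
needs_flush() are hypothetical names, and the alignment granularity
is a parameter rather than whatever the real routine uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte counts for the three parts of a copy in the assumed model. */
struct copy_split {
    size_t head;  /* cached stores until dst reaches alignment */
    size_t body;  /* aligned part, done with non-temporal stores */
    size_t tail;  /* cached stores for the unaligned remainder */
};

/* Split a copy of len bytes to address dst; align must be non-zero. */
static struct copy_split split_copy(uintptr_t dst, size_t len, size_t align)
{
    struct copy_split s = { 0, 0, 0 };
    size_t misalign = dst % align;

    s.head = misalign ? align - misalign : 0;
    if (s.head > len)
        s.head = len;               /* short copy never gets aligned */
    s.body = (len - s.head) / align * align;
    s.tail = len - s.head - s.body;
    return s;
}

/* A caller wanting writethrough would still have to flush the lines
 * touched by the cached head and tail stores. */
static int needs_flush(struct copy_split s)
{
    return s.head != 0 || s.tail != 0;
}
```

For cache pollution avoidance a few cached bytes at either end do not
matter; for a caller that wants writethrough, any non-zero head or
tail in this model means a flush is still needed for those lines.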

Why does pmem need writethrough warranties, anyway?  All explanations I've
found on the net had been along the lines of "we should not store a pointer
to pmem data structure until the structure itself had been committed to
pmem itself" and it looks like something that ought to be a job for barriers
- after all, we don't want the pointer store to be observed by _anything_
in the system until the earlier stores are visible, so what makes pmem
different from e.g. another CPU or a PCI busmaster, or...

I'm trying to figure out what would be the right API here; sure, we can
add separate memcpy_writethrough()/__copy_from_user_inatomic_writethrough()/
copy_from_iter_writethrough(), but I would like to understand what's going
on first.

