Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-22 Thread Dan Williams
On Thu, Aug 20, 2015 at 2:15 PM, Ross Zwisler
 wrote:
> On Thu, 2015-08-20 at 13:27 -0700, Dan Williams wrote:
> [...]
>> With regards to the fencing, since we already take care to flush
>> writes we don't need to fence at all for the flush, right?  All we
>> care about is that reads see valid data.
>
> We were careful to flush writes, but we could still have (now stale) data in
> the cache either due to a previous read or because of prefetching.  So we
> need to flush, and we need to fence to make sure that our flushing stays
> correctly ordered with respect to our reads.

Hmm, so clflushopt does not guarantee that a read in program order
after a clflushopt sees the invalidate?  It seems like we're not
getting any advantage from using clflushopt vs. clflush.

Let's go with this for now, but anything further should be guided by
performance numbers.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 13:27 -0700, Dan Williams wrote:
[...]
> With regards to the fencing, since we already take care to flush
> writes we don't need to fence at all for the flush, right?  All we
> care about is that reads see valid data.

We were careful to flush writes, but we could still have (now stale) data in
the cache either due to a previous read or because of prefetching.  So we
need to flush, and we need to fence to make sure that our flushing stays
correctly ordered with respect to our reads.

So, to recap, I think our options are:

a)
	loop through aperture segments, flushing each without any fencing
	mb() to order all the flushes
	loop through aperture segments, doing the reads

b)
	loop through aperture segments:
		flush the segment with mb() fencing
		read the new data for this segment from the DIMM

Option b) is what is already implemented and is cleaner, but option a) has the
benefit of having many fewer fences.
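
For illustration, the two strategies can be sketched with stub primitives
standing in for clflush_cache_range()/mb() (the nd_blk_*_stub names and the
flattened segment walk are invented for this sketch, not driver code); the
point is the fence count: one total for option a), one per segment for
option b):

```c
#include <stddef.h>

/* Counters so the two strategies can be compared. */
static int flush_calls;
static int fence_calls;

/* Stand-ins for clflush_cache_range() and mb(). */
static void nd_blk_flush_stub(size_t len) { (void)len; flush_calls++; }
static void nd_blk_mb_stub(void) { fence_calls++; }

/* Option a): flush every segment unfenced, one mb(), then do the reads. */
static int read_aperture_option_a(const size_t *seg, int nsegs)
{
	int i;

	for (i = 0; i < nsegs; i++)	/* pass 1: unfenced flushes */
		nd_blk_flush_stub(seg[i]);
	nd_blk_mb_stub();		/* single fence orders all the flushes */
	/* pass 2: read each segment (omitted) */
	return fence_calls;
}

/* Option b): flush each segment with its own fencing, then read it. */
static int read_aperture_option_b(const size_t *seg, int nsegs)
{
	int i;

	for (i = 0; i < nsegs; i++) {
		nd_blk_flush_stub(seg[i]);
		nd_blk_mb_stub();	/* per-segment fence */
		/* read this segment (omitted) */
	}
	return fence_calls;
}
```

With four 256-byte segments, option a) issues one fence and option b) issues
four, which is the trade-off against the extra segment traversal.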




Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Dan Williams
On Thu, Aug 20, 2015 at 12:00 PM, Ross Zwisler
 wrote:
> On Thu, 2015-08-20 at 11:26 -0700, Dan Williams wrote:
>> On Thu, Aug 20, 2015 at 11:17 AM, Ross Zwisler
>>  wrote:
>> > On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
>> [..]
>> > Ah, I think we're getting confused about the deinterleave part.
>> >
>> > The aperture is a set of contiguous addresses from the perspective of the
>> > DIMM, but when it's interleaved by the iMC it becomes a bunch of segments
>> > that are not contiguous in the virtual address space of the kernel.
>> >
>> > Meaning, say you have an 8k aperture that is interleaved with one other
>> > DIMM on a 256 byte granularity - this means that in SPA space you'll end
>> > up with a big mesh of 256 byte chunks, half of which belong to you and
>> > half of which don't:
>> >
>> > SPA space:
>> > +----------------------+
>> > | 256 bytes (ours)     |
>> > +----------------------+
>> > | 256 bytes (not ours) |
>> > +----------------------+
>> > | 256 bytes (ours)     |
>> > +----------------------+
>> > | 256 bytes (not ours) |
>> > +----------------------+
>> > ...
>> >
>> > To be able to flush the entire aperture unconditionally, we have to walk
>> > through all the segments that belong to us and flush each one of them.  I
>> > don't think we want to blindly flush the entire interleaved space because
>> > a) the other chunks are some other DIMMs' apertures, and b) we'd be
>> > flushing 2x or more (depending on how many DIMMs are interleaved) the
>> > space we need, one cache line at a time.
>>
>> I am indeed proposing flushing other DIMMs because those ranges are
>> invalidated by the aperture moving.  This is based on the assumption
>> that the flushing is cheaper in the case when no dirty-lines are
>> found.  The performance gains of doing piecemeal flushes seem not
>> worth the complexity.
>
> Why are the segments belonging to other apertures invalidated because we have
> moved our aperture?  They are all independent cache lines (segments must be a
> multiple of the cache line size), and the other apertures might be in the
> middle of some other I/O operation on some other CPU that we know nothing
> about.
>

Ah, OK, the other DIMM data in the aperture should never be consumed
and so does not need to be invalidated.  I'm straightened out on that
aspect.

With regards to the fencing, since we already take care to flush
writes we don't need to fence at all for the flush, right?  All we
care about is that reads see valid data.


Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 11:26 -0700, Dan Williams wrote:
> On Thu, Aug 20, 2015 at 11:17 AM, Ross Zwisler
>  wrote:
> > On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
> [..]
> > Ah, I think we're getting confused about the deinterleave part.
> >
> > The aperture is a set of contiguous addresses from the perspective of the
> > DIMM, but when it's interleaved by the iMC it becomes a bunch of segments
> > that are not contiguous in the virtual address space of the kernel.
> >
> > Meaning, say you have an 8k aperture that is interleaved with one other
> > DIMM on a 256 byte granularity - this means that in SPA space you'll end
> > up with a big mesh of 256 byte chunks, half of which belong to you and
> > half of which don't:
> >
> > SPA space:
> > +----------------------+
> > | 256 bytes (ours)     |
> > +----------------------+
> > | 256 bytes (not ours) |
> > +----------------------+
> > | 256 bytes (ours)     |
> > +----------------------+
> > | 256 bytes (not ours) |
> > +----------------------+
> > ...
> >
> > To be able to flush the entire aperture unconditionally, we have to walk
> > through all the segments that belong to us and flush each one of them.  I
> > don't think we want to blindly flush the entire interleaved space because
> > a) the other chunks are some other DIMMs' apertures, and b) we'd be
> > flushing 2x or more (depending on how many DIMMs are interleaved) the
> > space we need, one cache line at a time.
> 
> I am indeed proposing flushing other DIMMs because those ranges are
> invalidated by the aperture moving.  This is based on the assumption
> that the flushing is cheaper in the case when no dirty-lines are
> found.  The performance gains of doing piecemeal flushes seem not
> worth the complexity.

Why are the segments belonging to other apertures invalidated because we have
moved our aperture?  They are all independent cache lines (segments must be a
multiple of the cache line size), and the other apertures might be in the
middle of some other I/O operation on some other CPU that we know nothing
about.
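
The independence argument can be sketched as plain line arithmetic (a
hypothetical helper, assuming 64-byte x86 cache lines): when segment starts
and lengths are multiples of the cache line size, two different apertures'
segments can never share a cache line, so flushing our segments cannot
disturb a neighbor aperture's in-flight I/O:

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE	64	/* x86 cache line size, bytes */

/*
 * Does [a_start, a_start + len) share any cache line with
 * [b_start, b_start + len)?  With line-aligned starts and a
 * line-multiple len, disjoint byte ranges imply disjoint lines.
 */
static bool lines_overlap(uint64_t a_start, uint64_t b_start, uint64_t len)
{
	uint64_t a_first = a_start / CACHE_LINE;
	uint64_t a_last = (a_start + len - 1) / CACHE_LINE;
	uint64_t b_first = b_start / CACHE_LINE;
	uint64_t b_last = (b_start + len - 1) / CACHE_LINE;

	return a_first <= b_last && b_first <= a_last;
}
```

Two adjacent 256-byte segments share no lines; a misaligned 32-byte offset,
by contrast, would put two owners on one line, which is exactly what the
line-multiple segment rule rules out.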





Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Dan Williams
On Thu, Aug 20, 2015 at 11:17 AM, Ross Zwisler
 wrote:
> On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
[..]
> Ah, I think we're getting confused about the deinterleave part.
>
> The aperture is a set of contiguous addresses from the perspective of the
> DIMM, but when it's interleaved by the iMC it becomes a bunch of segments that
> are not contiguous in the virtual address space of the kernel.
>
> Meaning, say you have an 8k aperture that is interleaved with one other DIMM
> on a 256 byte granularity - this means that in SPA space you'll end up with a
> big mesh of 256 byte chunks, half of which belong to you and half of which don't:
>
> SPA space:
> +----------------------+
> | 256 bytes (ours)     |
> +----------------------+
> | 256 bytes (not ours) |
> +----------------------+
> | 256 bytes (ours)     |
> +----------------------+
> | 256 bytes (not ours) |
> +----------------------+
> ...
>
> To be able to flush the entire aperture unconditionally, we have to walk
> through all the segments that belong to us and flush each one of them.  I
> don't think we want to blindly flush the entire interleaved space because a)
> the other chunks are some other DIMMs' apertures, and b) we'd be flushing 2x
> or more (depending on how many DIMMs are interleaved) the space we need, one
> cache line at a time.

I am indeed proposing flushing other DIMMs because those ranges are
invalidated by the aperture moving.  This is based on the assumption
that the flushing is cheaper in the case when no dirty-lines are
found.  The performance gains of doing piecemeal flushes seem not
worth the complexity.


Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
> On Thu, Aug 20, 2015 at 9:44 AM, Ross Zwisler
>  wrote:
> > On Wed, 2015-08-19 at 16:06 -0700, Dan Williams wrote:
> >> On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
> >>  wrote:
> >> > Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
> >> >
> >> > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> >> >
> >> > This flag tells the ND BLK driver that it needs to flush the cache lines
> >> > associated with the aperture after the aperture is moved but before any
> >> > new data is read.  This ensures that any stale cache lines from the
> >> > previous contents of the aperture will be discarded from the processor
> >> > cache, and the new data will be read properly from the DIMM.  We know
> >> > that the cache lines are clean and will be discarded without any
> >> > writeback because either a) the previous aperture operation was a read,
> >> > and we never modified the contents of the aperture, or b) the previous
> >> > aperture operation was a write and we must have written back the dirtied
> >> > contents of the aperture to the DIMM before the I/O was completed.
> >> >
> >> > By supporting the "read flush" flag we can also change the ND BLK
> >> > aperture mapping from write-combining to write-back via memremap().
> >> >
> >> > In order to add support for the "read flush" flag I needed to add a
> >> > generic routine to invalidate cache lines, mmio_flush_range().  This is
> >> > protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
> >> > only supported on x86.
> >> >
> >> > Signed-off-by: Ross Zwisler 
> >> > Cc: Dan Williams 
> >> [..]
> >> > diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> >> > index 7c2638f..56fff01 100644
> >> > --- a/drivers/acpi/nfit.c
> >> > +++ b/drivers/acpi/nfit.c
> >> [..]
> >> >  static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> >> > @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct 
> >> > nfit_blk *nfit_blk,
> >> > }
> >> >
> >> > if (rw)
> >> > -   memcpy_to_pmem(mmio->aperture + offset,
> >> > +   memcpy_to_pmem(mmio->addr.aperture + offset,
> >> > iobuf + copied, c);
> >> > -   else
> >> > +   else {
> >> > +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
> >> > +   mmio_flush_range((void __force *)
> >> > +   mmio->addr.aperture + offset, c);
> >> > +
> >> > memcpy_from_pmem(iobuf + copied,
> >> > -   mmio->aperture + offset, c);
> >> > +   mmio->addr.aperture + offset, c);
> >> > +   }
> >>
> >> Why is the flush inside the "while (len)" loop?  I think it should be
> >> done immediately after the call to write_blk_ctl() since that is the
> >> point at which the aperture becomes invalidated, and not prior to each
> >> read within a given aperture position.  Taking it a bit further, we
> >> may be writing the same address into the control register as was there
> >> previously so we wouldn't need to flush in that case.
> >
> > The reason I was doing it in the "while (len)" loop is that you have to walk
> > through the interleave tables, reading each segment until you have read
> > 'len' bytes.  If we were to invalidate right after the write_blk_ctl(), we
> > would essentially have to re-create the "while (len)" loop, hop through all
> > the segments doing the invalidation, then run through the segments again
> > doing the actual I/O.
> >
> > It seemed a lot cleaner to just run through the segments once, invalidating
> > and reading each segment individually.
> 
> I agree it's cleaner if it is considering the de-interleave, but why
> consider interleave at all?  In other words just flush the entire
> aperture unconditionally.  Regardless of whether it reads all of the
> aperture it is indeed invalid because the aperture has moved.  I'm not
> seeing the benefit of being careful to let stale data stay in the
> cache a bit longer.

Ah, I think we're getting confused about the deinterleave part.

The aperture is a set of contiguous addresses from the perspective of the
DIMM, but when it's interleaved by the iMC it becomes a bunch of segments that
are not contiguous in the virtual address space of the kernel.

Meaning, say you have an 8k aperture that is interleaved with one other DIMM
on a 256 byte granularity - this means that in SPA space you'll end up with a
big mesh of 256 byte chunks, half of which belong to you and half of which don't:

SPA space:
+----------------------+
| 256 bytes (ours)     |
+----------------------+
| 256 bytes (not ours) |
+----------------------+
| 256 bytes (ours)     |
+----------------------+
| 256 bytes (not ours) |
+----------------------+
...

To be able to flush the entire aperture unconditionally, we have to walk
through all the segments that belong to us and flush each one of them.  I
don't think we want to blindly flush the entire interleaved space because a)
the other chunks are some other DIMMs' apertures, and b) we'd be flushing 2x
or more (depending on how many DIMMs are interleaved) the space we need, one
cache line at a time.
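
As a sketch of the 2-way, 256-byte interleave example above (parameters
hard-coded for illustration; real NFIT interleave tables are more general,
and the helper name is invented), a DIMM-relative aperture offset maps into
SPA space as:

```c
#include <stdint.h>

#define LINE_SZ		256ULL	/* interleave granularity, bytes */
#define NUM_WAYS	2ULL	/* DIMMs in the interleave set */

/*
 * Translate an offset within our aperture to an offset in SPA space:
 * our Nth 256-byte chunk lands in SPA slot N * NUM_WAYS + way, so
 * successive chunks of our aperture are separated by foreign chunks.
 */
static uint64_t aperture_to_spa(uint64_t ap_off, uint64_t way)
{
	uint64_t chunk = ap_off / LINE_SZ;	/* which 256B chunk of ours */
	uint64_t rem = ap_off % LINE_SZ;	/* offset within that chunk */

	return (chunk * NUM_WAYS + way) * LINE_SZ + rem;
}
```

This is why flushing the aperture means walking discontiguous segments:
aperture offsets 0 and 256 sit 512 bytes apart in SPA space, with another
DIMM's chunk in between.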

Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Dan Williams
On Thu, Aug 20, 2015 at 9:44 AM, Ross Zwisler
 wrote:
> On Wed, 2015-08-19 at 16:06 -0700, Dan Williams wrote:
>> On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
>>  wrote:
>> > Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
>> >
>> > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
>> >
>> > This flag tells the ND BLK driver that it needs to flush the cache lines
>> > associated with the aperture after the aperture is moved but before any
>> > new data is read.  This ensures that any stale cache lines from the
>> > previous contents of the aperture will be discarded from the processor
>> > cache, and the new data will be read properly from the DIMM.  We know
>> > that the cache lines are clean and will be discarded without any
>> > writeback because either a) the previous aperture operation was a read,
>> > and we never modified the contents of the aperture, or b) the previous
>> > aperture operation was a write and we must have written back the dirtied
>> > contents of the aperture to the DIMM before the I/O was completed.
>> >
>> > By supporting the "read flush" flag we can also change the ND BLK
>> > aperture mapping from write-combining to write-back via memremap().
>> >
>> > In order to add support for the "read flush" flag I needed to add a
>> > generic routine to invalidate cache lines, mmio_flush_range().  This is
>> > protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
>> > only supported on x86.
>> >
>> > Signed-off-by: Ross Zwisler 
>> > Cc: Dan Williams 
>> [..]
>> > diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
>> > index 7c2638f..56fff01 100644
>> > --- a/drivers/acpi/nfit.c
>> > +++ b/drivers/acpi/nfit.c
>> [..]
>> >  static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
>> > @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
>> > }
>> >
>> > if (rw)
>> > -   memcpy_to_pmem(mmio->aperture + offset,
>> > +   memcpy_to_pmem(mmio->addr.aperture + offset,
>> > iobuf + copied, c);
>> > -   else
>> > +   else {
>> > +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
>> > +   mmio_flush_range((void __force *)
>> > +   mmio->addr.aperture + offset, c);
>> > +
>> > memcpy_from_pmem(iobuf + copied,
>> > -   mmio->aperture + offset, c);
>> > +   mmio->addr.aperture + offset, c);
>> > +   }
>>
>> Why is the flush inside the "while (len)" loop?  I think it should be
>> done immediately after the call to write_blk_ctl() since that is the
>> point at which the aperture becomes invalidated, and not prior to each
>> read within a given aperture position.  Taking it a bit further, we
>> may be writing the same address into the control register as was there
>> previously so we wouldn't need to flush in that case.
>
> The reason I was doing it in the "while (len)" loop is that you have to walk
> through the interleave tables, reading each segment until you have read 'len'
> bytes.  If we were to invalidate right after the write_blk_ctl(), we would
> essentially have to re-create the "while (len)" loop, hop through all the
> segments doing the invalidation, then run through the segments again doing the
> actual I/O.
>
> It seemed a lot cleaner to just run through the segments once, invalidating
> and reading each segment individually.

I agree it's cleaner given the de-interleave, but why
consider interleave at all?  In other words, just flush the entire
aperture unconditionally.  Regardless of whether it reads all of the
aperture it is indeed invalid because the aperture has moved.  I'm not
seeing the benefit of being careful to let stale data stay in the
cache a bit longer.


Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Wed, 2015-08-19 at 16:06 -0700, Dan Williams wrote:
> On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
>  wrote:
> > Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
> >
> > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> >
> > This flag tells the ND BLK driver that it needs to flush the cache lines
> > associated with the aperture after the aperture is moved but before any
> > new data is read.  This ensures that any stale cache lines from the
> > previous contents of the aperture will be discarded from the processor
> > cache, and the new data will be read properly from the DIMM.  We know
> > that the cache lines are clean and will be discarded without any
> > writeback because either a) the previous aperture operation was a read,
> > and we never modified the contents of the aperture, or b) the previous
> > aperture operation was a write and we must have written back the dirtied
> > contents of the aperture to the DIMM before the I/O was completed.
> >
> > By supporting the "read flush" flag we can also change the ND BLK
> > aperture mapping from write-combining to write-back via memremap().
> >
> > In order to add support for the "read flush" flag I needed to add a
> > generic routine to invalidate cache lines, mmio_flush_range().  This is
> > protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
> > only supported on x86.
> >
> > Signed-off-by: Ross Zwisler 
> > Cc: Dan Williams 
> [..]
> > diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> > index 7c2638f..56fff01 100644
> > --- a/drivers/acpi/nfit.c
> > +++ b/drivers/acpi/nfit.c
> [..]
> >  static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> > @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> > }
> >
> > if (rw)
> > -   memcpy_to_pmem(mmio->aperture + offset,
> > +   memcpy_to_pmem(mmio->addr.aperture + offset,
> > iobuf + copied, c);
> > -   else
> > +   else {
> > +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
> > +   mmio_flush_range((void __force *)
> > +   mmio->addr.aperture + offset, c);
> > +
> > memcpy_from_pmem(iobuf + copied,
> > -   mmio->aperture + offset, c);
> > +   mmio->addr.aperture + offset, c);
> > +   }
> 
> Why is the flush inside the "while (len)" loop?  I think it should be
> done immediately after the call to write_blk_ctl() since that is the
> point at which the aperture becomes invalidated, and not prior to each
> read within a given aperture position.  Taking it a bit further, we
> may be writing the same address into the control register as was there
> previously so we wouldn't need to flush in that case.

The reason I was doing it in the "while (len)" loop is that you have to walk
through the interleave tables, reading each segment until you have read 'len'
bytes.  If we were to invalidate right after the write_blk_ctl(), we would
essentially have to re-create the "while (len)" loop, hop through all the
segments doing the invalidation, then run through the segments again doing the
actual I/O.

It seemed a lot cleaner to just run through the segments once, invalidating
and reading each segment individually.

The bad news about the current approach is that we end up doing a bunch of
extra mb() fencing, twice per segment via clflush_cache_range().

The other option would be to do the double pass, but on the first pass to just
do the flushing without fencing, then fence everything, then do the reads.

I don't have a good feel for how much overhead all this extra fencing adds
vs. the cost of traversing the segments twice.  The code is certainly simpler
the way it's implemented now.  If you feel that the extra fencing is too
expensive I'll implement it as a double pass.  Otherwise we may want to wait
for performance data to justify the change.

Regarding skipping the flush if the control register is unchanged - sure, that
seems like a good idea.
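
That control-register idea might look something like the following sketch
(the struct and helper names are invented for illustration; flush_count
stands in for the per-segment mmio_flush_range() walk): remember the last
block-window address written and skip the invalidation when the aperture
hasn't actually moved:

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-window state a driver could keep alongside the control register. */
struct blk_window_stub {
	uint64_t last_lba;	/* last address written to the BCW */
	bool last_lba_valid;	/* has last_lba been set yet? */
	int flush_count;	/* stand-in for flushing the aperture */
};

/*
 * Program the block control window; only invalidate the aperture's
 * cache lines when the aperture actually moves.
 */
static void write_blk_ctl_stub(struct blk_window_stub *w, uint64_t lba)
{
	if (w->last_lba_valid && w->last_lba == lba)
		return;			/* aperture unmoved: cache still valid */
	w->last_lba = lba;
	w->last_lba_valid = true;
	w->flush_count++;		/* aperture moved: flush before reads */
}
```

Repeated I/O to the same aperture position then pays for the flush once
instead of on every operation.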




Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 11:21 +0100, Will Deacon wrote:
> On Wed, Aug 19, 2015 at 11:48:04PM +0100, Ross Zwisler wrote:
> > Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
> > 
> > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > 
> > This flag tells the ND BLK driver that it needs to flush the cache lines
> > associated with the aperture after the aperture is moved but before any
> > new data is read.  This ensures that any stale cache lines from the
> > previous contents of the aperture will be discarded from the processor
> > cache, and the new data will be read properly from the DIMM.  We know
> > that the cache lines are clean and will be discarded without any
> > writeback because either a) the previous aperture operation was a read,
> > and we never modified the contents of the aperture, or b) the previous
> > aperture operation was a write and we must have written back the dirtied
> > contents of the aperture to the DIMM before the I/O was completed.
> 
> Is this operation expected to be implemented as a destructive invalidation
> (i.e. discarding any dirty lines from the cache) or also a writeback of any
> dirty lines as part of the invalidation?
> 
> If it's the former, we might want to give it a scarier name to ensure that
> it doesn't grow users outside of NVDIMM scenarios, since "flush" is used
> elsewhere for things like flush_dcache_page.

It is the latter - there will be a writeback of any dirty lines as part of the
invalidation.  The real thing I'm looking for is a forced invalidation
(including writeback, if needed), even on architectures that are cache
coherent.  This is useful when the device can change the data at that memory
location, but not as part of a normal DMA operation that would keep the cache
coherent.
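
A minimal sketch of such a forced-invalidation walk (the 64-byte line size
is the x86 value, and the stub names are invented; the real x86
mmio_flush_range() issues a clflush-family instruction per line and relies
on fencing elsewhere for ordering):

```c
#include <stdint.h>

#define X86_CLFLUSH_SIZE 64	/* cache line size flushed per instruction */

static int flushed_lines;

/* Stand-in for one clflush/clflushopt: writeback + invalidate a line. */
static void clflush_stub(uintptr_t line) { (void)line; flushed_lines++; }

/*
 * Walk every cache line touched by [addr, addr + size), rounding the
 * start down to a line boundary so partial first/last lines are covered.
 */
static void mmio_flush_range_stub(uintptr_t addr, uint64_t size)
{
	uintptr_t line = addr & ~((uintptr_t)X86_CLFLUSH_SIZE - 1);
	uintptr_t end = addr + size;

	for (; line < end; line += X86_CLFLUSH_SIZE)
		clflush_stub(line);
	/* a fence would follow, ordering the flushes before the reads */
}
```

Note the rounding: an 8-byte range straddling a line boundary still costs
two full line flushes.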




Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Will Deacon
On Wed, Aug 19, 2015 at 11:48:04PM +0100, Ross Zwisler wrote:
> Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
> 
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> 
> This flag tells the ND BLK driver that it needs to flush the cache lines
> associated with the aperture after the aperture is moved but before any
> new data is read.  This ensures that any stale cache lines from the
> previous contents of the aperture will be discarded from the processor
> cache, and the new data will be read properly from the DIMM.  We know
> that the cache lines are clean and will be discarded without any
> writeback because either a) the previous aperture operation was a read,
> and we never modified the contents of the aperture, or b) the previous
> aperture operation was a write and we must have written back the dirtied
> contents of the aperture to the DIMM before the I/O was completed.

Is this operation expected to be implemented as a destructive invalidation
(i.e. discarding any dirty lines from the cache) or also a writeback of any
dirty lines as part of the invalidation?

If it's the former, we might want to give it a scarier name to ensure that
it doesn't grow users outside of NVDIMM scenarios, since "flush" is used
elsewhere for things like flush_dcache_page.

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] nd_blk: add support for read flush DSM flag

2015-08-20 Thread Will Deacon
On Wed, Aug 19, 2015 at 11:48:04PM +0100, Ross Zwisler wrote:
 Add support for the read flush _DSM flag, as outlined in the DSM spec:
 
 http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
 
 This flag tells the ND BLK driver that it needs to flush the cache lines
 associated with the aperture after the aperture is moved but before any
 new data is read.  This ensures that any stale cache lines from the
 previous contents of the aperture will be discarded from the processor
 cache, and the new data will be read properly from the DIMM.  We know
 that the cache lines are clean and will be discarded without any
 writeback because either a) the previous aperture operation was a read,
 and we never modified the contents of the aperture, or b) the previous
 aperture operation was a write and we must have written back the dirtied
 contents of the aperture to the DIMM before the I/O was completed.

Is this operation expected to be implemented as a destructive invalidation
(i.e. discarding any dirty lines from the cache) or also a writeback of any
dirtylines as part of the invalidation?

If its the former, we might want to give it a scarier name to ensure that
it doesn't grow users outside of NVDIMM scenarios, since flush is used
elsewhere for things like flush_dcache_page.

Will
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] nd_blk: add support for read flush DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 11:26 -0700, Dan Williams wrote:
 On Thu, Aug 20, 2015 at 11:17 AM, Ross Zwisler
 ross.zwis...@linux.intel.com wrote:
  On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
 [..]
  Ah, I think we're getting confused about the deinterleave part.
 
  The aperture is a set of contiguous addresses from the perspective of the
  DIMM, but when it's interleaved by the iMC it becomes a bunch of segments 
  that
  are not contiguous in the virtual address space of the kernel.
 
  Meaning, say you have an 8k aperture that is interleaved with one other DIMM
  on a 256 byte granularity - this means that in SPA space you'll end up with 
  a
  big mesh of 256 byte chunks, half of which belong to you and half which 
  don't:
 
  SPA space:
  ++
  |256 bytes (ours)|
  ++
  |256 bytes (not ours)|
  ++
  |256 bytes (ours)|
  ++
  |256 bytes (not ours)|
  ++
  ...
 
  To be able to flush the entire aperture unconditionally, we have to walk
  through all the segments that belong to use and flush each one of them.  I
  don't think we want to blindly flush the entire interleaved space because a)
  the other chunks are some other DIMMs' apertures, and b) we'd be flushing 2x
  or more (depending on how many DIMMs are interleaved) the space we need, one
  cache line at a time.
 
 I am indeed proposing flushing other DIMMs because those ranges are
 invalidated by the aperture moving.  This is based on the assumption
 that the flushing is cheaper in the case when no dirty-lines are
 found.  The performance gains of doing piecemeal flushes seems not
 worth the complexity.

Why are the segments belonging to other apertures invalidated because we have
moved our aperture?  They are all independent cache lines (segments must be a
multiple of the cache line size), and the other apertures might be in the
middle of some other I/O operation on some other CPU that we know nothing
about.



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] nd_blk: add support for read flush DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
 On Thu, Aug 20, 2015 at 9:44 AM, Ross Zwisler
 ross.zwis...@linux.intel.com wrote:
  On Wed, 2015-08-19 at 16:06 -0700, Dan Williams wrote:
  On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
  ross.zwis...@linux.intel.com wrote:
   Add support for the read flush _DSM flag, as outlined in the DSM spec:
  
   http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
  
   This flag tells the ND BLK driver that it needs to flush the cache lines
   associated with the aperture after the aperture is moved but before any
   new data is read.  This ensures that any stale cache lines from the
   previous contents of the aperture will be discarded from the processor
   cache, and the new data will be read properly from the DIMM.  We know
   that the cache lines are clean and will be discarded without any
   writeback because either a) the previous aperture operation was a read,
   and we never modified the contents of the aperture, or b) the previous
   aperture operation was a write and we must have written back the dirtied
   contents of the aperture to the DIMM before the I/O was completed.
  
   By supporting the "read flush" flag we can also change the ND BLK
   aperture mapping from write-combining to write-back via memremap().
  
   In order to add support for the "read flush" flag I needed to add a
   generic routine to invalidate cache lines, mmio_flush_range().  This is
   protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
   only supported on x86.
  
   Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
   Cc: Dan Williams dan.j.willi...@intel.com
  [..]
   diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
   index 7c2638f..56fff01 100644
   --- a/drivers/acpi/nfit.c
   +++ b/drivers/acpi/nfit.c
  [..]
   static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
   @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
   }
  
   if (rw)
   -   memcpy_to_pmem(mmio->aperture + offset,
   +   memcpy_to_pmem(mmio->addr.aperture + offset,
   iobuf + copied, c);
   -   else
   +   else {
   +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
   +   mmio_flush_range((void __force *)
   +   mmio->addr.aperture + offset, c);
   +
   memcpy_from_pmem(iobuf + copied,
   -   mmio->aperture + offset, c);
   +   mmio->addr.aperture + offset, c);
   +   }
 
  Why is the flush inside the "while (len)" loop?  I think it should be
  done immediately after the call to write_blk_ctl() since that is the
  point at which the aperture becomes invalidated, and not prior to each
  read within a given aperture position.  Taking it a bit further, we
  may be writing the same address into the control register as was there
  previously so we wouldn't need to flush in that case.
 
  The reason I was doing it in the "while (len)" loop is that you have to walk
  through the interleave tables, reading each segment until you have read 'len'
  bytes.  If we were to invalidate right after the write_blk_ctl(), we would
  essentially have to re-create the "while (len)" loop, hop through all the
  segments doing the invalidation, then run through the segments again doing the
  actual I/O.
 
  It seemed a lot cleaner to just run through the segments once, invalidating
  and reading each segment individually.
 
 I agree it's cleaner if it is considering the de-interleave, but why
 consider interleave at all?  In other words just flush the entire
 aperture unconditionally.  Regardless of whether it reads all of the
 aperture it is indeed invalid because the aperture has moved.  I'm not
 seeing the benefit of being careful to let stale data stay in the
 cache a bit longer.

Ah, I think we're getting confused about the deinterleave part.

The aperture is a set of contiguous addresses from the perspective of the
DIMM, but when it's interleaved by the iMC it becomes a bunch of segments that
are not contiguous in the virtual address space of the kernel.

Meaning, say you have an 8k aperture that is interleaved with one other DIMM
on a 256 byte granularity - this means that in SPA space you'll end up with a
big mesh of 256 byte chunks, half of which belong to you and half which don't:

SPA space:
+----------------------+
| 256 bytes (ours)     |
+----------------------+
| 256 bytes (not ours) |
+----------------------+
| 256 bytes (ours)     |
+----------------------+
| 256 bytes (not ours) |
+----------------------+
...

To be able to flush the entire aperture unconditionally, we have to walk
through all the segments that belong to us and flush each one of them.  I
don't think we want to blindly flush the entire interleaved space

Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Dan Williams
On Thu, Aug 20, 2015 at 11:17 AM, Ross Zwisler
ross.zwis...@linux.intel.com wrote:
 On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
[..]
 Ah, I think we're getting confused about the deinterleave part.

 The aperture is a set of contiguous addresses from the perspective of the
 DIMM, but when it's interleaved by the iMC it becomes a bunch of segments that
 are not contiguous in the virtual address space of the kernel.

 Meaning, say you have an 8k aperture that is interleaved with one other DIMM
 on a 256 byte granularity - this means that in SPA space you'll end up with a
 big mesh of 256 byte chunks, half of which belong to you and half which don't:

 SPA space:
 +----------------------+
 | 256 bytes (ours)     |
 +----------------------+
 | 256 bytes (not ours) |
 +----------------------+
 | 256 bytes (ours)     |
 +----------------------+
 | 256 bytes (not ours) |
 +----------------------+
 ...

 To be able to flush the entire aperture unconditionally, we have to walk
 through all the segments that belong to us and flush each one of them.  I
 don't think we want to blindly flush the entire interleaved space because a)
 the other chunks are some other DIMMs' apertures, and b) we'd be flushing 2x
 or more (depending on how many DIMMs are interleaved) the space we need, one
 cache line at a time.

I am indeed proposing flushing other DIMMs because those ranges are
invalidated by the aperture moving.  This is based on the assumption
that the flushing is cheaper in the case when no dirty-lines are
found.  The performance gains of doing piecemeal flushes seem not
worth the complexity.


Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Dan Williams
On Thu, Aug 20, 2015 at 9:44 AM, Ross Zwisler
ross.zwis...@linux.intel.com wrote:
 On Wed, 2015-08-19 at 16:06 -0700, Dan Williams wrote:
 On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
 ross.zwis...@linux.intel.com wrote:
  Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
 
  http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
 
  This flag tells the ND BLK driver that it needs to flush the cache lines
  associated with the aperture after the aperture is moved but before any
  new data is read.  This ensures that any stale cache lines from the
  previous contents of the aperture will be discarded from the processor
  cache, and the new data will be read properly from the DIMM.  We know
  that the cache lines are clean and will be discarded without any
  writeback because either a) the previous aperture operation was a read,
  and we never modified the contents of the aperture, or b) the previous
  aperture operation was a write and we must have written back the dirtied
  contents of the aperture to the DIMM before the I/O was completed.
 
  By supporting the "read flush" flag we can also change the ND BLK
  aperture mapping from write-combining to write-back via memremap().
 
  In order to add support for the "read flush" flag I needed to add a
  generic routine to invalidate cache lines, mmio_flush_range().  This is
  protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
  only supported on x86.
 
  Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
  Cc: Dan Williams dan.j.willi...@intel.com
 [..]
  diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
  index 7c2638f..56fff01 100644
  --- a/drivers/acpi/nfit.c
  +++ b/drivers/acpi/nfit.c
 [..]
  static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
  @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
  }
 
  if (rw)
  -   memcpy_to_pmem(mmio->aperture + offset,
  +   memcpy_to_pmem(mmio->addr.aperture + offset,
  iobuf + copied, c);
  -   else
  +   else {
  +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
  +   mmio_flush_range((void __force *)
  +   mmio->addr.aperture + offset, c);
  +
  memcpy_from_pmem(iobuf + copied,
  -   mmio->aperture + offset, c);
  +   mmio->addr.aperture + offset, c);
  +   }

 Why is the flush inside the "while (len)" loop?  I think it should be
 done immediately after the call to write_blk_ctl() since that is the
 point at which the aperture becomes invalidated, and not prior to each
 read within a given aperture position.  Taking it a bit further, we
 may be writing the same address into the control register as was there
 previously so we wouldn't need to flush in that case.

 The reason I was doing it in the "while (len)" loop is that you have to walk
 through the interleave tables, reading each segment until you have read 'len'
 bytes.  If we were to invalidate right after the write_blk_ctl(), we would
 essentially have to re-create the "while (len)" loop, hop through all the
 segments doing the invalidation, then run through the segments again doing the
 actual I/O.

 It seemed a lot cleaner to just run through the segments once, invalidating
 and reading each segment individually.

I agree it's cleaner if it is considering the de-interleave, but why
consider interleave at all?  In other words just flush the entire
aperture unconditionally.  Regardless of whether it reads all of the
aperture it is indeed invalid because the aperture has moved.  I'm not
seeing the benefit of being careful to let stale data stay in the
cache a bit longer.


Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 13:27 -0700, Dan Williams wrote:
[...]
 With regards to the fencing, since we already take care to flush
 writes we don't need to fence at all for the flush, right?  All we
 care about is that reads see valid data.

We were careful to flush writes, but we could still have (now stale) data in
the cache either due to a previous read or because of prefetching.  So we
need to flush, and we need to fence to make sure that our flushing stays
correctly ordered with respect to our reads.

So, to recap, I think our options are:

a)
loop through aperture segments, flushing each without any fencing
mb() to order all the flushes
loop through aperture segments, doing the reads

b)
loop through aperture segments
flushing segment with mb() fencing
read the new data for this segment from the DIMM

Option b) is what is already implemented and is cleaner, but option a) has the
benefit of having many fewer fences.




Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Dan Williams
On Thu, Aug 20, 2015 at 12:00 PM, Ross Zwisler
ross.zwis...@linux.intel.com wrote:
 On Thu, 2015-08-20 at 11:26 -0700, Dan Williams wrote:
 On Thu, Aug 20, 2015 at 11:17 AM, Ross Zwisler
 ross.zwis...@linux.intel.com wrote:
  On Thu, 2015-08-20 at 10:59 -0700, Dan Williams wrote:
 [..]
  Ah, I think we're getting confused about the deinterleave part.
 
  The aperture is a set of contiguous addresses from the perspective of the
  DIMM, but when it's interleaved by the iMC it becomes a bunch of segments
  that are not contiguous in the virtual address space of the kernel.
 
  Meaning, say you have an 8k aperture that is interleaved with one other
  DIMM on a 256 byte granularity - this means that in SPA space you'll end up
  with a big mesh of 256 byte chunks, half of which belong to you and half
  which don't:
 
  SPA space:
  +----------------------+
  | 256 bytes (ours)     |
  +----------------------+
  | 256 bytes (not ours) |
  +----------------------+
  | 256 bytes (ours)     |
  +----------------------+
  | 256 bytes (not ours) |
  +----------------------+
  ...
 
  To be able to flush the entire aperture unconditionally, we have to walk
  through all the segments that belong to us and flush each one of them.  I
  don't think we want to blindly flush the entire interleaved space because
  a) the other chunks are some other DIMMs' apertures, and b) we'd be
  flushing 2x or more (depending on how many DIMMs are interleaved) the
  space we need, one cache line at a time.

 I am indeed proposing flushing other DIMMs because those ranges are
 invalidated by the aperture moving.  This is based on the assumption
 that the flushing is cheaper in the case when no dirty-lines are
 found.  The performance gains of doing piecemeal flushes seem not
 worth the complexity.

 Why are the segments belonging to other apertures invalidated because we have
 moved our aperture?  They are all independent cache lines (segments must be a
 multiple of the cache line size), and the other apertures might be in the
 middle of some other I/O operation on some other CPU that we know nothing
 about.


Ah, ok the other DIMM data in the aperture should never be consumed
and does not need to be invalidated.  I'm straightened out on that
aspect.

With regards to the fencing, since we already take care to flush
writes we don't need to fence at all for the flush, right?  All we
care about is that reads see valid data.


Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Wed, 2015-08-19 at 16:06 -0700, Dan Williams wrote:
 On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
 ross.zwis...@linux.intel.com wrote:
  Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
 
  http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
 
  This flag tells the ND BLK driver that it needs to flush the cache lines
  associated with the aperture after the aperture is moved but before any
  new data is read.  This ensures that any stale cache lines from the
  previous contents of the aperture will be discarded from the processor
  cache, and the new data will be read properly from the DIMM.  We know
  that the cache lines are clean and will be discarded without any
  writeback because either a) the previous aperture operation was a read,
  and we never modified the contents of the aperture, or b) the previous
  aperture operation was a write and we must have written back the dirtied
  contents of the aperture to the DIMM before the I/O was completed.
 
  By supporting the "read flush" flag we can also change the ND BLK
  aperture mapping from write-combining to write-back via memremap().
 
  In order to add support for the "read flush" flag I needed to add a
  generic routine to invalidate cache lines, mmio_flush_range().  This is
  protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
  only supported on x86.
 
  Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
  Cc: Dan Williams dan.j.willi...@intel.com
 [..]
  diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
  index 7c2638f..56fff01 100644
  --- a/drivers/acpi/nfit.c
  +++ b/drivers/acpi/nfit.c
 [..]
  static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
  @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
  }
 
  if (rw)
  -   memcpy_to_pmem(mmio->aperture + offset,
  +   memcpy_to_pmem(mmio->addr.aperture + offset,
  iobuf + copied, c);
  -   else
  +   else {
  +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
  +   mmio_flush_range((void __force *)
  +   mmio->addr.aperture + offset, c);
  +
  memcpy_from_pmem(iobuf + copied,
  -   mmio->aperture + offset, c);
  +   mmio->addr.aperture + offset, c);
  +   }
 
 Why is the flush inside the "while (len)" loop?  I think it should be
 done immediately after the call to write_blk_ctl() since that is the
 point at which the aperture becomes invalidated, and not prior to each
 read within a given aperture position.  Taking it a bit further, we
 may be writing the same address into the control register as was there
 previously so we wouldn't need to flush in that case.

The reason I was doing it in the "while (len)" loop is that you have to walk
through the interleave tables, reading each segment until you have read 'len'
bytes.  If we were to invalidate right after the write_blk_ctl(), we would
essentially have to re-create the "while (len)" loop, hop through all the
segments doing the invalidation, then run through the segments again doing the
actual I/O.

It seemed a lot cleaner to just run through the segments once, invalidating
and reading each segment individually.

The bad news about the current approach is that we end up doing a bunch of
extra mb() fencing, twice per segment via clflush_cache_range().

The other option would be to do the double pass, but on the first pass to just
do the flushing without fencing, then fence everything, then do the reads.

I don't have a good feel for how much overhead all this extra fencing will be
vs the cost of traversing the segments twice.  The code is certainly simpler
with the way it's implemented now.  If you feel that the extra fencing is too
expensive I'll implement it as a double-pass.  Otherwise we may want to wait
for performance data to justify the change.

Regarding skipping the flush if the control register is unchanged - sure, that
seems like a good idea.




Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-20 Thread Ross Zwisler
On Thu, 2015-08-20 at 11:21 +0100, Will Deacon wrote:
 On Wed, Aug 19, 2015 at 11:48:04PM +0100, Ross Zwisler wrote:
  Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
  
  http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
  
  This flag tells the ND BLK driver that it needs to flush the cache lines
  associated with the aperture after the aperture is moved but before any
  new data is read.  This ensures that any stale cache lines from the
  previous contents of the aperture will be discarded from the processor
  cache, and the new data will be read properly from the DIMM.  We know
  that the cache lines are clean and will be discarded without any
  writeback because either a) the previous aperture operation was a read,
  and we never modified the contents of the aperture, or b) the previous
  aperture operation was a write and we must have written back the dirtied
  contents of the aperture to the DIMM before the I/O was completed.
 
 Is this operation expected to be implemented as a destructive invalidation
 (i.e. discarding any dirty lines from the cache) or also a writeback of any
 dirty lines as part of the invalidation?
 
 If it's the former, we might want to give it a scarier name to ensure that
 it doesn't grow users outside of NVDIMM scenarios, since "flush" is used
 elsewhere for things like flush_dcache_page.

It is the latter - there will be a writeback of any dirty lines as part of the
invalidation.  The real thing I'm looking for is a forced invalidation
(including writeback, if needed), even on architectures that are cache
coherent.  This is useful when the device can change the data at that memory
location, but not as part of a normal DMA operation that would keep the cache
coherent.




Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-19 Thread Dan Williams
On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
 wrote:
> Add support for the "read flush" _DSM flag, as outlined in the DSM spec:
>
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
>
> This flag tells the ND BLK driver that it needs to flush the cache lines
> associated with the aperture after the aperture is moved but before any
> new data is read.  This ensures that any stale cache lines from the
> previous contents of the aperture will be discarded from the processor
> cache, and the new data will be read properly from the DIMM.  We know
> that the cache lines are clean and will be discarded without any
> writeback because either a) the previous aperture operation was a read,
> and we never modified the contents of the aperture, or b) the previous
> aperture operation was a write and we must have written back the dirtied
> contents of the aperture to the DIMM before the I/O was completed.
>
> By supporting the "read flush" flag we can also change the ND BLK
> aperture mapping from write-combining to write-back via memremap().
>
> In order to add support for the "read flush" flag I needed to add a
> generic routine to invalidate cache lines, mmio_flush_range().  This is
> protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
> only supported on x86.
>
> Signed-off-by: Ross Zwisler 
> Cc: Dan Williams 
[..]
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index 7c2638f..56fff01 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
[..]
>  static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
> @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 
> *nfit_blk,
> }
>
> if (rw)
> -   memcpy_to_pmem(mmio->aperture + offset,
> +   memcpy_to_pmem(mmio->addr.aperture + offset,
> iobuf + copied, c);
> -   else
> +   else {
> +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
> +   mmio_flush_range((void __force *)
> +   mmio->addr.aperture + offset, c);
> +
> memcpy_from_pmem(iobuf + copied,
> -   mmio->aperture + offset, c);
> +   mmio->addr.aperture + offset, c);
> +   }

Why is the flush inside the "while (len)" loop?  I think it should be
done immediately after the call to write_blk_ctl() since that is the
point at which the aperture becomes invalidated, and not prior to each
read within a given aperture position.  Taking it a bit further, we
may be writing the same address into the control register as was there
previously so we wouldn't need to flush in that case.


[PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-19 Thread Ross Zwisler
Add support for the "read flush" _DSM flag, as outlined in the DSM spec:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag tells the ND BLK driver that it needs to flush the cache lines
associated with the aperture after the aperture is moved but before any
new data is read.  This ensures that any stale cache lines from the
previous contents of the aperture will be discarded from the processor
cache, and the new data will be read properly from the DIMM.  We know
that the cache lines are clean and will be discarded without any
writeback because either a) the previous aperture operation was a read,
and we never modified the contents of the aperture, or b) the previous
aperture operation was a write and we must have written back the dirtied
contents of the aperture to the DIMM before the I/O was completed.

By supporting the "read flush" flag we can also change the ND BLK
aperture mapping from write-combining to write-back via memremap().

In order to add support for the "read flush" flag I needed to add a
generic routine to invalidate cache lines, mmio_flush_range().  This is
protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
only supported on x86.

Signed-off-by: Ross Zwisler 
Cc: Dan Williams 
---
 arch/x86/Kconfig  |  1 +
 arch/x86/include/asm/cacheflush.h |  2 ++
 arch/x86/include/asm/io.h |  2 --
 arch/x86/include/asm/pmem.h   |  2 ++
 drivers/acpi/Kconfig  |  1 +
 drivers/acpi/nfit.c   | 55 ++-
 drivers/acpi/nfit.h   | 16 
 lib/Kconfig   |  3 +++
 tools/testing/nvdimm/Kbuild   |  2 ++
 tools/testing/nvdimm/test/iomap.c | 30 +++--
 tools/testing/nvdimm/test/nfit.c  | 10 ---
 11 files changed, 88 insertions(+), 36 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 76c6115..03ab612 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -28,6 +28,7 @@ config X86
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PMEM_API
+   select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/include/asm/cacheflush.h 
b/arch/x86/include/asm/cacheflush.h
index 471418a..e63aa38 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -89,6 +89,8 @@ int set_pages_rw(struct page *page, int numpages);
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+#define mmio_flush_range(addr, size) clflush_cache_range(addr, size)
+
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
 extern const int rodata_test_data;
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d241fbd..83ec9b1 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -248,8 +248,6 @@ static inline void flush_write_buffers(void)
 #endif
 }
 
-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-
 #endif /* __KERNEL__ */
 
 extern void native_io_delay(void);
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index a3a0df6..bb026c5 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -18,6 +18,8 @@
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
 
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 /**
  * arch_memcpy_to_pmem - copy data to persistent memory
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 114cf48..4baeb85 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -410,6 +410,7 @@ config ACPI_NFIT
tristate "ACPI NVDIMM Firmware Interface Table (NFIT)"
depends on PHYS_ADDR_T_64BIT
depends on BLK_DEV
+   depends on ARCH_HAS_MMIO_FLUSH
select LIBNVDIMM
help
  Infrastructure to probe ACPI 6 compliant platforms for
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 7c2638f..56fff01 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1017,7 +1017,7 @@ static u64 read_blk_stat(struct nfit_blk *nfit_blk, 
unsigned int bw)
if (mmio->num_lines)
offset = to_interleave_offset(offset, mmio);
 
-   return readq(mmio->base + offset);
+   return readq(mmio->addr.base + offset);
 }
 
 static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
@@ -1042,11 +1042,11 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, 
unsigned int bw,
if (mmio->num_lines)
offset = to_interleave_offset(offset, mmio);
 
-   writeq(cmd, mmio->base + offset);
+   writeq(cmd, mmio->addr.base + offset);
wmb_blk(nfit_blk);
 
if (nfit_blk->dimm_flags & ND_BLK_DCR_LATCH)
-   readq(mmio->base + offset);
+   readq(mmio->addr.base + offset);
 }
 
 static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
@@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk 

Re: [PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-19 Thread Dan Williams
On Wed, Aug 19, 2015 at 3:48 PM, Ross Zwisler
ross.zwis...@linux.intel.com wrote:
 Add support for the "read flush" _DSM flag, as outlined in the DSM spec:

 http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

 This flag tells the ND BLK driver that it needs to flush the cache lines
 associated with the aperture after the aperture is moved but before any
 new data is read.  This ensures that any stale cache lines from the
 previous contents of the aperture will be discarded from the processor
 cache, and the new data will be read properly from the DIMM.  We know
 that the cache lines are clean and will be discarded without any
 writeback because either a) the previous aperture operation was a read,
 and we never modified the contents of the aperture, or b) the previous
 aperture operation was a write and we must have written back the dirtied
 contents of the aperture to the DIMM before the I/O was completed.

 By supporting the "read flush" flag we can also change the ND BLK
 aperture mapping from write-combining to write-back via memremap().

 In order to add support for the "read flush" flag I needed to add a
 generic routine to invalidate cache lines, mmio_flush_range().  This is
 protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
 only supported on x86.

 Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
 Cc: Dan Williams dan.j.willi...@intel.com
[..]
 diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
 index 7c2638f..56fff01 100644
 --- a/drivers/acpi/nfit.c
 +++ b/drivers/acpi/nfit.c
[..]
 static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 @@ -1078,11 +1078,16 @@ static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
 }

 if (rw)
 -   memcpy_to_pmem(mmio->aperture + offset,
 +   memcpy_to_pmem(mmio->addr.aperture + offset,
 iobuf + copied, c);
 -   else
 +   else {
 +   if (nfit_blk->dimm_flags & ND_BLK_READ_FLUSH)
 +   mmio_flush_range((void __force *)
 +   mmio->addr.aperture + offset, c);
 +
 memcpy_from_pmem(iobuf + copied,
 -   mmio->aperture + offset, c);
 +   mmio->addr.aperture + offset, c);
 +   }

Why is the flush inside the "while (len)" loop?  I think it should be
done immediately after the call to write_blk_ctl() since that is the
point at which the aperture becomes invalidated, and not prior to each
read within a given aperture position.  Taking it a bit further, we
may be writing the same address into the control register as was there
previously so we wouldn't need to flush in that case.


[PATCH v2] nd_blk: add support for "read flush" DSM flag

2015-08-19 Thread Ross Zwisler
Add support for the "read flush" _DSM flag, as outlined in the DSM spec:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag tells the ND BLK driver that it needs to flush the cache lines
associated with the aperture after the aperture is moved but before any
new data is read.  This ensures that any stale cache lines from the
previous contents of the aperture will be discarded from the processor
cache, and the new data will be read properly from the DIMM.  We know
that the cache lines are clean and will be discarded without any
writeback because either a) the previous aperture operation was a read,
and we never modified the contents of the aperture, or b) the previous
aperture operation was a write and we must have written back the dirtied
contents of the aperture to the DIMM before the I/O was completed.

By supporting the "read flush" flag we can also change the ND BLK
aperture mapping from write-combining to write-back via memremap().

In order to add support for the "read flush" flag I needed to add a
generic routine to invalidate cache lines, mmio_flush_range().  This is
protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
only supported on x86.

Signed-off-by: Ross Zwisler ross.zwis...@linux.intel.com
Cc: Dan Williams dan.j.willi...@intel.com
---
 arch/x86/Kconfig  |  1 +
 arch/x86/include/asm/cacheflush.h |  2 ++
 arch/x86/include/asm/io.h |  2 --
 arch/x86/include/asm/pmem.h   |  2 ++
 drivers/acpi/Kconfig  |  1 +
 drivers/acpi/nfit.c   | 55 ++-
 drivers/acpi/nfit.h   | 16 
 lib/Kconfig   |  3 +++
 tools/testing/nvdimm/Kbuild   |  2 ++
 tools/testing/nvdimm/test/iomap.c | 30 +++--
 tools/testing/nvdimm/test/nfit.c  | 10 ---
 11 files changed, 88 insertions(+), 36 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 76c6115..03ab612 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -28,6 +28,7 @@ config X86
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PMEM_API
+   select ARCH_HAS_MMIO_FLUSH
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index 471418a..e63aa38 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -89,6 +89,8 @@ int set_pages_rw(struct page *page, int numpages);
 
 void clflush_cache_range(void *addr, unsigned int size);
 
+#define mmio_flush_range(addr, size) clflush_cache_range(addr, size)
+
 #ifdef CONFIG_DEBUG_RODATA
 void mark_rodata_ro(void);
 extern const int rodata_test_data;
diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index d241fbd..83ec9b1 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -248,8 +248,6 @@ static inline void flush_write_buffers(void)
 #endif
 }
 
-#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
-
 #endif /* __KERNEL__ */
 
 extern void native_io_delay(void);
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index a3a0df6..bb026c5 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -18,6 +18,8 @@
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
 
+#define ARCH_MEMREMAP_PMEM MEMREMAP_WB
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 /**
  * arch_memcpy_to_pmem - copy data to persistent memory
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 114cf48..4baeb85 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -410,6 +410,7 @@ config ACPI_NFIT
	tristate "ACPI NVDIMM Firmware Interface Table (NFIT)"
depends on PHYS_ADDR_T_64BIT
depends on BLK_DEV
+   depends on ARCH_HAS_MMIO_FLUSH
select LIBNVDIMM
help
  Infrastructure to probe ACPI 6 compliant platforms for
diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 7c2638f..56fff01 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1017,7 +1017,7 @@ static u64 read_blk_stat(struct nfit_blk *nfit_blk, unsigned int bw)
 	if (mmio->num_lines)
 		offset = to_interleave_offset(offset, mmio);
 
-	return readq(mmio->base + offset);
+	return readq(mmio->addr.base + offset);
 }
 
 static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
@@ -1042,11 +1042,11 @@ static void write_blk_ctl(struct nfit_blk *nfit_blk, unsigned int bw,
 	if (mmio->num_lines)
 		offset = to_interleave_offset(offset, mmio);
 
-	writeq(cmd, mmio->base + offset);
+	writeq(cmd, mmio->addr.base + offset);
 	wmb_blk(nfit_blk);
 
 	if (nfit_blk->dimm_flags & ND_BLK_DCR_LATCH)
-		readq(mmio->base + offset);
+		readq(mmio->addr.base + offset);
 }
 
 static int acpi_nfit_blk_single_io(struct nfit_blk *nfit_blk,
@@