Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX

2017-01-17 Thread Kani, Toshimitsu
On Tue, 2017-01-17 at 16:59 +0100, Jan Kara wrote:
> On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
 :
> > - If I recall correctly, at one point Dave Chinner suggested that we
> > change DAX so that I/O would use cached stores instead of the
> > non-temporal stores that it currently uses.  We would then track
> > pages that were written to by DAX in the radix tree so that they
> > would be flushed later during fsync/msync.  Does this sound like a
> > win?  Also, assuming that we can find a solution for platforms where
> > the processor cache is part of the ADR safe zone (above topic) this
> > would be a clear improvement, moving us from using non-temporal
> > stores to faster cached stores with no downside.
> 
> I guess this needs measurements. But it is worth a try.

Brian Boylston did some measurements before:
http://oss.sgi.com/archives/xfs/2016-08/msg00239.html

I updated his test program to skip pmem_persist() for the cached copy
case.

 	dst = dstbase;
+#if 0
 	/* see note above */
 	if (mode == 'c')
 		pmem_persist(dst, dstsz);
+#endif
 }

Here are sample runs:

$ numactl -N0 time -p ./memcpyperf c /mnt/pmem0/file 100
INFO: dst 0x7f1d src 0x601200 dstsz 2756509696 cpysz 16384
real 3.28
user 3.27
sys 0.00

$ numactl -N0 time -p ./memcpyperf n /mnt/pmem0/file 100
INFO: dst 0x7f608000 src 0x601200 dstsz 2756509696 cpysz 16384
real 1.01
user 1.01
sys 0.00

$ numactl -N1 time -p ./memcpyperf c /mnt/pmem0/file 100
INFO: dst 0x7fe9 src 0x601200 dstsz 2756509696 cpysz 16384
real 4.06
user 4.06
sys 0.00

$ numactl -N1 time -p ./memcpyperf n /mnt/pmem0/file 100
INFO: dst 0x7f764000 src 0x601200 dstsz 2756509696 cpysz 16384
real 1.27
user 1.27
sys 0.00

In this simple test, using non-temporal copy is still faster than using
cached copy.

Thanks,
-Toshi



Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX

2017-01-17 Thread Dan Williams
On Tue, Jan 17, 2017 at 7:59 AM, Jan Kara  wrote:
> On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
>> - The DAX fsync/msync model was built for platforms that need to flush dirty
>>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>>   platforms, however, that are set up so that the processor caches are
>>   effectively part of the ADR safe zone.  This means that dirty data can be
>>   assumed to be durable even in the processor cache, obviating the need to
>>   manually flush the cache during fsync/msync.  These platforms still need to
>>   call fsync/msync to ensure that filesystem metadata updates are properly
>>   written to media.  Our first idea on how to properly support these platforms
>>   would be for DAX to be made aware that in some cases it doesn't need to keep
>>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>>   and we'd like a solution that covers them all.
>
> Well, we still need the radix tree entries for locking. And you still need
> to keep track of which file offsets are writeably mapped (which we
> currently implicitly keep via dirty radix tree entries) so that you can
> write-protect them if needed (during filesystem freezing, for reflink, ...).
> So I think the biggest gain by far will come from simply avoiding the
> writeback altogether in such situations.

I came to the same conclusion when taking a look at this. I have some
patches that simply make the writeback optional, but do not touch any
of the other dirty tracking infrastructure. I'll send them out shortly
after a bit more testing. This also dovetails with the request from
Linus to push pmem flushing routines into the driver and stop abusing
__copy_user_nocache.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Lsf-pc] [LSF/MM TOPIC] Future direction of DAX

2017-01-17 Thread Jan Kara
On Fri 13-01-17 17:20:08, Ross Zwisler wrote:
> - The DAX fsync/msync model was built for platforms that need to flush dirty
>   processor cache lines in order to make data durable on NVDIMMs.  There exist
>   platforms, however, that are set up so that the processor caches are
>   effectively part of the ADR safe zone.  This means that dirty data can be
>   assumed to be durable even in the processor cache, obviating the need to
>   manually flush the cache during fsync/msync.  These platforms still need to
>   call fsync/msync to ensure that filesystem metadata updates are properly
>   written to media.  Our first idea on how to properly support these platforms
>   would be for DAX to be made aware that in some cases it doesn't need to keep
>   metadata about dirty cache lines.  A similar issue exists for volatile uses
>   of DAX such as with BRD or with PMEM and the memmap command line parameter,
>   and we'd like a solution that covers them all.

Well, we still need the radix tree entries for locking. And you still need
to keep track of which file offsets are writeably mapped (which we
currently implicitly keep via dirty radix tree entries) so that you can
write-protect them if needed (during filesystem freezing, for reflink, ...).
So I think the biggest gain by far will come from simply avoiding the
writeback altogether in such situations.

> - If I recall correctly, at one point Dave Chinner suggested that we change
>   DAX so that I/O would use cached stores instead of the non-temporal stores
>   that it currently uses.  We would then track pages that were written to by
>   DAX in the radix tree so that they would be flushed later during
>   fsync/msync.  Does this sound like a win?  Also, assuming that we can find a
>   solution for platforms where the processor cache is part of the ADR safe
>   zone (above topic) this would be a clear improvement, moving us from using
>   non-temporal stores to faster cached stores with no downside.

I guess this needs measurements. But it is worth a try.

> - Jan suggested [2] that we could use the radix tree as a cache to service DAX
>   faults without needing to call into the filesystem.  Are there any issues
>   with this approach, and should we move forward with it as an optimization?

Yup, I'm still for it.

> - Whenever you mount a filesystem with DAX, it spits out a message that says
>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What criteria
>   needs to be met for DAX to no longer be considered experimental?

So from my POV I'd be OK with removing the warning, but the code is still
new, so there are clearly bugs lurking ;).

> - When we msync() a huge page, if the range is less than the entire huge page,
>   should we flush the entire huge page and mark it clean in the radix tree, or
>   should we only flush the requested range and leave the radix tree entry
>   dirty?

If you do partial msync(), then you have the problem that msync(0, x)
followed by msync(x, EOF) will not yield a clean file, which may surprise
somebody. So I'm slightly skeptical.
 
> - Should we enable 1 GiB huge pages in filesystem DAX?  Does anyone have any
>   specific customer requests for this or performance data suggesting it would
>   be a win?  If so, what work needs to be done to get 1 GiB sized and aligned
>   filesystem block allocations, to get the required enabling in the MM layer,
>   etc?

I'm not convinced it is worth it now. Maybe later...

Honza
-- 
Jan Kara 
SUSE Labs, CR