Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-20 Thread Andreas Dilger via lustre-discuss
Ake,
in this particular case I can answer your question in detail.

Before SFAOS 12.1 (IIRC), the /sys/block/*/queue/rotational setting was set from 
userspace at mount time via a udev script, and Lustre's detection of 
"rotational=0" could be racy.  Newer versions of SFAOS (12.1+) set the 
rotational state in the SCSI VPD page, and this is detected directly by the 
kernel.
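
Something like the following shows both what the kernel currently reports and, 
if sg3_utils is installed, what the device itself advertises in the Block Device 
Characteristics VPD page (/dev/sde here is just the device from your output below):
===
# what the block layer reports (0 = non-rotational/flash)
cat /sys/block/sde/queue/rotational
lsblk -d -o NAME,ROTA /dev/sde

# what the device advertises; a "Nominal rotation rate" of 1 indicates
# a non-rotating (solid-state) medium
sg_vpd --page=bdc /dev/sde
===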

For EXAScaler systems that may be running older SFAOS releases, a patch was 
made (included in 2.12.6-ddn72/EXA5.2.5) that occasionally revalidates the 
rotational device state in case it has been modified after mount time, and 
uses that to update the read_cache_enable and writethrough_cache_enable 
tunables *if they have not been explicitly set*.

Until you update to a newer EXA and/or SFAOS, you can explicitly tune 
osd-ldiskfs.*.read_cache_enable=0 and ...writethrough_cache_enable=0, using the 
wildcard "*" if all of the OSTs/MDTs are flash-based.  If you have a hybrid 
NVMe/HDD system, you can instead explicitly select the subset of OST/MDT devices 
on which to disable the caches.
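
A sketch of what that might look like (the "stor10-OST*" pattern is taken from 
your output below; adjust to your own target names):
===
# all ldiskfs targets on this server (run on each OSS/MDS, one-shot):
lctl set_param osd-ldiskfs.*.read_cache_enable=0 \
               osd-ldiskfs.*.writethrough_cache_enable=0

# or only the flash-backed OSTs on a hybrid system:
lctl set_param osd-ldiskfs.stor10-OST*.read_cache_enable=0 \
               osd-ldiskfs.stor10-OST*.writethrough_cache_enable=0

# to make the setting persistent across remounts, run on the MGS:
lctl set_param -P osd-ldiskfs.*.read_cache_enable=0 \
                  osd-ldiskfs.*.writethrough_cache_enable=0
===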

Cheers, Andreas

On May 20, 2022, at 02:49, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:
On 5/20/22 09:53, Andreas Dilger via lustre-discuss wrote:
To elaborate a bit on Patrick's answer, there is no mechanism to do this on the 
*client*, because the performance difference between client RAM and server 
storage is still fairly significant, especially if the application is doing 
sub-page read or write operations.
However, on the *server* the OSS and MDS will *not* put flash storage into the 
page cache, because using the kernel page cache has a measurable overhead, and 
(at least in our testing) the performance of NVMe IOPS is actually better 
*without* the page cache because more CPU is available to handle RPCs.  This is 
controlled on the server with 
osd-ldiskfs.*.{read_cache_enable,writethrough_cache_enable}, which default to 0 
if the block device is non-rotational and to 1 if it is rotational.

Then my question is, what is it checking to determine non-rotational?

On our systems the NVMe disks have read/writethrough_cache_enable = 1 (DDN 
SFA400NVXE) with
===
/dev/sde on /lustre/stor10/ost (NVMe)
cat /sys/block/sde/queue/rotational
0
lctl get_param osd-ldiskfs.*.*cache*enable
osd-ldiskfs.stor10-OST.read_cache_enable=1
osd-ldiskfs.stor10-OST.writethrough_cache_enable=1

EXAScaler SFA CentOS 5.2.3-r5
kmod-lustre-2.12.6_ddn58-1.el7.x86_64
===

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se  Mobile: +46 70 7716134  
Fax: +46 90-580 14
WWW: http://www.hpc2n.umu.se
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-20 Thread Åke Sandgren




On 5/20/22 09:53, Andreas Dilger via lustre-discuss wrote:

To elaborate a bit on Patrick's answer, there is no mechanism to do this on the 
*client*, because the performance difference between client RAM and server 
storage is still fairly significant, especially if the application is doing 
sub-page read or write operations.

However, on the *server* the OSS and MDS will *not* put flash storage into the 
page cache, because using the kernel page cache has a measurable overhead, and 
(at least in our testing) the performance of NVMe IOPS is actually better 
*without* the page cache because more CPU is available to handle RPCs.  This is 
controlled on the server with 
osd-ldiskfs.*.{read_cache_enable,writethrough_cache_enable}, which default to 0 
if the block device is non-rotational and to 1 if it is rotational.


Then my question is, what is it checking to determine non-rotational?

On our systems the NVMe disks have read/writethrough_cache_enable = 1 
(DDN SFA400NVXE) with

===
/dev/sde on /lustre/stor10/ost (NVMe)
cat /sys/block/sde/queue/rotational
0
lctl get_param osd-ldiskfs.*.*cache*enable
osd-ldiskfs.stor10-OST.read_cache_enable=1
osd-ldiskfs.stor10-OST.writethrough_cache_enable=1

EXAScaler SFA CentOS 5.2.3-r5
kmod-lustre-2.12.6_ddn58-1.el7.x86_64
===

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se  Mobile: +46 70 7716134  Fax: +46 90-580 14
WWW: http://www.hpc2n.umu.se
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-20 Thread Andreas Dilger via lustre-discuss
To elaborate a bit on Patrick's answer, there is no mechanism to do this on the 
*client*, because the performance difference between client RAM and server 
storage is still fairly significant, especially if the application is doing 
sub-page read or write operations.

However, on the *server* the OSS and MDS will *not* put flash storage into the 
page cache, because using the kernel page cache has a measurable overhead, and 
(at least in our testing) the performance of NVMe IOPS is actually better 
*without* the page cache because more CPU is available to handle RPCs.  This is 
controlled on the server with 
osd-ldiskfs.*.{read_cache_enable,writethrough_cache_enable}, which default to 0 
if the block device is non-rotational and to 1 if it is rotational.

Separately, there is a tunable for avoiding the page cache for large read/write 
RPCs, osd-ldiskfs.*.readcache_max_io_mb (8 by default), so RPCs >= 8MB go 
directly to disk to avoid blowing out the page cache on the server.
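
For example, something like this checks the current threshold and lowers it 
(the value is in MB; 4 is just an illustrative choice):
===
lctl get_param osd-ldiskfs.*.readcache_max_io_mb
# only cache I/Os smaller than 4 MB; larger RPCs bypass the page cache
lctl set_param osd-ldiskfs.*.readcache_max_io_mb=4
===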

Cheers, Andreas

> On May 19, 2022, at 12:21, Patrick Farrell via lustre-discuss wrote:
> 
> Well, you could use two file descriptors, one for O_DIRECT and one otherwise.
> 
> SSD is a fast medium, but my instinct is that the desirability of having data 
> in RAM depends much more on the I/O pattern and is hard to optimize for in 
> advance - do you read the data you wrote?  (Or read data repeatedly?)
> 
> In any case, there's no mechanism today.  It's also relatively marginal if 
> we're just doing buffered I/O and then forcing the data out - it will reduce 
> memory usage but it won't improve performance.
> 
> -Patrick
> 
> From: John Bauer 
> Sent: Thursday, May 19, 2022 1:16 PM
> To: Patrick Farrell ; lustre-discuss@lists.lustre.org 
> 
> Subject: Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent
>  
> Pat,
> No, not in general.  It just seems that if one is storing data on an SSD, it 
> should be optional to have it not stored in memory (why store it in two fast 
> media?).
> O_DIRECT is not of value, as that would apply to all extents, whether on SSD 
> or HDD.  O_DIRECT on Lustre has been problematic for me in the past, 
> performance-wise.
> John
> On 5/19/22 13:05, Patrick Farrell wrote:
>> No, and I'm not sure I agree with you at first glance.
>> 
>> Is this just generally an idea that data stored on SSD should not be in RAM? 
>>  If so, there's no mechanism for that other than using direct I/O.
>> 
>> -Patrick
>> From: lustre-discuss  on behalf of 
>> John Bauer 
>> Sent: Thursday, May 19, 2022 12:48 PM
>> To: lustre-discuss@lists.lustre.org 
>> Subject: [lustre-discuss] Avoiding system cache when using ssd pfl extent
>>  
>> When using PFL, and using an SSD as the first extent, it seems it would 
>> be advantageous to not have that extent's file data consume memory in 
>> the client's system buffers.  It would be similar to using O_DIRECT, but 
>> on a per-extent basis.  Is there a mechanism for that already?
>> 
>> Thanks,
>> 
>> John
>> 
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-19 Thread Patrick Farrell via lustre-discuss
Well, you could use two file descriptors, one for O_DIRECT and one otherwise.

SSD is a fast medium, but my instinct is that the desirability of having data in 
RAM depends much more on the I/O pattern and is hard to optimize for in advance - 
do you read the data you wrote?  (Or read data repeatedly?)

In any case, there's no mechanism today.  It's also relatively marginal if 
we're just doing buffered I/O and then forcing the data out - it will reduce memory 
usage but it won't improve performance.

-Patrick


From: John Bauer 
Sent: Thursday, May 19, 2022 1:16 PM
To: Patrick Farrell ; lustre-discuss@lists.lustre.org 

Subject: Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent


Pat,

No, not in general.  It just seems that if one is storing data on an SSD, it 
should be optional to have it not stored in memory (why store it in two fast 
media?).

O_DIRECT is not of value, as that would apply to all extents, whether on SSD or 
HDD.  O_DIRECT on Lustre has been problematic for me in the past, 
performance-wise.

John

On 5/19/22 13:05, Patrick Farrell wrote:
No, and I'm not sure I agree with you at first glance.

Is this just generally an idea that data stored on SSD should not be in RAM?  
If so, there's no mechanism for that other than using direct I/O.

-Patrick

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of John 
Bauer <bau...@iodoctors.com>
Sent: Thursday, May 19, 2022 12:48 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Avoiding system cache when using ssd pfl extent

When using PFL, and using an SSD as the first extent, it seems it would
be advantageous to not have that extent's file data consume memory in
the client's system buffers.  It would be similar to using O_DIRECT, but
on a per-extent basis.  Is there a mechanism for that already?

Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-19 Thread John Bauer

Pat,

No, not in general.  It just seems that if one is storing data on an 
SSD, it should be optional to have it not stored in memory (why store it 
in two fast media?).

O_DIRECT is not of value, as that would apply to all extents, whether on 
SSD or HDD.  O_DIRECT on Lustre has been problematic for me in the 
past, performance-wise.


John

On 5/19/22 13:05, Patrick Farrell wrote:

No, and I'm not sure I agree with you at first glance.

Is this just generally an idea that data stored on SSD should not be 
in RAM?  If so, there's no mechanism for that other than using direct I/O.


-Patrick

*From:* lustre-discuss  on behalf of John Bauer 
*Sent:* Thursday, May 19, 2022 12:48 PM
*To:* lustre-discuss@lists.lustre.org 
*Subject:* [lustre-discuss] Avoiding system cache when using ssd pfl extent

When using PFL, and using an SSD as the first extent, it seems it would
be advantageous to not have that extent's file data consume memory in
the client's system buffers.  It would be similar to using O_DIRECT, but
on a per-extent basis.  Is there a mechanism for that already?

Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-19 Thread Patrick Farrell via lustre-discuss
No, and I'm not sure I agree with you at first glance.

Is this just generally an idea that data stored on SSD should not be in RAM?  
If so, there's no mechanism for that other than using direct I/O.

-Patrick

From: lustre-discuss  on behalf of 
John Bauer 
Sent: Thursday, May 19, 2022 12:48 PM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] Avoiding system cache when using ssd pfl extent

When using PFL, and using an SSD as the first extent, it seems it would
be advantageous to not have that extent's file data consume memory in
the client's system buffers.  It would be similar to using O_DIRECT, but
on a per-extent basis.  Is there a mechanism for that already?

Thanks,

John
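
For reference, a PFL layout of the kind described above might be created roughly 
like this (a sketch; the pool names "flash" and "archive" and the 256M boundary 
are illustrative, and assume the SSD and HDD OSTs have been grouped into OST pools):
===
# first 256 MiB of each file on the SSD pool, the remainder on HDD OSTs
lfs setstripe -E 256M -c 1 -p flash \
              -E -1 -c 4 -p archive /mnt/lustre/dir
lfs getstripe /mnt/lustre/dir
===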

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org