[lustre-discuss] Lustre caching and NUMA nodes

2023-12-07 Thread John Bauer

Peter,

A delayed reply to one more of your questions, "What makes you think 
'lustre' is doing that?", as I had to make another run and gather OSC 
stats on all the Lustre file systems mounted on the host where I run dd.


This host has 12 Lustre file systems comprising 507 OSTs. While dd 
was running I instrumented the amount of cached data associated with all 
507 OSCs; that is reflected in the bottom frame of the image below.  
Note that in the top frame there was always about 5GB of free memory 
and 50GB of cached data.  I believe it has to be a Lustre issue, as the 
Linux buffer cache has no knowledge that a page is a Lustre page.  How 
is it that every OSC, on all 12 file systems on the host, has its memory 
dropped to 0, yet the other 50GB of cached data on the host remains? 
It's as though drop_caches is being run on only the Lustre file systems.  
My googling around finds no such feature in drop_caches that would allow 
file-system-specific dropping.  Is there some tunable that gives Lustre 
pages a higher potential for eviction than other cached data?
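
For reference, here is roughly how I gather the per-OSC numbers and which 
client-side knobs I have been checking.  This is only a sketch: 
osc.*.osc_cached_mb is my guess at the per-OSC cache counter on this 2.14 
client, so substitute whatever counter your version exposes.

  # Client-side limits that could bound Lustre cached/dirty pages:
  lctl get_param llite.*.max_cached_mb   # per-filesystem cap on cached pages
  lctl get_param osc.*.max_dirty_mb      # per-OSC cap on dirty pages
  sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_expire_centisecs

  # Sample cached data for every OSC every 50 ms while dd runs
  # (osc.*.osc_cached_mb is an assumed name; adjust to your client):
  while sleep 0.05; do
      echo "== $(date +%s.%N)"
      lctl get_param osc.*.osc_cached_mb
  done >> osc_cache_trace.txt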


Another subtle point of interest: note that dd's writing resumes, as 
reflected in the growth of the cached data for its 8 OSTs, before all 
the other OSCs have finished dumping.  This is most visible around 2.1 
seconds into the run.  Also different is that this dumping phenomenon 
happened 3 times over the course of the 10-second run, instead of just 
once as in the previous run I was referencing, costing this dd run 1.2 seconds.


John


On 12/6/23 14:24, lustre-discuss-requ...@lists.lustre.org wrote:

Send lustre-discuss mailing list submissions to
lustre-discuss@lists.lustre.org

To subscribe or unsubscribe via the World Wide Web, visit
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
or, via email, send a message with subject or body 'help' to
lustre-discuss-requ...@lists.lustre.org

You can reach the person managing the list at
lustre-discuss-ow...@lists.lustre.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of lustre-discuss digest..."


Today's Topics:

1. Coordinating cluster start and shutdown? (Jan Andersen)
2. Re: Lustre caching and NUMA nodes (Peter Grandi)
3. Re: Coordinating cluster start and shutdown?
   (Bertschinger, Thomas Andrew Hjorth)
4. Lustre server still try to recover the lnet reply to the
   depreciated clients (Huang, Qiulan)


--

Message: 1
Date: Wed, 6 Dec 2023 10:27:11 +
From: Jan Andersen
To: lustre
Subject: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:<696fac02-df18-4fe1-967c-02c3bca42...@comind.io>
Content-Type: text/plain; charset=UTF-8; format=flowed

Are there any tools for coordinating the start and shutdown of a Lustre 
filesystem, so that the OSS systems don't attempt to mount disks before the MGT 
and MDT are online?


--

Message: 2
Date: Wed, 6 Dec 2023 12:40:54 +
From: p...@lustre.list.sabi.co.uk (Peter Grandi)
To: list Lustre discussion
Subject: Re: [lustre-discuss] Lustre caching and NUMA nodes
Message-ID:<25968.27606.536270.208...@petal.ty.sabi.co.uk>
Content-Type: text/plain; charset=iso-8859-1


> I have an OSC caching question.  I am running a dd process
> which writes an 8GB file.  The file is on lustre, striped
> 8x1M.

How the Lustre instance servers store the data may not have a
huge influence on what happens in the client's system buffer
cache.


> This is run on a system that has 2 NUMA nodes (cpu sockets).
> [...] Why does lustre go to the trouble of dumping node1 and
> then not use node1's memory, when there was always plenty of
> free memory on node0?

What makes you think "lustre" is doing that?

Are you aware of the values of the flusher settings such as
'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?

Have you considered looking at NUMA policies e.g. as described
in 'man numactl'?

Also, while you surely know better, I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.


--

Message: 3
Date: Wed, 6 Dec 2023 16:00:38 +
From: "Bertschinger, Thomas Andrew Hjorth"
To: Jan Andersen, lustre

Subject: Re: [lustre-discuss] Coordinating cluster start and shutdown?
Message-ID:



Content-Type: text/plain; charset="iso-8859-1"

Hello Jan,

You can use the Pacemaker / Corosync high-availability software stack for this: 
specifically, ordering constraints [1] can be used.

Unfortunately, Pacemaker is probably over-the-top if you don't need HA -- its 
configuration is complex and difficult to get right, and it significantly 
complicates system administration.
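
For what it's worth, a minimal sketch of what such an ordering constraint 
looks like with pcs (the resource names here are hypothetical -- substitute 
the resources that mount your MGT/MDT and OSTs):

  # Start the MGS/MDT resource before any OST resource is started:
  pcs constraint order start mdt-resource then start ost-resource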

Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-06 Thread Peter Grandi via lustre-discuss
> I have an OSC caching question.  I am running a dd process
> which writes an 8GB file.  The file is on lustre, striped
> 8x1M.

How the Lustre instance servers store the data may not have a
huge influence on what happens in the client's system buffer
cache.

> This is run on a system that has 2 NUMA nodes (cpu sockets).
> [...] Why does lustre go to the trouble of dumping node1 and
> then not use node1's memory, when there was always plenty of
> free memory on node0?

What makes you think "lustre" is doing that?

Are you aware of the values of the flusher settings such as
'dirty_bytes', 'dirty_ratio', 'dirty_expire_centisecs'?

Have you considered looking at NUMA policies e.g. as described
in 'man numactl'?
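
For example, something along these lines shows both (just a sketch; numastat 
comes with the numactl package):

  # Writeback/flusher settings in effect on the client:
  sysctl vm.dirty_ratio vm.dirty_background_ratio \
         vm.dirty_bytes vm.dirty_background_bytes \
         vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

  # NUMA topology, per-node free memory, and the current policy:
  numactl --hardware
  numactl --show

  # Where the running dd's pages actually landed, per node:
  numastat -p dd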

Also, while you surely know better, I usually try to avoid
buffering large amounts of to-be-written data in RAM (whether on
the OSC or the OSS), and to my taste 8GiB "in-flight" is large.
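
If one wanted to bound that on the client, a couple of options (a sketch 
only; the file path is hypothetical, and set_param needs root):

  # Lower the per-OSC cap on dirty, not-yet-written-back pages:
  lctl set_param osc.*.max_dirty_mb=256

  # Or bypass the client page cache for this kind of streaming write:
  dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=8192 oflag=direct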
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-05 Thread John Bauer

Andreas,

Thanks for the reply.

Client version is 2.14.0_ddn98. Here is a plot of 
*write_RPCs_in_flight*, snapshotted every 50ms.  The max for any of 
the samples, for any of the OSCs, was 1.  There were no RPCs in flight 
while the OSCs were dumping memory.  The number following the OSC name 
in the legend is the sum of *write_RPCs_in_flight* over all the 
intervals.  To be honest, I have never really looked at the 
RPCs-in-flight numbers.  I'm running as a lowly user without access to 
any of the server data, so I have nothing on osd-ldiskfs.*.brw_stats.


I should also point out that the backing storage on the servers is SSD, 
so I would think committing to storage on the server side should be 
pretty quick.


I'm trying to get a handle on how the Linux buffer cache works. Everything I 
find on the web is pretty old.  Here's one from 2012: 
https://lwn.net/Articles/495543/


Can someone point me to something more current, and perhaps Lustre related?

As for images, I think the list server strips them.  In previous 
postings, when I included images, what I got back when the list 
server broadcast the message had the images stripped.  I'll include the 
images and also a link to the image on Dropbox.


Thanks again,

John

https://www.dropbox.com/scl/fi/fgmz4wazr6it9q2aeo0mb/write_RPCs_in_flight.png?rlkey=d3ri2w2n7isggvn05se4j3a6b&dl=0


On 12/5/23 22:33, Andreas Dilger wrote:


On Dec 4, 2023, at 15:06, John Bauer  wrote:


I have an OSC caching question.  I am running a dd process which 
writes an 8GB file.  The file is on lustre, striped 8x1M. This is run 
on a system that has 2 NUMA nodes (cpu sockets). All the data is 
apparently stored on one NUMA node (node1 in the plot below) until 
node1 runs out of free memory.  Then it appears that dd comes to a 
stop (no more writes complete) until lustre dumps the data from 
node1.  Then dd continues writing, but now the data is stored on the 
second NUMA node, node0.  Why does lustre go to the trouble of 
dumping node1 and then not use node1's memory, when there was always 
plenty of free memory on node0?


I'll forgo the explanation of the plot.  Hopefully it is clear 
enough.  If someone has questions about what the plot is depicting, 
please ask.


https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0 



Hi John,
thanks for your detailed analysis.  It would be good to include the 
client kernel and Lustre version in this case, as the page cache 
behaviour can vary dramatically between different versions.


The allocation of the page cache pages may actually be out of the 
control of Lustre, since they are typically being allocated by the 
kernel VM affine to the core where the process that is doing the IO is 
running.  It may be that the "dd" is rescheduled to run on node0 
during the IO, since the ptlrpcd threads will be busy processing all 
of the RPCs during this time, and then dd will start allocating pages 
from node0.


That said, it isn't clear why the client doesn't start flushing the 
dirty data from cache earlier.  Is it actually sending the data to the 
OSTs, but then waiting for the OSTs to reply that the data has been 
committed to the storage before dropping the cache?


It would be interesting to plot the 
osc.*.rpc_stats::write_rpcs_in_flight and ::pending_write_pages to see 
if the data is already in flight.  The osd-ldiskfs.*.brw_stats on the 
server would also be useful to graph over the same period, if possible.


It *does* look like the "node1 dirty" is kept at a low value for the 
entire run, so it at least appears that RPCs are being sent, but there 
is no page reclaim triggered until memory is getting low.  Doing page 
reclaim is really the kernel's job, but it seems possible that the 
Lustre client may not be suitably notifying the kernel about the dirty 
pages and kicking it in the butt earlier to clean up the pages.


PS: my preference would be to just attach the image to the email 
instead of hosting it externally, since it is only 55 KB.  Is this 
blocked by the list server?


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre caching and NUMA nodes

2023-12-05 Thread Andreas Dilger via lustre-discuss

On Dec 4, 2023, at 15:06, John Bauer <bau...@iodoctors.com> wrote:

I have an OSC caching question.  I am running a dd process which writes an 
8GB file.  The file is on lustre, striped 8x1M. This is run on a system that 
has 2 NUMA nodes (cpu sockets). All the data is apparently stored on one NUMA 
node (node1 in the plot below) until node1 runs out of free memory.  Then it 
appears that dd comes to a stop (no more writes complete) until lustre dumps 
the data from node1.  Then dd continues writing, but now the data is stored 
on the second NUMA node, node0.  Why does lustre go to the trouble of dumping 
node1 and then not use node1's memory, when there was always plenty of free 
memory on node0?

I'll forgo the explanation of the plot.  Hopefully it is clear enough.  If 
someone has questions about what the plot is depicting, please ask.

https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0

Hi John,
thanks for your detailed analysis.  It would be good to include the client 
kernel and Lustre version in this case, as the page cache behaviour can vary 
dramatically between different versions.

The allocation of the page cache pages may actually be out of the control of 
Lustre, since they are typically being allocated by the kernel VM affine to the 
core where the process that is doing the IO is running.  It may be that the 
"dd" is rescheduled to run on node0 during the IO, since the ptlrpcd threads 
will be busy processing all of the RPCs during this time, and then dd will 
start allocating pages from node0.

That said, it isn't clear why the client doesn't start flushing the dirty data 
from cache earlier.  Is it actually sending the data to the OSTs, but then 
waiting for the OSTs to reply that the data has been committed to the storage 
before dropping the cache?

It would be interesting to plot the osc.*.rpc_stats::write_rpcs_in_flight and 
::pending_write_pages to see if the data is already in flight.  The 
osd-ldiskfs.*.brw_stats on the server would also be useful to graph over the 
same period, if possible.
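
Something like this would capture the client side of it while dd runs (a 
rough sketch; write_rpcs_in_flight and pending_write_pages are lines within 
the rpc_stats output, and the field names can vary a little by version):

  # Sample the write pipeline of every OSC every 50 ms:
  while sleep 0.05; do
      echo "== $(date +%s.%N)"
      lctl get_param osc.*.rpc_stats | \
          grep -E '^osc\.|RPCs in flight|pending write pages'
  done > rpc_trace.txt

  # On the OSS, if you have access:
  lctl get_param osd-ldiskfs.*.brw_stats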

It *does* look like the "node1 dirty" is kept at a low value for the entire 
run, so it at least appears that RPCs are being sent, but there is no page 
reclaim triggered until memory is getting low.  Doing page reclaim is really 
the kernel's job, but it seems possible that the Lustre client may not be 
suitably notifying the kernel about the dirty pages and kicking it in the butt 
earlier to clean up the pages.
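
One way to watch that per NUMA node from the client would be something like 
this (a sketch, assuming numastat from the numactl package is installed):

  # Per-node free memory, page cache, and dirty pages, twice a second:
  while sleep 0.5; do
      echo "== $(date +%s.%N)"
      numastat -m | grep -E 'MemFree|FilePages|Dirty'
  done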

PS: my preference would be to just attach the image to the email instead of 
hosting it externally, since it is only 55 KB.  Is this blocked by the list 
server?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre caching and NUMA nodes

2023-12-04 Thread John Bauer
I have an OSC caching question.  I am running a dd process which 
writes an 8GB file.  The file is on lustre, striped 8x1M. This is run on 
a system that has 2 NUMA nodes (cpu sockets). All the data is 
apparently stored on one NUMA node (node1 in the plot below) until node1 
runs out of free memory.  Then it appears that dd comes to a stop (no 
more writes complete) until lustre dumps the data from node1.  Then 
dd continues writing, but now the data is stored on the second NUMA 
node, node0.  Why does lustre go to the trouble of dumping node1 and 
then not use node1's memory, when there was always plenty of free memory 
on node0?


I'll forgo the explanation of the plot.  Hopefully it is clear enough.  
If someone has questions about what the plot is depicting, please ask.


https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0

Thanks for any insight shared,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org