Re: [lustre-discuss] kernel threads for rpcs in flight

2024-04-30 Thread Andreas Dilger via lustre-discuss
On Apr 29, 2024, at 02:36, Anna Fuchs <anna.fu...@uni-hamburg.de> wrote:

Hi Andreas.

Thank you very much, that helps a lot.
Sorry for the confusion, I primarily meant the client. The servers rarely have 
to compete with anything else for CPU resources I guess.

The mechanism to start new threads is relatively simple.  Before a server 
thread processes a new request, if it is the last thread available and the 
maximum number of threads is not yet running, it will try to launch a new 
thread; repeat as needed.  So the thread count will depend on the client RPC 
load, the RPC processing rate, and lock contention on whatever resources 
those RPCs are accessing.
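
For reference, the server-side service thread counts can be inspected (and the 
maximum capped) with tunables along these lines; this is only a sketch, and the 
exact service names depend on the node role and Lustre version:

# on an OSS, show how many I/O service threads are currently running
lctl get_param ost.OSS.ost_io.threads_started
# show the configured lower/upper bounds
lctl get_param ost.OSS.ost_io.threads_min ost.OSS.ost_io.threads_max
# optionally cap the upper bound (value is illustrative)
lctl set_param ost.OSS.ost_io.threads_max=256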

And what conditions are on the client? Are the threads then driven by the 
workload of the application somehow?

The number of ptlrpcd threads per CPT is set by the "ptlrpcd_partner_group_size" 
module parameter, and defaults to 2 threads per CPT, IIRC.  I don't think that 
clients dynamically start/stop ptlrpcd threads at runtime.
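
For example, on a client these could be checked via the ptlrpc module 
parameters (paths are from memory and may differ between releases):

# ptlrpcd partner group size and per-CPT thread limit
cat /sys/module/ptlrpc/parameters/ptlrpcd_partner_group_size
cat /sys/module/ptlrpc/parameters/ptlrpcd_per_cpt_max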

Imagine an edge case where all but one core are pinned and at 100% constant 
load and one is dumping RAM to Lustre. Presumably, the available core will be 
taken. But will Lustre or the kernel then spawn additional threads and try to 
somehow interleave them with those of the application, or will it simply handle 
it with 1-2 threads on the available core (assume single stream to single OST)? 
In any case, I suppose the I/O transfer would suffer under the resource 
shortage, but my question would be to what extent it would (try to) hinder the 
application. For latency-critical applications, such small delays can already 
lead to idle waves. And surely, the Lustre threads are usually not CPU-hungry, 
but they will be when it comes to encryption and compression.

When there are RPCs in the queue for any ptlrpcd, it will be woken up and 
scheduled by the kernel, so it will compete with the application threads.  
IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT 
queue, it will try to steal RPCs from another CPT, on the assumption that the 
local CPU is not generating any RPCs and it would therefore be beneficial to 
offload RPCs from another CPU that *is* generating them.  If the application 
threads are extremely CPU hungry, then the kernel will not schedule the ptlrpcd 
threads on those cores very often, and the ptlrpcd threads on the "idle" cores 
will be able to run more frequently.
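
One rough way to observe this on a client is to watch which cores the ptlrpcd 
kernel threads actually run on, using standard tools rather than any 
Lustre-specific interface, e.g.:

# list ptlrpcd threads with the CPU they last ran on (PSR column)
ps -eLo pid,psr,pcpu,comm | grep ptlrpcd
# or sample per-thread CPU usage for a few seconds
pidstat -t 1 5 | grep ptlrpcd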

Whether this behavior is optimal or not is subject to debate, and 
investigation/improvements are of course welcome.  Definitely, data checksums 
have some overhead (a few percent), and client-side data compression (which is 
done by the ptlrpcd threads) would consume a significant number of CPU cycles, 
but given the large number of CPU cores on client nodes these days, this may 
still provide a net performance benefit if the IO bottleneck is on the server.
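
For example, the data checksum settings can be inspected and toggled per OSC 
to estimate their CPU cost (the patterns and values below are illustrative):

# are bulk data checksums enabled, and with which algorithm?
lctl get_param osc.*.checksums osc.*.checksum_type
# temporarily disable them for a comparison run (not recommended in production)
lctl set_param osc.*.checksums=0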

With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, 
but the statistics are too inaccurate to capture this.  The distribution of 
threads to cores is regulated by the Linux kernel, right? Does anyone have 
experience with what happens when all CPUs are under full load with the 
application or something else?

Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a 
single client can still have tens or hundreds of RPCs in flight to different 
servers.  The client will send many RPC types directly from the process 
context, since they are waiting on the result anyway.  For asynchronous bulk 
RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= 
Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This minimizes the 
cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores 
are not busy with userspace tasks.  Otherwise, the ptlrpcd thread on another 
CPT will steal RPCs from the queues.
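
Since these are per-target tunables, the aggregate picture on a client can be 
seen with something like the following (the filesystem name is a placeholder):

# per-OST and per-MDT limits on concurrent RPCs from this client
lctl get_param osc.*.max_rpcs_in_flight mdc.*.max_rpcs_in_flight
# raise the limit for all OSTs of one filesystem (illustrative value)
lctl set_param osc.<fsname>-OST*.max_rpcs_in_flight=16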

Do the Lustre threads suffer? Is there a prioritization of the Lustre threads 
over other tasks?

Are you asking about the client or the server?  Many of the client RPCs are 
generated by the client threads, but the running ptlrpcd threads do not 
have a higher priority than client application threads.  If the application 
threads are running on some cores but other cores are idle, then the ptlrpcd 
threads on the other cores will try to process the RPCs to allow the application 
threads to continue running there.  Otherwise, if all cores are busy (as is 
typical for HPC applications), then the ptlrpcd threads will be scheduled by the 
kernel as needed.

Are there readily available statistics or tools for this scenario?

What statistics are you looking for?  There are "{osc,mdc}.*.stats" and 
"{osc,mdc}.*.rpc_stats" that have aggregate information about RPC counts and 
latency.
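
For example:

# per-OST histograms of RPCs in flight, RPC sizes, and read/write latency
lctl get_param osc.*.rpc_stats
# writing to the file resets the counters before a measurement interval
lctl set_param osc.*.rpc_stats=clear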

Oh, right, these tell a lot. Isn't there also something to log the utilization 
and location of these threads? Otherwise, I'll continue trying with perf, which 
seems 

Re: [lustre-discuss] [EXTERNAL] [BULK] Files created in append mode don't obey directory default stripe count

2024-04-30 Thread Otto, Frank via lustre-discuss
Many thanks, Simon and Andreas.  I agree the behaviour makes sense.  Apologies 
for not checking the documentation.

Kind regards,
Frank

--
Dr. Frank Otto
Senior Research Infrastructure Developer
UCL Centre for Advanced Research Computing
Tel: 020 7679 1506

From: lustre-discuss  on behalf of 
Andreas Dilger via lustre-discuss 
Sent: 29 April 2024 19:29
To: Simon Guilbault 
Cc: lustre-discuss@lists.lustre.org 
Subject: Re: [lustre-discuss] [EXTERNAL] [BULK] Files created in append mode 
don't obey directory default stripe count


Simon is exactly correct.  This is expected behavior for files opened with 
O_APPEND, at least until LU-12738 is implemented.  Since O_APPEND writes are 
(by definition) entirely serialized, having multiple stripes on such files is 
mostly useless and just adds overhead.

Feel free to read https://jira.whamcloud.com/browse/LU-9341 for the very 
lengthy saga on the history of this behavior.

Cheers, Andreas

On Apr 29, 2024, at 10:42, Simon Guilbault 
<simon.guilba...@calculquebec.ca> wrote:

This is the expected behaviour. In the original implementation of PFL, when a 
file was opened in append mode, the lock from 0 to EOF initialized all 
stripes of the PFL file. We have a PFL layout on our system with 1 stripe up to 
1 GB, then increasing to 4 and then 32 stripes as the file gets very large. 
This was a problem with software that creates 4 KB log files (like slurm.out), 
because the append mode caused them to be created with all 32 stripes. This 
was patched a few releases ago, so that behaviour can be changed, but I would 
recommend keeping 1 stripe for files that are written in append mode.
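
For illustration, a PFL layout of the kind described above could be set on a 
directory roughly like this (the boundaries past 1 GB are illustrative, not 
our exact values):

# 1 stripe up to 1 GiB, 4 stripes up to 64 GiB, 32 stripes beyond that
lfs setstripe -E 1G -c 1 -E 64G -c 4 -E -1 -c 32 /project/dir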

From the manual:
O_APPEND mode. When files are opened for append, they instantiate all 
uninitialized components expressed in the layout. Typically, log files are 
opened for append, and complex layouts can be inefficient.
Note
The mdd.*.append_stripe_count and mdd.*.append_pool options can be used to 
specify special default striping for files created with O_APPEND.
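
For example, something along these lines on the node holding the MGS (the pool 
name is just a placeholder):

# create all O_APPEND files with a single stripe
lctl set_param -P mdd.*.append_stripe_count=1
# optionally direct O_APPEND files to a specific OST pool
lctl set_param -P mdd.*.append_pool=flash_pool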

On Mon, Apr 29, 2024 at 11:21 AM Vicker, Darby J. (JSC-EG111)[Jacobs 
Technology, Inc.] via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Wow, I would say that is definitely not expected.  I can recreate this on both 
of our Lustre file systems.  One is community Lustre 2.14, the other is a DDN 
EXAScaler.  Shown below is our community Lustre system, but we also have a 
3-segment PFL on our EXAScaler and the behavior is the same there.



$ echo > aaa
$ echo >> bbb
$ lfs getstripe aaa bbb
aaa
  lcm_layout_gen:    3
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             1
    lcme_mirror_id:      0
    lcme_flags:          init
    lcme_extent.e_start: 0
    lcme_extent.e_end:   33554432
      lmm_stripe_count:  1
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: 6
      lmm_objects:
      - 0: { l_ost_idx: 6, l_fid: [0x10006:0xace8112:0x0] }

    lcme_id:             2
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 33554432
    lcme_extent.e_end:   10737418240
      lmm_stripe_count:  4
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

    lcme_id:             3
    lcme_mirror_id:      0
    lcme_flags:          0
    lcme_extent.e_start: 10737418240
    lcme_extent.e_end:   EOF
      lmm_stripe_count:  8
      lmm_stripe_size:   4194304
      lmm_pattern:       raid0
      lmm_layout_gen:    0
      lmm_stripe_offset: -1

bbb
lmm_stripe_count:  1
lmm_stripe_size:   2097152
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 3
        obdidx       objid        objid        group
             3   179773949    0xab721fd            0


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> 
on behalf of Otto, Frank via lustre-discuss <lustre-discuss@lists.lustre.org>
Date: Monday, April 29, 2024 at 8:33 AM
To: lustre-discuss@lists.lustre.org <lustre-discuss@lists.lustre.org>
Subject: [EXTERNAL] [BULK] [lustre-discuss] Files created in append mode don't 
obey directory default stripe count


See subject. Is it a known issue? Is it expected? Easy to reproduce:

# lfs getstripe .
.
stripe_count:  4 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1

# echo > aaa
# echo >> bbb
# lfs getstripe .
.
stripe_count:  4 stripe_size:   1048576 pattern:   raid0 stripe_offset: -1

./aaa
lmm_stripe_count:  4
lmm_stripe_size:   1048576