Re: [lustre-discuss] fio and lustre performance

2022-08-25 Thread Andreas Dilger via lustre-discuss
No comment on the actual performance issue, but we normally test fio using the 
libaio interface (which is handled in the kernel) instead of posixaio (which is 
handled by threads in userspace, AFAIK), and also use DirectIO to avoid memory 
copies (OK if there are enough IO requests in flight).  It should be a 
relatively easy change to see if that improves the behaviour.
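
For example, something along these lines (only a sketch, reusing the options 
from the command below and untested; direct=1 replaces buffered=1, since the 
two options are mutually exclusive in fio):

   # switch to the in-kernel libaio engine and direct I/O
   fio --randrepeat=1  \
      --ioengine=libaio  \
      --direct=1  \
      --gtod_reduce=1  \
      --name=test  \
      --filename=${fileName}  \
      --bs=1M  \
      --iodepth=64  \
      --size=40G  \
      --readwrite=randwrite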

Other things to check: that osc.*.max_dirty_mb and llite.*.max_cached_mb are not 
hitting their limits and throttling IO until the data is flushed, and that 
osc.*.max_rpcs_in_flight across the OSCs is *at least* 64 to keep up with the 
input generation.
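
For instance, on the client (a sketch only; exact parameter values will differ 
per system):

   # current limits and how much dirty data each OSC is holding
   lctl get_param osc.*.max_dirty_mb osc.*.cur_dirty_bytes
   lctl get_param llite.*.max_cached_mb
   # RPCs in flight per OSC; raise it if it is below 64
   lctl get_param osc.*.max_rpcs_in_flight
   lctl set_param osc.*.max_rpcs_in_flight=64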

When you consider that Lustre (distributed coherent persistent network 
filesystem) is "only half" of the performance (28s vs 13s) of a local ephemeral 
RAM-based filesystem, it isn't doing too badly...

Cheers, Andreas

On Aug 25, 2022, at 11:29, John Bauer <bau...@iodoctors.com> wrote:

Hi all,

I'm trying to figure out an odd behavior when running an fio 
(https://git.kernel.dk/cgit/fio/) benchmark on a Lustre file system.

fio --randrepeat=1  \
   --ioengine=posixaio  \
   --buffered=1  \
   --gtod_reduce=1  \
   --name=test  \
   --filename=${fileName}  \
   --bs=1M  \
   --iodepth=64  \
   --size=40G  \
   --readwrite=randwrite

In short, the application queues 40,000 random aio_write64(nbyte=1M) requests to a 
maximum depth of 64, doing aio_suspend64 followed by aio_write to keep 64 
outstanding aio requests.  My I/O library that processes the aio requests does 
so with 4 pthreads removing aio requests from the queue and doing the I/Os as 
pwrite64()s.  The odd behavior is the intermittent pauses that can be seen in 
the first plot below.  The X-axis is wall clock time, in seconds, and the left 
Y-axis is file position.  The horizontal blue lines indicate the amount of time 
each of the pwrite64() calls is active and where in the file the I/O is occurring.  
The right Y-axis is the cumulative cpu times for both the process and the kernel 
during the run.  There is minimal user cpu time for either the process or the 
kernel.  The cumulative system cpu time attributable to the process (the red line) 
runs at a slope of ~4 system cpu seconds per wall clock second, which makes sense 
since there are 4 pthreads at work in the user process.  The cumulative system cpu 
time for the kernel as a whole (the green line) runs at ~12 system cpu seconds per 
wall clock second.  Note that during the pauses the system cpu accumulation drops 
to near zero (zero slope).

This is run on a dedicated Ivy Bridge node with 40 cores (Intel(R) Xeon(R) CPU 
E5-2680 v2 @ 2.80GHz).

The node has 64G of memory.

The file is striped with a single-component PFL layout, 8 stripes x 1M.  The Lustre 
version is 2.12.8 ddn12.
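
(For reference, this kind of layout can be checked with lfs getstripe, and a 
single-component PFL layout of 8 stripes x 1M could be created with something 
like the following, where ${fileName} is just the placeholder from the fio 
command:)

   # show the actual layout of the test file
   lfs getstripe ${fileName}
   # one PFL component covering the whole file, 8 stripes of 1M
   lfs setstripe -E -1 -c 8 -S 1M ${fileName}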

Does anyone have any ideas what is causing the pauses?  Is there something else 
I could be looking at in the /proc file system to gain insight?

For comparison, the 2nd plot below is the same benchmark run on /tmp.  Note that 
there are some pwrite64() calls that take a long time, but a single pwrite64() 
taking a long time does not stall all the other pwrite64() calls active during the 
same time period.  Elapsed time for /tmp is 13 seconds; Lustre is 28 seconds.  Both 
are essentially memory resident.

Just for completeness, I have added a 3rd plot showing the amount of memory 
each of the OSC clients is consuming over the course of the Lustre run.  
Nothing unusual there.  The memory consumption rate slows down during the 
pauses, as one would expect.

I don't think the instrumentation is the issue, as there is not much more 
instrumentation occurring in the Lustre run than in the /tmp run, and each produces 
less than 6MB of instrumentation data in total.

John

In case the images got stripped, here are some Dropbox URLs:

plot1 : https://www.dropbox.com/s/ey217o053gdyse5/plot1.png?dl=0

plot2 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0

plot3 : https://www.dropbox.com/s/vk23vmufa388l7h/plot2.png?dl=0






Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



Re: [lustre-discuss] lproc stats changed snapshot_time from unix-epoch to uptime/monotonic in 2.15

2022-08-25 Thread Ellis Wilson via lustre-discuss
Thanks for confirming Andreas, and will do!

-Original Message-
From: Andreas Dilger  
Sent: Wednesday, August 24, 2022 8:47 PM
To: Ellis Wilson 
Cc: lustre-discuss@lists.lustre.org
Subject: [EXTERNAL] Re: [lustre-discuss] lproc stats changed snapshot_time from 
unix-epoch to uptime/monotonic in 2.15

Ellis, thanks for reporting this.  This looks like it was a mistake. 

The timestamps should definitely be in wallclock time, but this looks to have 
been changed unintentionally in order to reduce overhead and use a u64 instead of 
dealing with timespec64 math, losing the original intent along the way (there are 
many different ktime_get() variants, all alike).

I think many monitoring tools will be unaffected because they use the delta 
between successive timestamps, but having timestamps that are relative to boot 
time is problematic since they may repeat or go backward after a reboot, and 
some tools may use this timestamp when inserting into a tsdb to avoid 
processing lag. 
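
(For example, a quick way to compare what snapshot_time reports against wallclock 
on a client; just a sketch, and the exact stats output format varies by version:)

   # snapshot_time as reported in the client-side stats
   lctl get_param llite.*.stats | grep snapshot_time
   # wallclock seconds since the epoch, for comparison
   date +%s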

Please file a ticket, and ideally, if you can, submit a patch that converts 
ktime_get() to ktime_get_real_ns() in the places that were changed by the original 
patch (with a "Fixes:" line to track it against that patch, which was commit 
ea2cd3af7b).
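
(i.e. the commit message would carry a trailer along these lines, with the subject 
of the original patch filled in:)

   Fixes: ea2cd3af7b ("<subject line of the original patch>")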

Cheers, Andreas

> On Aug 24, 2022, at 14:50, Ellis Wilson via lustre-discuss 
>  wrote:
> 
> Hi all,
> 
> One of my colleagues noticed that, in testing 2.15.1, the stats returned 
> include snapshot_time showing up in a different fashion than before.  
> Previously, ktime_get_real_ts64() was used to get the current timestamp, and 
> that was presented when stats were printed, whereas now uptime is used, as 
> returned by ktime_get().  Is there a monotonic requirement on snapshot_time 
> that I'm not thinking about that makes ktime_get() more useful?  The previous 
> behavior of getting the current time alongside the stats, so you could reason 
> about when they were taken, made more sense to me.  But perhaps Andreas had a 
> different vision for the use of snapshot_time, given that he was the one who 
> revised it?
> 
> I'm glad to open a bug and propose a patch if this was just a mistake, but 
> figured I'd ask first.
> 
> Best,
> 
> ellis
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org