Re: [lustre-discuss] Joining files

2023-03-30 Thread Patrick Farrell via lustre-discuss
Sven,

It would not be possible to implement purely in userspace today, no.  The 
layout manipulation primitives provided by Lustre don't currently include what 
would be needed for a join.  We would need to provide some sort of new layout 
primitive(s), and there would be some userspace work to hook them together 
correctly to get the desired result.

I will defer to Andreas about what those primitives would be.

-Patrick



From: Sven Willner
Sent: Thursday, March 30, 2023 1:47 AM
To: Andreas Dilger
Cc: Patrick Farrell; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Joining files

Dear Patrick and Andreas,

Thank you very much for your quick and comprehensive replies.

My motivation behind this issue is the following:
At my institute (research around a large earth system/climate model) we are 
evaluating zarr (https://zarr.readthedocs.io) for outputting large 
multi-dimensional arrays. This currently results in a huge number of small 
files, as the responsibility for parallel writing is shifted entirely to the 
file system. However, after closing the respective datasets we could merge those 
files again to reduce the metadata burden on the file system and for easier 
archival, if needed, at a later point - ideally without copying the large amount 
of data again. For read access I would simply create an appropriate 
index/lookup table for the resulting large file, so holes/gaps in the file 
are not a problem as such.

As Patrick writes
>Layout: [ 1 1 1 1 1 1 1 ... up to 20 MiB ][ 2 2 2 2 2 2 ... up to 35 MiB ]
>
>With data from 0-10 MiB and 20 - 30 MiB.
that would be the resulting layout (I guess, minimizing holes could be achieved 
by appropriate striping of the original files and/or a layout adjustment during 
the merge, if possible).

>My expectation is that "join" of two files would be handled at the file EOF 
>and *not* at the layout boundary.  Based on the original description from 
>Sven, I'd think that small gaps in the file (e.g. 4KB for page alignment, 64KB 
>for minimum layout alignment, or 1MB for stripe alignment) would be OK, but 
>tens or hundreds of MB holes would be inefficient for processing.
(Andreas)

Apart from archival, the resulting file would only be accessed locally within the 
boundaries of the original smaller files, so I would not expect the performance 
cost of the gaps to be that critical.

>while I think it is possible to implement this in Lustre, I'd have to ask what 
>requirements are driving your request?  Is this just something you want to 
>test, or is there some real-world usage demand for this (e.g. specific 
>application workload, usage in some popular library, etc)?
(Andreas)

At this stage I am just looking into possibilities for handling this situation - I 
am neither an expert in zarr nor in Lustre.

If such a merge at the file system level turns out to be a route worth taking, I 
would be happy to work on an implementation. However, yes, I would need some 
guidance there. Also, at this point I cannot estimate the amount of work needed 
even to test this approach.

Would the necessary layout manipulation be possible in userspace? (I will have 
a look into the implementations of `lfs migrate` and `lfs mirror extend`).
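
(For reference, the layout-merging command Andreas mentions below is `lfs mirror 
extend` with a victim file, which absorbs that file's layout into an existing file 
as a new mirror without copying the data. I believe the syntax is roughly:

lfs mirror extend -N -f victim_file target_file   # victim_file's layout becomes a new mirror of target_file

though that of course produces a mirror, not a concatenation.)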

Thanks a lot!
Best,
Sven

On Wed, Mar 29, 2023 at 07:41:56PM +, Andreas Dilger wrote:
>Patrick,
>once upon a time there was "file join" functionality in Lustre that was 
>ancient and complex, and was finally removed in 2009.  There are still a few 
>remnants of this like "MDS_OPEN_JOIN_FILE" and "LOV_MAGIC_JOIN_V1" defined, 
>but unused.   That functionality long predated composite file layouts (PFL, 
>FLR), and used an external llog file *per file* to declare a series of other 
>files that described the layout.  It was extremely fragile and complex and 
>thankfully never got into widespread usage.
>
>I think with the advent of composite file layout that it should be _possible_ 
>to implement this kind of functionality purely with layout changes, similar to 
>"lfs migrate" doing layout swap, or "lfs mirror extend" merging the layout of 
>a victim file into another file to create a mirror.
>
>My expectation is that "join" of two files would be handled at the file EOF 
>and *not* at the layout boundary.  Based on the original description from 
>Sven, I'd think that small gaps in the file (e.g. 4KB for page alignment, 64KB 
>for minimum layout alignment, or 1MB for stripe alignment) would be OK, but 
>tens or hundreds of MB holes would be inefficient for processing.
>
>My guess, based on similar requests I've seen previously, and Sven's email 
>address, is that this relates to merging video streams from different files 
>into a single file?
>
>Sven,
>while I think it is 

Re: [lustre-discuss] Joining files

2023-03-29 Thread Patrick Farrell via lustre-discuss
Sven,

The "combining layouts without any data movement" part isn't currently 
possible.  It's probably possible in theory, but it's never been implemented.  
(I'm curious what your use case is?)

Even allowing for data movement, there's no tool to do this for you.  Depending 
what you mean by combining, it's possible to do this with Linux tools (see the 
end of my note), but you're going to have data copying.

It's a bit of an odd requirement, with some inherent questions.  For example, 
file layouts generally go to infinity: if they don't, you will get I/O errors when 
you 'run off the end', i.e., write past the defined layout, so the last component 
is usually defined to extend to infinity.

That poses obvious questions when combining files.

If you're looking to combine files with layouts that do not go to infinity, 
then it's at least straightforward to see how you'd concatenate them.  But 
presumably the data in each file doesn't go to the very end of the layout?  So 
do you want the empty parts of the layout included?

Say file 1 is 10 MiB in size but the layout goes to 20 MiB (again, layouts 
normally should go to infinity) and file 2 is also 10 MiB in size but the 
layout goes to, say, 15 MiB.  Should the result look like this?

Layout: [ 1 1 1 1 1 1 1 ... up to 20 MiB ][ 2 2 2 2 2 2 ... up to 35 MiB ]

With data from 0-10 MiB and 20 - 30 MiB.

That's something you'd have to write a tool for, so it could write the data at 
your specified offset for putting in the second file (and third, etc...).  You 
could also do something like:

lfs setstripe [your layout] combined_file; cat file1 > combined_file; truncate 
combined_file to 20 MiB (the end of the file 1 layout); cat file2 >> 
combined_file; etc.

So, you definitely can't avoid data copying here.  But that's how you could do 
it with simple Linux tools (which you could probably have drawn up yourself :)).

-Patrick


From: lustre-discuss  on behalf of 
Sven Willner 
Sent: Wednesday, March 29, 2023 7:58 AM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] Joining files


Dear all,

I am looking for a way to join/merge/concatenate several files into one, whose 
layout is just the concatenation of the layouts of the respective files - 
ideally without any copying/moving on the data side (even if this would result 
in "holes" in the joined file).

I would very much appreciate any hints about tools or ideas for how to achieve such 
a join. As I understand it, there used to be a `join` command for `lfs`, which is now 
deprecated (however, I am not sure whether a use case like mine was its purpose, 
or why it was deprecated).

Thanks a lot!
Best regards,
Sven

--
Dr. Sven Willner
Scientific Computing Lab (SCLab)
Max Planck Institute for Meteorology
Bundesstraße 53, D-20146 Hamburg, Germany
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Mounting lustre on block device

2023-03-16 Thread Patrick Farrell via lustre-discuss
Lustre doesn't show up in lsblk on the client because it isn't a block device 
on the client.  NFS and other network file systems also don't show up in lsblk, 
for the same reason.

-Patrick

From: lustre-discuss  on behalf of 
Shambhu Raje via lustre-discuss 
Sent: Thursday, March 16, 2023 2:36 PM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] Mounting lustre on block device

When we mount a Lustre file system on a client, the Lustre file system does not 
use a block device on the client side. Instead it uses the virtual file system 
namespace. The mount point is not shown when we do 'lsblk'; it only shows up in 'df -hT'.

How can we mount the Lustre file system on a block device such that, when we write 
something with Lustre, it shows up in a block device??
Can you share the command??

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Question regarding user access during recovery and journal replay

2023-03-14 Thread Patrick Farrell via lustre-discuss
Marc,


[Re-posting to the list...]



No, it’s fine to have interaction during those times. The system is designed to 
do that work online.  Depending what you’re trying to do and what you’re 
accessing, some client operations will experience delays, but that’s it.  For 
example, during failover/recovery for a particular OST or MDT, no new IO to 
that target will complete.  But the user programs will just wait - it’s safe to 
leave them running.



So recovery, etc, will show up to users as delays in some requests, but it’s 
safe to do with users accessing the system.



Regards,

Patrick


From: lustre-discuss  on behalf of 
Marc O'Brien via lustre-discuss 
Sent: Tuesday, March 14, 2023 7:24 AM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] Question regarding user access during recovery and 
journal replay


Hi,

When I was first taught some Lustre file system administration, it was stressed 
that when recovering a Lustre file system and while the journal replay was 
occurring on each host, there should be no user interaction with the file 
system. Any recovery was done with cluster access denied to HPC users, or when 
the cluster was deemed to be quiescent. This seemed to make sense as during 
journal replay the file system is in R/W state, but the distributed file system 
may not have reached a stable state. We now have multiple Lustre file systems 
(2 Ext4 based and 1 ZFS based) and evicting users or finding a quiescent time 
is problematic (luckily there are maintenance windows for the routine stuff).

I have searched online and have yet to see in print that there should be no 
user interaction with Lustre during recovery or journal replay (I may have 
missed it).

So, my question is, is the no cluster user interaction during recovery and 
journal replay restriction, actually a thing?

Thanks in advance for any enlightenment :)

Marc


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-19 Thread Patrick Farrell via lustre-discuss
Well, you could use two file descriptors, one for O_DIRECT and one otherwise. 

SSD is a fast medium, but my instinct is that the desirability of having data in RAM 
is much more about the I/O pattern, and is hard to optimize for in advance - Do you 
read the data you wrote?  (Or read data repeatedly?)

In any case, there's no mechanism today.  It's also relatively marginal if 
we're just doing buffered I/O then forcing the data out - it will reduce memory 
usage but it won't improve performance.

-Patrick


From: John Bauer 
Sent: Thursday, May 19, 2022 1:16 PM
To: Patrick Farrell ; lustre-discuss@lists.lustre.org 

Subject: Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent


Pat,

No, not in general.  It just seems that if one is storing data on an SSD it 
should be optional to have it not also stored in memory (why store in two fast 
mediums?).

O_DIRECT is not of value here, as that would apply to all extents, whether on SSD or 
HDD.  O_DIRECT on Lustre has been problematic for me in the past, performance-wise.

John

On 5/19/22 13:05, Patrick Farrell wrote:
No, and I'm not sure I agree with you at first glance.

Is this just generally an idea that data stored on SSD should not be in RAM?  
If so, there's no mechanism for that other than using direct I/O.

-Patrick

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of John Bauer <bau...@iodoctors.com>
Sent: Thursday, May 19, 2022 12:48 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Avoiding system cache when using ssd pfl extent

When using PFL, and using an SSD as the first extent, it seems it would
be advantageous to not have that extent's file data consume memory in
the client's system buffers.  It would be similar to using O_DIRECT, but
on a per-extent basis.  Is there a mechanism for that already?

Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Avoiding system cache when using ssd pfl extent

2022-05-19 Thread Patrick Farrell via lustre-discuss
No, and I'm not sure I agree with you at first glance.

Is this just generally an idea that data stored on SSD should not be in RAM?  
If so, there's no mechanism for that other than using direct I/O.

-Patrick

From: lustre-discuss  on behalf of 
John Bauer 
Sent: Thursday, May 19, 2022 12:48 PM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] Avoiding system cache when using ssd pfl extent

When using PFL, and using an SSD as the first extent, it seems it would
be advantageous to not have that extent's file data consume memory in
the client's system buffers.  It would be similar to using O_DIRECT, but
on a per-extent basis.  Is there a mechanism for that already?
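
(For context, the kind of layout I mean would be something like the following, 
where the pool names are placeholders for an SSD-backed and an HDD-backed pool:

lfs setstripe -E 256M -p flash -E -1 -p disk somefile   # first 256 MiB on the "flash" pool, the rest on "disk"
)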

Thanks,

John

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Write Performance is Abnormal for max_dirty_mb Value of 2047

2022-03-27 Thread Patrick Farrell via lustre-discuss
Hasan,

Historically, there have been several bugs related to write grant when 
max_dirty_mb is set to large values (depending on a few other details of system 
setup).

Write grant allows the client to write data into memory and write it out 
asynchronously.  When write grant is not available to the client, the client is 
forced to do sync writes at small sizes.  The result looks exactly like this: 
write performance drops severely.

Depending on what version you're running, you may not have fixes for these 
bugs.  You could either try a newer Lustre version (you didn't mention what 
you're running) or just use a smaller value of max_dirty_mb.
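
For reference, a rough sketch of checking and lowering it on a client (the values 
here are only examples):

lctl get_param osc.*.max_dirty_mb         # current per-OST limit on this client
lctl set_param osc.*.max_dirty_mb=1024    # lower it for the running client (not persistent)
# a persistent equivalent, set on the MGS, would be something like:
# lctl set_param -P osc.*.max_dirty_mb=1024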

I am surprised to see you're still seeing a speedup from max_dirty_mb values 
over 1 GiB in size.

Can you describe your system a bit more?  How many OSTs do you have, and how 
many stripes are you using?  max_dirty_mb is a per-OST value on the client, not 
a global one.

-Patrick

From: lustre-discuss  on behalf of 
Hasan Rashid via lustre-discuss 
Sent: Friday, March 25, 2022 11:45 AM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] Write Performance is Abnormal for max_dirty_mb Value 
of 2047

Hi Everyone,

As the manual suggests, the valid value range for max_dirty_mb is values larger 
than 0 and smaller than the lesser of 2048 MiB or 1/4 of the client RAM. In 
my system, the client's RAM is 196 GiB, so the maximum valid value for 
max_dirty_mb (mdm) is 2047 MiB.

However, when we set the max_dirty_mb value to 2047, we see very low write 
throughput for multiple Filebench workloads that we have tested so far. I am 
providing details for one example of the tested workload below.

Workload Detail: We are doing only random write operation of 1MiB size from one 
process and one thread to a single large file of 5GiB size.

Observed Result: As you can see from the diagram below, as we increase the mdm 
value from 768 to 1792 in steps of 256, the write throughput increases 
gradually. However, for the mdm value of 2047, the result drops very 
significantly. The observation holds true for all the workloads we have tested 
so far.


[Figure: write throughput vs. max_dirty_mb value]

I am unable to figure out why we would have such low performance at the mdm 
value of 2047. Please share any insights you have that would be helpful for me 
to understand the aforementioned scenario.

Best Wishes,
Md Hasanur Rashid
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.14 against mofed 5.5.1.0.3.2-rhel7.9

2022-03-07 Thread Patrick Farrell via lustre-discuss
Michael,

Perhaps more importantly, Lustre 2.15 hasn't been released yet.  (In general, 
the recommended matrix is maintenance release to maintenance release - So 2.15 
clients and 2.12 servers will be a recommended configuration, once 2.15 is 
released.)

-Patrick

From: lustre-discuss  on behalf of 
Michael DiDomenico via lustre-discuss 
Sent: Monday, March 7, 2022 12:44 PM
To: Nathan Dauchy 
Cc: lustre-discuss@lists.lustre.org 
Subject: Re: [lustre-discuss] 2.14 against mofed 5.5.1.0.3.2-rhel7.9

On Mon, Mar 7, 2022 at 12:28 PM Nathan Dauchy via lustre-discuss
 wrote:
>
> Likely your issue will be resolved in 2.15.  You could try applying the patch 
> from here:
> https://jira.whamcloud.com/browse/LU-15417
> Build lustre on MOFED 5.5

ah, okay, looks like someone beat me to it...

i'm not sure i'm ready to switch up to 2.15.  does anyone know how far
the server and clients version can deviate?  i'm on 2.12 srv and 2.14
client.  i seem to recall a chart sometime back, but i can't find it
now
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] RE-Fortran IO problem

2022-02-03 Thread Patrick Farrell via lustre-discuss
Denis,

FYI, the git link you provided seems to be non-public - it asks for a GSI login.

Fortran is widely used for applications on Lustre, so it's unlikely to be a 
Fortran-specific issue.  If you're seeing I/O rates drop suddenly during 
activity, rather than being reliably low for some particular operation, I would 
look to the broader Lustre system.  It may be suddenly extremely busy, or there 
could be, e.g., a temporary network issue - assuming this is a system belonging 
to your institution, I'd check with your admins.

Regards,
Patrick

From: lustre-discuss  on behalf of 
Bertini, Denis Dr. 
Sent: Thursday, February 3, 2022 6:43 AM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] RE-Fortran IO problem


Hi,


Just as an add-on to my previous mail, the problem also shows up with Intel 
Fortran; it is not specific to the GNU Fortran compiler. So it seems to be linked 
to how the Fortran I/O is handled, which seems to be sub-optimal in the case of a 
Lustre filesystem.

I would be grateful if someone could confirm or disconfirm that.


Here again the access to the code i used for my benchmarks:


https://git.gsi.de/hpc/cluster/ci_ompi/-/tree/main/f/src


Best,

Denis


-
Denis Bertini
Abteilung: CIT
Ort: SB3 2.265a

Tel: +49 6159 71 2240
Fax: +49 6159 71 2986
E-Mail: d.bert...@gsi.de

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the GSI Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
Ministerialdirigent Dr. Volkmar Dietz
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O (2.14/2.15)

2022-01-19 Thread Patrick Farrell via lustre-discuss
Ellis,

As you may have guessed, that function set just looks like a node which is 
doing buffered I/O and thrashing for memory.  No particular insight is available 
from the counts of those functions.

Would you consider opening a bug report in the Whamcloud JIRA?  You should have 
enough for a good report; here are a few things that would be helpful as well:

It sounds like you can hang the node on demand.  If you could collect stack 
traces with:

echo t > /proc/sysrq-trigger

after creating the hang, that would be useful.  (It will print to dmesg.)

You've also collected debug logs - Could you include, say, the last 100 MiB of 
that log set?  That should be reasonable to attach if compressed.
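
For completeness, here's roughly how I'd capture both (the debug settings and 
file path are only examples):

echo t > /proc/sysrq-trigger              # dump all task stack traces to dmesg
lctl set_param debug=-1 debug_mb=1024     # enable full Lustre debug logging with a larger buffer
# ... reproduce the hang ...
lctl dk /tmp/lustre-debug.log             # dump (and clear) the Lustre kernel debug log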

Regards,
Patrick


From: lustre-discuss  on behalf of 
Ellis Wilson via lustre-discuss 
Sent: Wednesday, January 19, 2022 8:32 AM
To: Andreas Dilger 
Cc: lustre-discuss@lists.lustre.org 
Subject: Re: [lustre-discuss] Lustre Client Lockup Under Buffered I/O 
(2.14/2.15)


Hi Andreas,



Apologies in advance for the top-post.  I’m required to use Outlook for work, 
and it doesn’t handle in-line or bottom-posting well.



Client-side defaults prior to any tuning of mine (this is a very minimal 
1-client, 1-MDS/MGS, 2-OSS cluster):

~# lctl get_param llite.*.max_cached_mb
llite.lustrefs-8d52a9c52800.max_cached_mb=
users: 5
max_cached_mb: 7748
used_mb: 0
unused_mb: 7748
reclaim_count: 0
~# lctl get_param osc.*.max_dirty_mb
osc.lustrefs-OST-osc-8d52a9c52800.max_dirty_mb=1938
osc.lustrefs-OST0001-osc-8d52a9c52800.max_dirty_mb=1938
~# lctl get_param osc.*.max_rpcs_in_flight
osc.lustrefs-OST-osc-8d52a9c52800.max_rpcs_in_flight=8
osc.lustrefs-OST0001-osc-8d52a9c52800.max_rpcs_in_flight=8
~# lctl get_param osc.*.max_pages_per_rpc
osc.lustrefs-OST-osc-8d52a9c52800.max_pages_per_rpc=1024
osc.lustrefs-OST0001-osc-8d52a9c52800.max_pages_per_rpc=1024



Thus far I’ve reduced the following to what I felt were really conservative 
values for a 16GB RAM machine:



~# lctl set_param llite.*.max_cached_mb=1024
llite.lustrefs-8d52a9c52800.max_cached_mb=1024
~# lctl set_param osc.*.max_dirty_mb=512
osc.lustrefs-OST-osc-8d52a9c52800.max_dirty_mb=512
osc.lustrefs-OST0001-osc-8d52a9c52800.max_dirty_mb=512
~# lctl set_param osc.*.max_pages_per_rpc=128
osc.lustrefs-OST-osc-8d52a9c52800.max_pages_per_rpc=128
osc.lustrefs-OST0001-osc-8d52a9c52800.max_pages_per_rpc=128
~# lctl set_param osc.*.max_rpcs_in_flight=2
osc.lustrefs-OST-osc-8d52a9c52800.max_rpcs_in_flight=2
osc.lustrefs-OST0001-osc-8d52a9c52800.max_rpcs_in_flight=2



This slows down how fast I get to basically OOM from <10 seconds to more like 
25 seconds, but the trend is identical.



As an example of what I’m seeing on the client, you can see below we start with 
most free, and then iozone rapidly (within ~10 seconds) causes all memory to be 
marked used, and that stabilizes at about 140MB free until at some point it 
stalls for 20 or more seconds and then some has been synced out:

~# dstat --mem

--memory-usage-

used  free  buff  cach

1029M 13.9G 2756k  215M

1028M 13.9G 2756k  215M

1028M 13.9G 2756k  215M

1088M 13.9G 2756k  215M

2550M 11.5G 2764k 1238M

3989M 10.1G 2764k 1236M

5404M 8881M 2764k 1239M

6831M 7453M 2772k 1240M

8254M 6033M 2772k 1237M

9672M 4613M 2772k 1239M

10.6G 3462M 2772k 1240M

12.1G 1902M 2772k 1240M

13.4G  582M 2772k 1240M

13.9G  139M 2488k 1161M

13.9G  139M 1528k 1174M

13.9G  140M  896k 1175M

13.9G  139M  676k 1176M

13.9G  142M  528k 1177M

13.9G  140M  484k 1188M

13.9G  139M  492k 1188M

13.9G  139M  488k 1188M

13.9G  141M  488k 1186M

13.9G  141M  480k 1187M

13.9G  139M  492k 1188M

13.9G  141M  600k 1188M

13.9G  139M  580k 1187M

13.9G  140M  536k 1186M

13.9G  141M  668k 1186M

13.9G  139M  580k 1188M

13.9G  140M  568k 1187M

12.7G 1299M 2064k 1197M missed 20 ticks <-- client is totally unresponsive 
during this time

11.0G 2972M 5404k 1238M^C



Additionally, I’ve messed with sysctl settings.  Defaults:

vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500

Revised to conservative values:

vm.dirty_background_bytes = 1073741824
vm.dirty_background_ratio = 0
vm.dirty_bytes = 2147483648
vm.dirty_expire_centisecs = 200
vm.dirty_ratio = 0
vm.dirty_writeback_centisecs = 500



No observed improvement.



I’m going to trawl two logs today side-by-side, one with ldiskfs backing the 
OSTs, and one with zfs backing the OSTs, and see if I can see what the 
differences are since the zfs-backed version never gave us this problem.  The 
only other potentially useful thing I can share right now is that when I turned 
on full debug logging and ran the test until I hit OOM, the following were the 
most frequently hit functions in the logs (count, descending, is the first 
column).  

Re: [lustre-discuss] CPU soft lockup on mkfs.lustre

2019-09-11 Thread Patrick Farrell
Tamas, Aurélien,

Would one of you mind opening an LU on this?

Thanks,
- Patrick

From: lustre-discuss  on behalf of 
Tamas Kazinczy 
Sent: Wednesday, September 11, 2019 1:32:09 AM
To: Degremont, Aurelien ; lustre-discuss@lists.lustre.org 

Subject: Re: [lustre-discuss] CPU soft lockup on mkfs.lustre



On 2019. 09. 10. 19:22, Degremont, Aurelien wrote:

I saw the same issue and downgraded my kernel back to RHEL kernel 3.10.0-957.



But you probably wants to keep 7.7 kernel ? :)


If possible, yes. :)


Cheers,

--
Tamás Kazinczy
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Group and Project quota enforcement semantics

2019-08-05 Thread Patrick Farrell
Steve,


The Lustre quota behavior is the standard Linux file system quota behavior - 
All data written by a user/group/in a project directory counts against all 
applicable quotas.  You'll see the same if using quotas on EXT4, XFS, etc.


Additionally (you didn't ask, but this is a common related question...), 
whenever you hit a quota limit, that's it - That means the effective limit is: 
MIN(user, project, group) (assuming project applies to the location where 
you're writing data).
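
For example, a quick way to see which of the three limits will bite first (the 
IDs and mount point here are placeholders):

lfs quota -u someuser /mnt/lustre      # user usage and limits
lfs quota -g somegroup /mnt/lustre     # group usage and limits
lfs quota -p 1001 /mnt/lustre          # project usage and limits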


- Patrick


From: lustre-discuss  on behalf of 
Steve Barnet 
Sent: Monday, August 5, 2019 9:59:10 AM
To: lustre-discuss@lists.lustre.org 
Subject: [lustre-discuss] Group and Project quota enforcement semantics

Hi all,

   I am starting to work with project and group quotas since they
seem like they would be a good fit for one of our newer filesystems.

I think that I have the system configured correctly, and project
quotas enabled on all OSTs and the MDT (lustre version 2.12.2).
Basic tests appear to work correctly so far.

   The question I have is related to how group/project quotas
relate to user quotas. In particular, from the manual, it states
that if a group quota exists, the blocks written count against
that quota. My assumption was that those blocks would not also
count against the user's individual quota. However, it appears
that this is not the case, and in fact the quota applies to both
the user and the group.

So the question is: does the behavior above sound correct, or have
I missed something?

And generally, it looks like the filesystem configuration itself is
correct, but if I need to supply more detail, please let me know.

Thanks much!

Best,

---Steve



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lnet.service reporting failure on start

2019-06-24 Thread Patrick Farrell
This is correct, and normal.  It's not really a failure (in the sense of being a 
problem), it's just that you're using modules that aren't signed with 
the key used by your kernel vendor.  In general, if you're getting third-party 
modules (i.e. not from your kernel vendor), this happens, because the point of 
the signing is your kernel vendor verifying/certifying the module is theirs, 
and, well, it's not.

A version of this message has probably been present every time you’ve loaded 
Lustre since you first used a kernel with signing enabled, which I think came 
in somewhere in the kernels used by RHEL6/SLES11, so some years back.

Nothing to worry about.

From: lustre-discuss  on behalf of 
Shaun Tancheff 
Sent: Monday, June 24, 2019 4:21:18 PM
To: Kurt Strosahl; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lnet.service reporting failure on start


My understanding, in general terms, is that the messages:



[   26.653699] libcfs: loading out-of-tree module taints kernel.

[   26.665567] libcfs: module verification failed: signature and/or required 
key missing - tainting kernel



Suggests that the libcfs.ko kernel module is not signed.



Note that kernel module signing is different from the rpm package signing.



From: lustre-discuss  on behalf of 
Kurt Strosahl 
Date: Monday, June 24, 2019 at 3:03 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] lnet.service reporting failure on start



Good Afternoon,



I recently stood up a new Lustre client. When I run systemctl start 
lnet.service, systemd reports a failure.  However, an lnetctl net show displays 
all the right information and I can reach other nodes with lctl ping.  Further, 
the Lustre file system mounts properly.



Looking in dmesg I see right as the lnet service starts to mount the following:



[   26.653699] libcfs: loading out-of-tree module taints kernel.

[   26.665567] libcfs: module verification failed: signature and/or required 
key missing - tainting kernel

[   26.683188] LNet: HW NUMA nodes: 1, HW CPU cores: 8, npartitions: 2

[   26.725826] alg: No test for adler32 (adler32-zlib)

[   27.674142] Lustre: Lustre: Build Version: 2.10.7



The system is RHEL 7.6 kernel 3.10.0-957.21.2.el7.x86_64

lustre-client-dkms-2.10.7-1.el7.noarch

lustre-client-2.10.7-1.el7.x86_64



Has anyone seen this message, or does anyone know what might be causing it, or 
why systemd is reporting a failure even though it clearly works?



w/r,

Kurt J. Strosahl
System Administrator: Lustre, HPC
Scientific Computing Group, Thomas Jefferson National Accelerator Facility
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Stop writes for users

2019-05-14 Thread Patrick Farrell
Connections on demand has been done and is not relevant here – It just idles 
unused connections to save resources, no impact on ability to write, etc.


- Patrick

From: lustre-discuss  on behalf of 
Alexander I Kulyavtsev 
Date: Tuesday, May 14, 2019 at 11:42 AM
To: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Stop writes for users


There was feature request, and there were corresponding LU:

LU-5703 - Lustre quiesce

LU-7236 - connections on demand



Alex.


From: lustre-discuss  on behalf of 
Robert Redl 
Sent: Tuesday, May 14, 2019 10:36 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Stop writes for users


I don't know if this really works for this use case, but newer Lustre versions 
have the ability to create a write barrier, which is normally part of the 
snapshot process.

Have a look at lctl barrier_freeze.
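
A rough sketch of how that could be used (the filesystem name "myfs" is a 
placeholder, and I believe these lctl commands are run on the MGS - please check 
the lctl documentation for your version):

lctl barrier_freeze myfs 120   # block new modifications on "myfs" for up to 120 seconds
lctl barrier_stat myfs         # check the current barrier state
lctl barrier_thaw myfs         # lift the barrier early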
On 5/14/19 5:25 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:


On May 13, 2019, at 6:51 PM, Fernando Pérez 
 wrote:



Is there a way to stop file writes for all users or for groups without using 
quotas?

We have a Lustre filesystem with corrupted quotas and I need to stop writes 
for all users (or for some users).

There are ways to deactivate OSTs, but those are intended to stop creation of 
new file objects on those OSTs and don’t actually stop writes to existing 
files.  I don’t think that mounting OSTs read-only  (with “mount -t lustre -o 
ro …”) works because Lustre updates some info when it mounts the target (but 
this might be based on old info so I could be wrong).  You could remount all 
the clients read-only, but I don’t know if this is practical for you.



The only other option I can think of would be if there was a client-side 
parameter that could be set via “lctl conf_param” that might cause the clients 
to treat all the targets as read-only.  But if there is such a parameter, I am 
not familiar with it.



--

Rick Mohr

Senior HPC System Administrator

National Institute for Computational Sciences

http://www.nics.tennessee.edu



___

lustre-discuss mailing list

lustre-discuss@lists.lustre.org

http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
--

Dr. Robert Redl
Scientific Programmer, "Waves to Weather" (SFB/TRR165)
Meteorologisches Institut
Ludwig-Maximilians-Universität München
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10 <-> 2.12 interoperability?

2019-05-03 Thread Patrick Farrell
Thomas,


As a general rule, Lustre only supports mixing versions on servers for rolling 
upgrades.


- Patrick


From: lustre-discuss  on behalf of 
Thomas Roth 
Sent: Wednesday, April 24, 2019 3:54:09 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] 2.10 <-> 2.12 interoperability?

Hi all,

OS=CentOS 7.5
Lustre 2.10.6

One of the OSS (one OST only) was upgraded to zfs 0.7.13, and LU-11507 forced 
an upgrade of Lustre to 2.12

Mounts, reconnects, recovers, but then is unusable, and the MDS reports:

Lustre: 13650:0:(mdt_handler.c:5350:mdt_connect_internal()) test-MDT: client
test-MDT-lwp-OST0002_UUID does not support ibits lock, either very old or 
an invalid client: flags
0x204140104320


So far I have not found any hints that these versions would not cooperate, or 
that I should have set a
certain parameter.
LU-10175 indicates that the ibits have some connection to data-on-mdt which we 
don't use.

Any suggestions?


Regards,
Thomas

--

Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Ursula Weyrich, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Georg Schütte
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client memory and MemoryAvailable

2019-04-29 Thread Patrick Farrell
Neil,

My understanding is marking the inode cache reclaimable would make Lustre 
unusual/unique among Linux file systems.  Is that incorrect?

- Patrick

From: lustre-discuss  on behalf of 
NeilBrown 
Sent: Monday, April 29, 2019 8:53:43 PM
To: Jacek Tomaka
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable

On Mon, Apr 29 2019, Jacek Tomaka wrote:

>> so lustre_inode_cache is the real culprit when signal_cache appears to
>>  be large.
>> This cache is slaved on the common inode cache, so there should be one
>> entry for each lustre inode that is in memory.
>> These inodes should get pruned when they've been inactive for a while.
>
> What triggers the prunning?
>

Memory pressure.
The approximate approach is to try to free some unused pages and about 1/2000th of
the entries in each slab.  Then, if that hasn't made enough space
available, try again.

>>If you look in /proc/sys/fs/inode-nr  there should be two numbers:
>>  The first is the total number of in-memory inodes for all filesystems.
>>  The second is the number of "unused" inodes.
>>
>>  When you write "3" to drop_caches, the second number should drop down to
>> nearly zero (I get 95 on my desktop, down from 6524).
>
> Ok, that is useful to know but echoing 3 to drop_cache or generating memory
> pressure
> clears most of the signal_cache (inode) as well as other lustre objects, so
> this is working fine.

Oh good, I hadn't remembered clearly what the issue was.

>
> The issue that remains is that they are marked as SUnreclaim vs
> SReclaimable.

Yes, I think lustre_inode_cache should certainly be flagged as
SLAB_RECLAIM_ACCOUNT.
If the SReclaimable value is too small (and there aren't many
reclaimable pagecache pages), vmscan can decide not to bother.  This is
probably a fairly small risk but it is possible that the missing
SLAB_RECLAIM_ACCOUNT flag can result in memory not being reclaimed when
it could be.

Thanks,
NeilBrown


> So i do not think there is a memory leak per se.
>
> Regards.
> Jacek Tomaka
>
> On Mon, Apr 29, 2019 at 1:39 PM NeilBrown  wrote:
>
>>
>> Thanks Jacek,
>>  so lustre_inode_cache is the real culprit when signal_cache appears to
>>  be large.
>>  This cache is slaved on the common inode cache, so there should be one
>>  entry for each lustre inode that is in memory.
>>  These inodes should get pruned when they've been inactive for a while.
>>
>>  If you look in /proc/sys/fs/inode-nr  there should be two numbers:
>>   The first is the total number of in-memory inodes for all filesystems.
>>   The second is the number of "unused" inodes.
>>
>>  When you write "3" to drop_caches, the second number should drop down to
>>  nearly zero (I get 95 on my desktop, down from 6524).
>>
>>  When signal_cache stays large even after the drop_caches, it suggest
>>  that there are lots of lustre inodes that are thought to be still
>>  active.   I'd have to do a bit of digging to understand what that means,
>>  and a lot more to work out why lustre is holding on to inodes longer
>>  than you would expect (if that actually is the case).
>>
>>  If an inode still has cached data pages attached that cannot easily be
>>  removed, it will not be purged even if it is unused.
>>  So if you see the "unused" number remaining high even after a
>>  "drop_caches", that might mean that lustre isn't letting go of cache
>>  pages for some reason.
>>
>> NeilBrown
>>
>>
>>
>> On Mon, Apr 29 2019, Jacek Tomaka wrote:
>>
>> > Wow, Thanks Nathan and NeilBrown.
>> > It is great to learn about slub merging. It is awesome to have a
>> > reproducer.
>> > I am yet to trigger my original problem with slurm_nomerge but
>> > slabinfo tool (in kernel sources) can actually show merged caches:
>> > kernel/3.10.0-693.5.2.el7/tools/slabinfo  -a
>> >
>> > :t-112   <- sysfs_dir_cache kernfs_node_cache blkdev_integrity
>> > task_delay_info
>> > :t-144   <- flow_cache cl_env_kmem
>> > :t-160   <- sigqueue lov_object_kmem
>> > :t-168   <- lovsub_object_kmem osc_extent_kmem
>> > :t-176   <- vvp_object_kmem nfsd4_stateids
>> > :t-192   <- ldlm_resources kiocb cred_jar inet_peer_cache key_jar
>> > file_lock_cache kmalloc-192 dmaengine-unmap-16 bio_integrity_payload
>> > :t-216   <- vvp_session_kmem vm_area_struct
>> > :t-256   <- biovec-16 ip_dst_cache bio-0 ll_file_data kmalloc-256
>> > sgpool-8 filp request_sock_TCP rpc_tasks request_sock_TCPv6
>> > skbuff_head_cache pool_workqueue lov_thread_kmem
>> > :t-264   <- osc_lock_kmem numa_policy
>> > :t-328   <- osc_session_kmem taskstats
>> > :t-576   <- kioctx xfrm_dst_cache vvp_thread_kmem
>> > :t-0001152   <- signal_cache lustre_inode_cache
>> >
>> > It is not on a machine that had the problem i described before but the
>> > kernel version is the same so I am assuming the cache merges are the
>> same.
>> >
>> > Looks like signal_cache points to lustre_inode_cache.
>> > Regards.
>> > Jacek Tomaka
>> >
>> >
>> > 

Re: [lustre-discuss] lfs find

2019-04-26 Thread Patrick Farrell
Would you mind listing current lfs find options to help kickstart discussion?

It seems like I might want it for lots of them, maybe close to all.  For 
example, stripe size seems at first (to me) like it wouldn't be useful, but what if 
I want to check whether anyone is using a weird stripe size?  I expect the stripe 
size to be 1 MiB, and if I can search for ! that, then I can find users who set 
weird stripe sizes and help them fix it.
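
As a hypothetical example of that kind of query (the path is a placeholder, and I 
believe '!' before an option is how lfs find expresses negation):

lfs find /mnt/lustre -type f ! --stripe-size 1M    # regular files whose stripe size is not 1 MiB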


- Patrick


From: lustre-discuss  on behalf of 
Vitaly Fertman 
Sent: Friday, April 26, 2019 10:35:54 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] lfs find

Hi

During a discussion of a bug in lfs find, an improvement idea appeared; it is well 
described by Andreas below. This thread is to discuss which options may need this 
functionality.


> On 26 Apr 2019, at 03:41, Andreas Dilger  wrote:
>
>  lfs find ! --pool HDD ...
>
> should IMHO find files that do not have any instantiated components in pool 
> HDD, rather than files that have any component not on HDD.
>
> That said, I could imagine that we may need to make some parameters more 
> flexible, like adding "--pool ="  to allow specifying all 
> components on the specified pool, and possibly + to specify "at least one 
> component" (which would be the same as without "+" but may be more clear to 
> some users)?
>
> A similar situation arose with "-mode" for regular find (any vs. all bits) 
> that took a while to sort out, so we should learn from what they did and get 
> it right.

—
Vitaly Fertman
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] State of arm client?

2019-04-25 Thread Patrick Farrell
Also, you’ll need (I think?) fairly new Pis - Lustre only supports ARM64 and 
older ones were 32 bit.

- Patrick

From: lustre-discuss  on behalf of 
Peter Jones 
Sent: Wednesday, April 24, 2019 11:08:38 PM
To: Andrew Elwell; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] State of arm client?


Andrew



You will need to use 2.12.x (and 2.12.1 is in final release testing so would be 
a good bet if you can wait a short while)



Peter



From: lustre-discuss  on behalf of 
Andrew Elwell 
Date: Wednesday, April 24, 2019 at 7:31 PM
To: "lustre-discuss@lists.lustre.org" 
Subject: [lustre-discuss] State of arm client?



Hi folks,



I remember seeing a press release by DDN/Whamcloud last November that they were 
going to support ARM, but can anyone point me to the current state of client?



I'd like to deploy it onto a raspberry pi cluster (only 4-5 nodes) ideally on 
raspbian for demo / training purposes. (Yes I know it won't *quite* be 
infiniband performance, but as it's hitting a VM based set of lustre servers, 
that's the least of my worries). Ideally 2.10.x, but I'd take a 2.12 client if 
it can talk to 2.10.x servers





Many thanks

Andrew
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client memory and MemoryAvailable

2019-04-14 Thread Patrick Farrell
Jacek,

“Accounting looks better when Lustre is not involved ;) Seriously, how
can i help? Should i raise a bug? Try to provide a patch?”
A patch is always welcome.

We require a bug in our JIRA (jira.whamcloud.com) to submit a patch against.  
(See here for instructions for our Gerrit: 
https://wiki.whamcloud.com/plugins/servlet/mobile?contentId=725#content/view/725)

For this one, if we’re going Neil’s suggested direction, I’d love a little bit 
of convincing that other file systems mark their shrinker associated caches as 
reclaimable.  (Though if Neil insists they *should* do so, then that counts for 
a lot.)

No idea about the signal cache (I believe it’s for timers, and Lustre shouldn’t 
be using an unusual number of those), would be interested if Neil has anything 
to add there.

- Patrick

From: Jacek Tomaka 
Sent: Sunday, April 14, 2019 9:10:32 PM
To: Patrick Farrell
Cc: NeilBrown; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable

Thanks Patrick for getting the ball rolling!

>1/ w.r.t drop_caches, "2" is *not* "inode and dentry".  The '2' bit
>  causes all registered shrinkers to be run, until they report there is
>  nothing left that can be discarded.  If this is taking 10 minutes,
>  then it seems likely that some shrinker is either very inefficient, or
>  is reporting that there is more work to be done, when really there
>  isn't.

This is a pretty common problem on this hardware. KNL's CPU is running
at ~1.3GHz, so anything that is not multi-threaded can take a few times longer
than on a "normal" Xeon. While it would be nice to improve this (by running it in 
multiple threads), this is not the problem here. However i can provide you with the 
kernel call stack next time i see it, if you are interested.


> 1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
>   reclaims anything that can be reclaimed immediately.

Awesome. I would just like to know how much easily available memory
there is on the system without actually reclaiming it and seeing, ideally using
normal kernel mechanisms but if lustre provides a procfs entry where i can
get it, it will solve my immediate problem.

>4/ Patrick is right that accounting is best-effort.  But we do want it
>  to improve.

Accounting looks better when Lustre is not involved ;) Seriously, how
can i help? Should i raise a bug? Try to provide a patch?

>Just last week there was a report
>  https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
>  about making slab-allocated objects movable.  If/when that gets off
>  the ground, it should help the fragmentation problem, so more of the
>  pages listed as reclaimable should actually be so.

This is a very interesting article. While memory fragmentation makes it more
difficult to use huge pages, it is not directly related to the problem of 
lustre kernel
memory allocation accounting. It will be good to see movable slabs, though.

Also i am not sure how the high signal_cache can be explained and if anything 
can be
done on the Lustre level?

Regards.
Jacek Tomaka



On Mon, Apr 15, 2019 at 8:55 AM Patrick Farrell <pfarr...@whamcloud.com> wrote:
1. Good to know, thank you.  I hadn’t looked at the code, I was unaware it runs 
through all the shrinkers.

2. Right, I know - the article was about it when it was an alias for 
reclaimable, and hence describes some of the behavior of reclaimable.

3. Interesting, that’s good to know.  I would note that it doesn’t seem to be 
standard practice in other file systems, though I didn’t look at how many 
shrinkers they’re registering.  Perhaps having special shrinkers is what’s 
unusual.

BTW, re: mailing list, this is the first devel appropriate thing I’ve seen on 
discuss in a long while.  I should instead have encouraged Jacek to use 
lustre-devel :)

- Patrick
________
From: NeilBrown <ne...@suse.com>
Sent: Sunday, April 14, 2019 6:38:47 PM
To: Patrick Farrell; Jacek Tomaka; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable


(that for the Cc Patrick - maybe I should subscribe to
lustre-discuss...)

1/ w.r.t drop_caches, "2" is *not* "inode and dentry".  The '2' bit
  causes all registered shrinkers to be run, until they report there is
  nothing left that can be discarded.  If this is taking 10 minutes,
  then it seems likely that some shrinker is either very inefficient, or
  is reporting that there is more work to be done, when really there
  isn't.

  lustre registers 5 shrinkers.  Any memcache which is not affected by
  those shrinkers should *not* be marked SLAB_RECLAIM_ACCOUNT (unless
  they are indirectly shrunk by a system shrinker - e.g. if they are
  slaves to the icache or dcache).  Any which are pr

Re: [lustre-discuss] Lustre client memory and MemoryAvailable

2019-04-14 Thread Patrick Farrell
1. Good to know, thank you.  I hadn’t looked at the code, I was unaware it runs 
through all the shrinkers.

2. Right, I know - the article was about it when it was an alias for 
reclaimable, and hence describes some of the behavior of reclaimable.

3. Interesting, that’s good to know.  I would note that it doesn’t seem to be 
standard practice in other file systems, though I didn’t look at how many 
shrinkers they’re registering.  Perhaps having special shrinkers is what’s 
unusual.

BTW, re: mailing list, this is the first devel appropriate thing I’ve seen on 
discuss in a long while.  I should instead have encouraged Jacek to use 
lustre-devel :)

- Patrick

From: NeilBrown 
Sent: Sunday, April 14, 2019 6:38:47 PM
To: Patrick Farrell; Jacek Tomaka; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable


(that for the Cc Patrick - maybe I should subscribe to
lustre-discuss...)

1/ w.r.t drop_caches, "2" is *not* "inode and dentry".  The '2' bit
  causes all registered shrinkers to be run, until they report there is
  nothing left that can be discarded.  If this is taking 10 minutes,
  then it seems likely that some shrinker is either very inefficient, or
  is reporting that there is more work to be done, when really there
  isn't.

  lustre registers 5 shrinkers.  Any memcache which is not affected by
  those shrinkers should *not* be marked SLAB_RECLAIM_ACCOUNT (unless
  they are indirectly shrunk by a system shrinker - e.g. if they are
  slaves to the icache or dcache).  Any which are probably can be.

1a/ "echo 3 > drop_caches" does the easy part of memory reclaim: it
   reclaims anything that can be reclaimed immediately.  It doesn't
   trigger write-back, and it doesn't start the oom-killer, but all
   caches are flushed of everything that is not currently in use, and
   does not need to be written-out first.  If you run "sync" first,
   there should be nothing to write out, so it should drop a lot more.

2/ GFP_TEMPORARY is gone, it was never really well defined.  Best to
   ignore it.

3/ I don't *think* __GFP_RECLAIMABLE has a very big effect.  It
   primarily tries to keep non-reclaimable allocations together so they
   don't cause too much fragmentation.  To do this, it groups them
   separately from reclaimable allocations.  So if a shrinker is
   expected to do anything useful, then it makes sense to tag the
   related slabs as RECLAIMABLE.

4/ Patrick is right that accounting is best-effort.  But we do want it
   to improve.  Just last week there was a report
 https://lwn.net/SubscriberLink/784964/9ddad7d7050729e1/
   about making slab-allocated objects movable.  If/when that gets off
   the ground, it should help the fragmentation problem, so more of the
   pages listed as reclaimable should actually be so.

NeilBrown

On Sun, Apr 14 2019, Patrick Farrell wrote:

> echo 1 > drop_caches does not generate memory pressure - it requests that the 
> page cache be cleared.  It would not be expected to affect slab caches much.
>
> You could try 3 (1+2 in this case, where 2 is inode and dentry).  That might 
> do a bit more because some (maybe many?) of those objects you're looking at 
> would go away if the associated inodes or dentries were removed.  But 
> fundamentally, drop caches does not generate memory pressure, and does not 
> force reclaim.  It drops specific, identified caches.
>
> The only way to force *reclaim* is memory pressure.
>
> Your note that a lot more memory than expected was freed under pressure does 
> tell us something, though.
>
> It's conceivable Lustre needs to set SLAB_RECLAIM_ACCOUNT on more of its slab 
> caches, so this piqued my curiosity.  My conclusion is no, here's why:
>
> The one quality reference I was quickly able to find suggests setting 
> SLAB_RECLAIM_ACCOUNT wouldn't be so simple:
> https://lwn.net/Articles/713076/
>
> GFP_TEMPORARY is - in practice - just another name for __GFP_RECLAIMABLE, and 
> setting SLAB_RECLAIM_ACCOUNT is equivalent to setting __GFP_RECLAIMABLE.  
> That article suggests caution is needed, as this should only be used for 
> memory that is certain to be easily available, because using this flag 
> changes the allocation behavior on the assumption that the memory can be 
> quickly freed at need.  That is often not true of these Lustre objects.
>
> An easy way to learn more about this sort of question is to compare to other 
> actively developed file systems in the kernel...
>
> Comparing to other file systems, we see that in general, only the inode cache 
> is allocated with SLAB_RECLAIM_ACCOUNT (it varies a bit).
>
> XFS, for example, has only one use of KM_ZONE_RECLAIM, its name for this flag 
> - the inode cache:
> "
> xfs_inode_zone =
> kmem_zone_ini

Re: [lustre-discuss] Lustre client memory and MemoryAvailable

2019-04-14 Thread Patrick Farrell
echo 1 > drop_caches does not generate memory pressure - it requests that the 
page cache be cleared.  It would not be expected to affect slab caches much.

You could try 3 (1+2 in this case, where 2 is inode and dentry).  That might do 
a bit more because some (maybe many?) of those objects you're looking at would 
go away if the associated inodes or dentries were removed.  But fundamentally, 
drop caches does not generate memory pressure, and does not force reclaim.  It 
drops specific, identified caches.

The only way to force *reclaim* is memory pressure.

Your note that a lot more memory than expected was freed under pressure does 
tell us something, though.

It's conceivable Lustre needs to set SLAB_RECLAIM_ACCOUNT on more of its slab 
caches, so this piqued my curiosity.  My conclusion is no, here's why:

The one quality reference I was quickly able to find suggests setting 
SLAB_RECLAIM_ACCOUNT wouldn't be so simple:
https://lwn.net/Articles/713076/

GFP_TEMPORARY is - in practice - just another name for __GFP_RECLAIMABLE, and 
setting SLAB_RECLAIM_ACCOUNT is equivalent to setting __GFP_RECLAIMABLE.  That 
article suggests caution is needed, as this should only be used for memory that 
is certain to be easily available, because using this flag changes the 
allocation behavior on the assumption that the memory can be quickly freed at 
need.  That is often not true of these Lustre objects.

An easy way to learn more about this sort of question is to compare to other 
actively developed file systems in the kernel...

Comparing to other file systems, we see that in general, only the inode cache 
is allocated with SLAB_RECLAIM_ACCOUNT (it varies a bit).

XFS, for example, has only one use of KM_ZONE_RECLAIM, its name for this flag - 
the inode cache:
"
xfs_inode_zone =
kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD,
xfs_fs_inode_init_once);
"

btrfs is the same, just the inode cache.  EXT4 has a *few* more caches marked 
this way, but not everything.

So, no - I don't think so.  It would be atypical for Lustre to set 
SLAB_RECLAIM_ACCOUNT on its slab caches for internal objects.  Presumably this 
sort of thing is not considered reclaimable enough for this purpose.

I believe if you tried similar tests with other complex file systems (XFS might 
be a good start), you'd see broadly similar behavior.  (Lustre is probably a 
bit worse because it has a more complex internal object model, so more slab 
caches.)

VM accounting is distinctly imperfect.  The design is such that it's often 
impossible to know how much memory could be made available without actually 
going and trying to free it.  There are good, intrinsic reasons for some of 
that, and some of that is design artifacts...

I've copied in Neil Brown, who I think only reads lustre-devel, just in case he 
has some particular input on this.

Regards,
- Patrick

From: lustre-discuss  on behalf of 
Jacek Tomaka 
Sent: Sunday, April 14, 2019 3:12:51 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre client memory and MemoryAvailable

Actually i think it is just a bug with the way slab caches are created. Some of 
them should be passed a flag that they are reclaimable.
i.e. something like:
https://patchwork.kernel.org/patch/9360819/

Regards.
Jacek Tomaka

On Sun, Apr 14, 2019 at 3:27 PM Jacek Tomaka 
mailto:jac...@dug.com>> wrote:
Hello,

TL;DR;
Is there a way to figure out how much memory Lustre will make available under 
memory pressure?

Details:
We are running lustre client on a machine with 128GB of memory (Centos 7) Intel 
Phi KNL machines and at certain situations we see that there can be ~10GB+ of 
memory allocated on the kernel side i.e. :

vvp_object_kmem   3535336 3536986    176   46    2 : tunables    0    0    0 : slabdata  76891  76891      0
ll_thread_kmem      33511   33511    344   47    4 : tunables    0    0    0 : slabdata    713    713      0
lov_session_kmem    34760   34760    592   55    8 : tunables    0    0    0 : slabdata    632    632      0
osc_extent_kmem   3549831 3551232    168   48    2 : tunables    0    0    0 : slabdata  73984  73984      0
osc_thread_kmem     14012   14116   2832   11    8 : tunables    0    0    0 : slabdata   1286   1286      0
osc_object_kmem   3546640 3548350    304   53    4 : tunables    0    0    0 : slabdata  66950  66950      0
signal_cache      3702537 3707144   1152   28    8 : tunables    0    0    0 : slabdata 132398 132398      0

/proc/meminfo:
MemAvailable:   114196044 kB
Slab:   11641808 kB
SReclaimable:1410732 kB
SUnreclaim: 10231076 kB

After executing

echo 1 >/proc/sys/vm/drop_caches

the slabinfo values don't change but when i actually generate memory pressure 
by:

java -Xmx117G -Xms117G -XX:+AlwaysPreTouch -version

lots of memory gets freed:
vvp_object_kmem   127650 127880   

Re: [lustre-discuss] EINVAL error when writing to a PFL file (lustre 2.12.0)

2019-03-29 Thread Patrick Farrell
Hmm.  I think because users can append to any file at any time, and also append 
to a file then write to it normally, we might override the user's preferred 
layout for a file where appending is just a small part of the plan.  (And of 
course since we can’t control when users do append, it doesn’t let us handle 
the locking differently.)

From: lustre-discuss  on behalf of 
Degremont, Aurelien 
Sent: Friday, March 29, 2019 4:53:42 AM
To: Andreas Dilger; LEIBOVICI Thomas
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] EINVAL error when writing to a PFL file (lustre 
2.12.0)




Another thought I just had while re-reading LU-9341 is whether it would be 
better to have the MDS always create files opened with O_APPEND with 
stripe_count=1?  There is no write parallelism for O_APPEND files, so having 
multiple stripes doesn't help the writer.  Because the writer also always locks 
the whole file [0,EOF] then there is no read-write parallelism either, so 
creating only a single file stripe simplifies things significantly with no real 
loss.

Having several stripes is still useful if you want to distribute space usage 
among several OSTs, and it could also help if you have multiple readers for 
this file later.



Aurélien



Cheers, Andreas

On Feb 22, 2019, at 10:09, LEIBOVICI Thomas  wrote:
>
> Hello Patrick,
>
> Thank you for the quick reply.
> No, I have no particular use-case in mind, I'm just playing around with 
PFL.
>
> If this is currently not properly supported, a quick fix could be to 
prevent the user from creating such incomplete layouts?
>
> Regards,
> Thomas
    >
    > On 2/22/19 5:33 PM, Patrick Farrell wrote:
>> Thomas,
>>
>> This is expected, but it's also something we'd like to fix - See LU-9341.
>>
>> Basically, append tries to instantiate the layout from 0 to infinity, 
and it fails because your layout is incomplete (ie doesn't go to infinity).
>>
>> May I ask why you're creating a file with an incomplete layout?  Do you 
have a use case in mind?
>>
>> - Patrick
>> From: lustre-discuss  on behalf 
of LEIBOVICI Thomas 
>> Sent: Friday, February 22, 2019 10:27:48 AM
>> To: lustre-discuss@lists.lustre.org
>> Subject: [lustre-discuss] EINVAL error when writing to a PFL file 
(lustre 2.12.0)
>>
>> Hello,
>>
>> Is it expected to get an error when appending a PFL file made of 2
>> regions [0 - 1M] and [1M to 6M]
>> even if writing in this range?
>>
>> I get an error when appending it, even when writing in the very first
>> bytes:
>>
>> [root@vm0]# lfs setstripe  -E 1M -c 1 -E 6M -c 2 /mnt/lustre/m_fou3
>>
>> [root@vm0]# lfs getstripe /mnt/lustre/m_fou3
>> /mnt/lustre/m_fou3
>>lcm_layout_gen:2
>>lcm_mirror_count:  1
>>lcm_entry_count:   2
>>  lcme_id: 1
>>  lcme_mirror_id:  0
>>  lcme_flags:  init
>>  lcme_extent.e_start: 0
>>  lcme_extent.e_end:   1048576
>>lmm_stripe_count:  1
>>lmm_stripe_size:   1048576
>>lmm_pattern:   raid0
>>lmm_layout_gen:0
>>lmm_stripe_offset: 3
>>lmm_objects:
>>- 0: { l_ost_idx: 3, l_fid: [0x10003:0x9cf:0x0] }
>>
>>  lcme_id: 2
>>  lcme_mirror_id:  0
>>  lcme_flags:  0
>>  lcme_extent.e_start: 1048576
>>  lcme_extent.e_end:   6291456
>>lmm_stripe_count:  2
>>lmm_stripe_size:   1048576
>>lmm_pattern:   raid0
>>lmm_layout_gen:0
>>lmm_stripe_offset: -1
>>
>> [root@vm0]# stat -c %s /mnt/lustre/m_fou3
>> 14
>>
>> * append fails:
>>
>> [root@vm0]# echo qsdkjqslkdjkj >> /mnt/lustre/m_fou3
>> bash: echo: write error: Invalid argument
>>
>> # strace indicates that write() gets the error:
>>
>> write(1, "qsdkjqslkdjkj\n", 14) = -1 EINVAL (Invalid argument)
>>
>> * no error in case of an open/truncate:
>>
>> [root@vm0]# echo qsdkjqslkdjkj > /mnt/lustre/m_fou3
>>
>> OK
>>
>> Is it expected or should I open a ticket?
>>
>> Thomas
>>
>> ___
>> lustre-discuss mailing list
>> lustre-disc

Re: [lustre-discuss] Compiling lustre-2.10.6

2019-03-12 Thread Patrick Farrell
Hsieh,


We have instructions for compiling from source here on our Wiki:

https://wiki.whamcloud.com/display/PUB/Building+Lustre+from+Source


Are you following those?  If not, I'd suggest it - Your problem looks likely to 
be an error in the build process.
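For reference, the basic flow from that page is roughly (exact configure options depend on your kernel and on whether you want servers or just clients):

sh autogen.sh
./configure --with-linux=/usr/src/kernels/$(uname -r)
make rpms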


We also have prebuilt 2.10.6 packages for many platforms:
https://downloads.whamcloud.com/public/lustre/


- Patrick




From: lustre-discuss  on behalf of 
Tung-Han Hsieh 
Sent: Tuesday, March 12, 2019 9:22:57 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Compiling lustre-2.10.6

Dear All,

I am trying to compile lustre-2.10.6 from source code. During
compilation, there are undefined symbols:

==
  CC [M]  
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.o
In file included from 
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c:72:0:
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h: 
In function 'ldiskfs_get_htree_eof':
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1254:10:
 error: 'LDISKFS_HTREE_EOF_32BIT' undeclared (first use in this function)
   return LDISKFS_HTREE_EOF_32BIT;
  ^
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1254:10:
 note: each undeclared identifier is reported only once for each function it 
appears in
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1256:10:
 error: 'LDISKFS_HTREE_EOF_64BIT' undeclared (first use in this function)
   return LDISKFS_HTREE_EOF_64BIT;
  ^
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c: 
In function 'osd_check_lmv':
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c:968:19:
 error: 'LDISKFS_HTREE_EOF_64BIT' undeclared (first use in this function)
filp->f_pos != LDISKFS_HTREE_EOF_64BIT);
   ^
In file included from 
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.c:72:0:
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h: 
In function 'ldiskfs_get_htree_eof':
/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_internal.h:1257:1:
 error: control reaches end of non-void function [-Werror=return-type]
 }
 ^
cc1: all warnings being treated as errors
scripts/Makefile.build:321: recipe for target 
'/home/thhsieh/lustre/L-2.10.6/lustre-2.10.6/lustre/osd-ldiskfs/osd_handler.o' 
failed
==

I searched the source code and only found the definition of LDISKFS_HTREE_EOF
in ldiskfs/ldiskfs.h:

#define LDISKFS_HTREE_EOF   0x7fff

but cannot find the definition of LDISKFS_HTREE_EOF_32BIT and
LDISKFS_HTREE_EOF_64BIT. Could anyone tell me how to fix this problem ?

Thanks very much.

T.H.Hsieh
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.12.0 and locking problems

2019-03-05 Thread Patrick Farrell
Riccardo,

Since 2.12 is still a relatively new maintenance release, it would be helpful 
if you could open an LU and provide more detail there - Such as what clients 
were doing, if you were using any new features (like DoM or FLR), and full 
dmesg from the clients and servers involved in these evictions.

- Patrick

On 3/5/19, 11:50 AM, "lustre-discuss on behalf of Riccardo Veraldi" 
 wrote:

Hello,

I have quite a big issue on my Lustre 2.12.0 MDS/MDT.

Clients moving data to the OSS run into a locking problem I have never seen 
before.

The clients are mostly 2.10.5, except for one which is 2.12.0, but 
regardless of the client version the problem is still there.

So these are the errors I see on the MDS/MDT. When this happens 
everything just hangs. If I reboot the MDS everything goes back to 
normal, but it has already happened twice in 3 days and it is disruptive.

Any hints ?

Is it feasible to downgrade from 2.12.0 to 2.10.6 ?

thanks

Mar  5 11:10:33 psmdsana1501 kernel: Lustre: 
7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has 
failed due to network error: [sent 1551813033/real 1551813033] 
req@9fdcbecd0300 x1626845000210688/t0(0) 
o104->ana15-MDT@172.21.52.87@o2ib:15/16 lens 296/224 e 0 to 1 dl 
1551813044 ref 1 fl Rpc:eX/0/ rc 0/-1
Mar  5 11:10:33 psmdsana1501 kernel: Lustre: 
7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 50552576 
previous similar messages
Mar  5 11:13:03 psmdsana1501 kernel: LustreError: 
7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 
172.21.52.87@o2ib) failed to reply to blocking AST (req@9fdcbecd0300 
x1626845000210688 status 0 rc -110), evict it ns: mdt-ana15-MDT_UUID 
lock: 9fde9b6873c0/0x9824623d2148ef38 lrc: 4/0,0 mode: PR/PR res: 
[0x213a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 5 type: IBT flags: 
0x6020040020 nid: 172.21.52.87@o2ib remote: 0xd8efecd6e7621e63 
expref: 8 pid: 7898 timeout: 333081 lvb_type: 0
Mar  5 11:13:03 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT: 
A client on nid 172.21.52.87@o2ib was evicted due to a lock blocking 
callback time out: rc -110
Mar  5 11:13:03 psmdsana1501 kernel: LustreError: 
5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer 
expired after 150s: evicting client at 172.21.52.87@o2ib ns: 
mdt-ana15-MDT_UUID lock: 9fde9b6873c0/0x9824623d2148ef38 lrc: 
3/0,0 mode: PR/PR res: [0x213a9:0x1d347:0x0].0x0 bits 0x13/0x0 rrc: 
5 type: IBT flags: 0x6020040020 nid: 172.21.52.87@o2ib remote: 
0xd8efecd6e7621e63 expref: 9 pid: 7898 timeout: 0 lvb_type: 0
Mar  5 11:13:04 psmdsana1501 kernel: Lustre: ana15-MDT: Connection 
restored to 59c5a826-f4e9-0dd0-8d4f-08c204f25941 (at 172.21.52.87@o2ib)
Mar  5 11:15:34 psmdsana1501 kernel: LustreError: 
7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 
172.21.52.142@o2ib) failed to reply to blocking AST 
(req@9fde2d393600 x1626845000213776 status 0 rc -110), evict it ns: 
mdt-ana15-MDT_UUID lock: 9fde9b6858c0/0x9824623d2148efee lrc: 
4/0,0 mode: PR/PR res: [0x213ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 
type: IBT flags: 0x6020040020 nid: 172.21.52.142@o2ib remote: 
0xbb35541ea6663082 expref: 9 pid: 7898 timeout: 333232 lvb_type: 0
Mar  5 11:15:34 psmdsana1501 kernel: LustreError: 138-a: ana15-MDT: 
A client on nid 172.21.52.142@o2ib was evicted due to a lock blocking 
callback time out: rc -110
Mar  5 11:15:34 psmdsana1501 kernel: LustreError: 
5321:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer 
expired after 151s: evicting client at 172.21.52.142@o2ib ns: 
mdt-ana15-MDT_UUID lock: 9fde9b6858c0/0x9824623d2148efee lrc: 
3/0,0 mode: PR/PR res: [0x213ac:0x1:0x0].0x0 bits 0x13/0x0 rrc: 3 
type: IBT flags: 0x6020040020 nid: 172.21.52.142@o2ib remote: 
0xbb35541ea6663082 expref: 10 pid: 7898 timeout: 0 lvb_type: 0
Mar  5 11:15:34 psmdsana1501 kernel: Lustre: ana15-MDT: Connection 
restored to 9d49a115-646b-c006-fd85-000a4b90019a (at 172.21.52.142@o2ib)
Mar  5 11:20:33 psmdsana1501 kernel: Lustre: 
7898:0:(client.c:2132:ptlrpc_expire_one_request()) @@@ Request sent has 
failed due to network error: [sent 1551813633/real 1551813633] 
req@9fdcc2a95100 x1626845000222624/t0(0) 
o104->ana15-MDT@172.21.52.87@o2ib:15/16 lens 296/224 e 0 to 1 dl 
1551813644 ref 1 fl Rpc:eX/2/ rc 0/-1
Mar  5 11:20:33 psmdsana1501 kernel: Lustre: 
7898:0:(client.c:2132:ptlrpc_expire_one_request()) Skipped 23570550 
previous similar messages
Mar  5 11:22:46 psmdsana1501 kernel: LustreError: 
7898:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 
172.21.52.87@o2ib) failed to reply to blocking 

Re: [lustre-discuss] Data migration from one OST to another

2019-03-03 Thread Patrick Farrell
Hsieh,

This sounds similar to a bug with pre-2.5 servers and 2.7 (or newer) clients.  
The client and server have a disagreement about which does the delete, and the 
delete doesn’t happen.  Since you’re running 2.5, I don’t think you should see 
this, but the symptoms are the same.   You can temporarily fix things by 
restarting/remounting your OST(s), which will trigger orphan cleanup.  But if 
that works, the only long term fix is to upgrade your servers to a version that 
is expected to work with your clients.  (The 2.10 maintenance release is nice 
if you are not interested in the newest features, otherwise, 2.12 is also an 
option.)
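(If you do get onto newer versions, note that the copy/rename dance isn't needed any more; something along the lines of

lfs find /work --obd chome-OST0028_UUID -type f | lfs_migrate -y

reusing the paths from your example, will rewrite the objects onto other OSTs and free the space on the source OST.  I'd still test it on a scratch directory first.)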

I would also recommend where possible that you keep clients and servers in sync 
- we do interop testing, but same version on both is much more widely used.

- Patrick

From: lustre-discuss  on behalf of 
Tung-Han Hsieh 
Sent: Sunday, March 3, 2019 4:00:17 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Data migration from one OST to another

Dear All,

We have a problem with data migration from one OST to another.

We have installed Lustre-2.5.3 on the MDS and OSS servers, and Lustre-2.8
on the clients. We want to migrate some data from one OST to another in
order to re-balance the data occupation among OSTs. In the beginning we
followed the old method (i.e., the method found in the Lustre-1.8.X manuals) for
the data migration. Suppose we have two OSTs:

root@client# /opt/lustre/bin/lfs df
UUID                  1K-blocks        Used   Available Use% Mounted on
chome-OST0028_UUID   7692938224  7246709148    55450156  99% /work[OST:40]
chome-OST002a_UUID  14640306852  7094037956  6813847024  51% /work[OST:42]

and we want to migrate data from chome-OST0028_UUID to chome-OST002a_UUID.
Our procedures are:

1. We deactivate chome-OST0028_UUID:
   root@mds# echo 0 > /opt/lustre/fs/osc/chome-OST0028-osc-MDT/active

2. We find all files located in chome-OST0028_UUID:
   root@client# /opt/lustre/bin/lfs find --obd chome-OST0028_UUID /work > list

3. In each file listed in the file "list", we did:

cp -a <file> <file>.tmp
mv <file>.tmp <file>

During the migration, we really saw more and more data being written into
chome-OST002a_UUID. But we did not see any disk space released in chome-OST0028_UUID.
In Lustre-1.8.X, doing it this way we did see that chome-OST002a_UUID had
more data coming in, and chome-OST0028_UUID had more and more free space.

It looks like the data files referenced by the MDT have been copied to
chome-OST002a_UUID, but the junk still remains in chome-OST0028_UUID.
Even though we activated chome-OST0028_UUID after the migration, the situation
is still the same:

root@mds# echo 1 > /opt/lustre/fs/osc/chome-OST0028-osc-MDT/active

Is there any way to cure this problem ?


Thanks very much.

T.H.Hsieh
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client 2.11.0 with Lustre server lustre-2.12.0-ib

2019-03-02 Thread Patrick Farrell
Parag,

I would be interested to know more about the application compatibility issues, 
but you should be OK with 2.11 clients and a 2.12 server.  In general, newer 
clients are tested with older servers more than the other way around, but 
especially with adjacent major releases you should be fine.

- Patrick

From: lustre-discuss  on behalf of 
Parag Khuraswar 
Sent: Saturday, March 2, 2019 8:07:32 AM
To: 'Lustre discussion'
Subject: [lustre-discuss] Lustre client 2.11.0 with Lustre server 
lustre-2.12.0-ib


Hi,



Due to some application compatibility issues I have to use “lustre-2.12.0-ib” 
on Lustre server side and “Lustre client 2.11.0” on Lustre client side.

Is there any issue if I use two different Lustre versions on the Lustre server & 
client side?



Regards,

Parag




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Draining and replacing OSTs with larger volumes

2019-02-28 Thread Patrick Farrell
Scott,


This sounds great.  Slower, but safer.  You might want to integrate the pool 
suggestion Jongwoo made in the other recent thread in order to control 
allocations to your new OSTs (assuming you're trying to stay live during most 
of this).


- Patrick


From: Scott Wood 
Sent: Thursday, February 28, 2019 6:15:54 PM
To: Patrick Farrell; Jongwoo Han
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Draining and replacing OSTs with larger volumes

My Thanks to both Jongwoo and Patrick for your responses.

Great advice to do a practice run in a virtual environment but I'm lucky enough 
to have a physical one. I have a testbed that has the same versions of all 
software but with iscsi targets as the OSTs, rather than physical arrays, and 
not so many OSTs (8 in the testbed and 60 in production).  I do use it for test 
upgrades and fully intend to do a dry run there.

Jongwoo, to address your point, yes the rolling migration is forced, as we only 
have two new arrays, and 10 existing arrays which we can upgrade the drives in. 
 You asked about OST sizes.  OSTs are 29TB, six per array, two arrays per OSS 
pair, 5 OSS pairs.  I also expect the migrate-replace-migrate-replace to be 
painfully slow, but with the hardware at hand, it's the only option.  I was 
figuring they may take a few weeks to drain each pair of arrays.  As for the 
rolling upgrade, based on yours and Patrick's responses, we'll skip that to 
keep things cleaner.

Taking your points in to consideration, the amended plan will be:

1) Deploy a new HA pair of OSSs with arrays populated with OSTs that are twice 
the size of our current ones, but stick with the existing v2.10.3
2) Remove the 12 OSTs that are connected to my oldest HA pair of OSSs as 
described in 14.9.3, using 12 parallel migrate processes across 12 clients
3) Repopulate those arrays with the larger drives and make new 12 OSTs from 
scratch, with fresh indices, and bring them online
4) Repeat steps 2 and 3 for the four remaining original HA pairs of OSSs
5) Take a break and let the dust settle
6) At a later date, have a scheduled outage and upgrade from 2.10.3 to whatever 
the current maintenance release is

Again, you feedback is appreciated.

Cheers
Scott

From: Patrick Farrell 
Sent: Thursday, 28 February 2019 11:06 PM
To: Jongwoo Han; Scott Wood
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Draining and replacing OSTs with larger volumes

Scott,

I’d like to strongly second all of Jongwoo’s advice, particularly that about 
adding new OSTs rather than replacing existing ones, if possible.  That 
procedure is so much simpler and involves a lot less messing around “under the 
hood”.  It takes you from a complex procedure with many steps to, essentially, 
copying a bunch of data around while your file system remains up, and adding 
and removing a few OSTs at either end.

It would also be non-destructive for your existing data.  One of the scary 
things about the original proposed process is that if something goes wrong 
partway through, the original data is already gone (or at least very hard to 
get).

Regards,
- Patrick

From: lustre-discuss  on behalf of 
Jongwoo Han 
Sent: Thursday, February 28, 2019 5:36:54 AM
To: Scott Wood
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Draining and replacing OSTs with larger volumes



On Thu, Feb 28, 2019 at 11:09 AM Scott Wood 
mailto:woodystr...@hotmail.com>> wrote:
Hi folks,

Big upgrade process in the works and I had some questions.  Our current 
infrastructure has 5 HA pairs of OSSs and arrays with an HA pair of management 
and metadata servers who also share an array, all running lustre 2.10.3.  
Pretty standard stuff.  Our upgrade plan is as follows:

1) Deploy a new HA pair of OSSs with arrays populated with OSTs that are twice 
the size of our originals.
2) Follow the process in section 14.9 of the lustre docs to drain all OSTs in 
one of existing the HA pairs' arrays
3) Repopulate the first old pair of deactivated and drained arrays with new 
larger drives
4) Upgrade the offline OSSs from 2.10.3 to 2.10.latest?
5) Return them to service
6) Repeat steps 2-4 for the other 4 old HA pairs of OSSs and OSTs

I'd expect this would be doable without downtime as we'd only be taking arrays 
offline that have no objects on them, and we've added new arrays and OSSs 
before with no issues.  I have a few questions before we begin the process:

1) My interpretation of the docs is that we are OK to install them with 2.10.6 (or 
2.10.7, if it's out), as rolling upgrades within X.Y are supported.  Is that 
correct?

In theory, a rolling upgrade should work, but the generally recommended upgrade 
procedure is to stop the filesystem and unmount all MDSs and OSSs, upgrade the 
packages and bring them back up. This will prevent human errors during the repeated 
per-server upgrades.
When it is done correctly, it will take not more than 2 hours.

Re: [lustre-discuss] Suspended jobs and rebooting lustre servers

2019-02-28 Thread Patrick Farrell
This is very good advice, and you can also vary it to aid in removing old OSTs 
(thinking of the previous message) - simply take the old ones you wish to 
remove out of the pool, then new files will not be created there.  Makes 
migration easier.
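As a rough sketch (fsname and pool name made up here), the pool manipulation is just:

lctl pool_new myfs.keep                        # on the MGS
lctl pool_add myfs.keep myfs-OST[0000-0009]    # the OSTs new files should land on
lfs setstripe -p keep /mnt/myfs                # default layout on the fs root
lctl pool_remove myfs.keep myfs-OST0000        # later, stop allocations on an OST you are draining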

One thing though:
Setting a default layout everywhere may be prohibitively slow for a large fs.

If you set a default layout on the root of the file system, it is automatically 
used as the default for all directories that do not have another default set.

So if you have not previously set a default layout on any directories, there is 
no need to go through the fs changing them like this.  (And perhaps if you 
have, you can find those directories and handle them manually, rather than 
setting a pool on every directory.)

- Patrick

From: lustre-discuss  on behalf of 
Jongwoo Han 
Sent: Thursday, February 28, 2019 6:09:18 AM
To: Stephane Thiell
Cc: lustre-discuss
Subject: Re: [lustre-discuss] Suspended jobs and rebooting lustre servers


My strategy for adding new OSTs on live filesystem is to define a pool with 
currently running OST and apply pool stripe (lfs setstripe -p [live-ost-pool]) 
on all existing directories. It is better when it is done at first filesystem 
creation.

After that, you can safely add new OSTs without newly created files flooding in 
- the newly added OSTs will remain unused until you add them to the pool.

Try failover tests with the new OSTs and OSSes while they do not yet store files. After 
the failover/restart test is done on the new OSS and OSTs, you can add the new OSTs to 
the pool and they will start to store files shortly after.

If you did not create a pool, create a pool with the old OSTs and

# lfs find <mountpoint> -type d | while read DIR ; do echo "processing :" $DIR; 
lfs setstripe -p <pool> $DIR ; done

will mark all subdirectories on the pool, so newly added OSTs are safe from 
files coming in until these new OSTs are added to the pool.

I always expand live filesystems in this manner, so as not to worry about heavily 
loaded situations.

On Thu, Feb 28, 2019 at 1:02 AM Stephane Thiell 
mailto:sthi...@stanford.edu>> wrote:
On one of our filesystems, we add a few new OSTs almost every month with no 
downtime, which is very convenient. The only thing that I would recommend is to 
avoid doing that during a peak of I/Os on your filesystem (we usually do it as 
early as possible in the morning), as the added OSTs will immediately see a heavy 
I/O load, likely because they are empty.

Best,

Stephane


> On Feb 22, 2019, at 2:03 PM, Andreas Dilger 
> mailto:adil...@whamcloud.com>> wrote:
>
> This is not really correct.
>
> Lustre clients can handle the addition of OSTs to a running filesystem. The 
> MGS will register the new OSTs, and the clients will be notified by the MGS 
> that the OSTs have been added, so no need to unmount the clients during this 
> process.
>
>
> Cheers, Andreas
>
> On Feb 21, 2019, at 19:23, Raj 
> mailto:rajgau...@gmail.com>> wrote:
>
>> Hello Raj,
>> It’s best and safe to unmount from all the clients and then do the upgrade. 
>> Your FS is getting more OSTs and changing conf in the existing ones, your 
>> client needs to get the new layout by remounting it.
>> Also you mentioned about client eviction, during eviction the client has to 
>> drop it’s dirty pages and all the open file descriptors in the FS will be 
>> gone.
>>
>> On Thu, Feb 21, 2019 at 12:25 PM Raj Ayyampalayam 
>> mailto:ans...@gmail.com>> wrote:
>> What can I expect to happen to the jobs that are suspended during the file 
>> system restart?
>> Will the processes holding an open file handle die when I unsuspend them 
>> after the filesystem restart?
>>
>> Thanks!
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 12:52 PM Colin Faber 
>> mailto:cfa...@gmail.com>> wrote:
>> Ah yes,
>>
>> If you're adding to an existing OSS, then you will need to reconfigure the 
>> file system which requires writeconf event.
>>
>> On Thu, Feb 21, 2019 at 10:00 AM Raj Ayyampalayam 
>> mailto:ans...@gmail.com>> wrote:
>> The new OST's will be added to the existing file system (the OSS nodes are 
>> already part of the filesystem), I will have to re-configure the current HA 
>> resource configuration to tell it about the 4 new OST's.
>> Our exascaler's HA monitors the individual OST and I need to re-configure 
>> the HA on the existing filesystem.
>>
>> Our vendor support has confirmed that we would have to restart the 
>> filesystem if we want to regenerate the HA configs to include the new OST's.
>>
>> Thanks,
>> -Raj
>>
>>
>> On Thu, Feb 21, 2019 at 11:23 AM Colin Faber 
>> mailto:cfa...@gmail.com>> wrote:
>> It seems to me that steps may still be missing?
>>
>> You're going to rack/stack and provision the OSS nodes with new OSTs'.
>>
>> Then you're going to introduce failover options somewhere? new osts? 
>> existing system? etc?
>>
>> If you're introducing failover with the new OST's and leaving the existing 
>> system in place, you should be able to accomplish this 

Re: [lustre-discuss] Draining and replacing OSTs with larger volumes

2019-02-28 Thread Patrick Farrell
Scott,

I’d like to strongly second all of Jongwoo’s advice, particularly that about 
adding new OSTs rather than replacing existing ones, if possible.  That 
procedure is so much simpler and involves a lot less messing around “under the 
hood”.  It takes you from a complex procedure with many steps to, essentially, 
copying a bunch of data around while your file system remains up, and adding 
and removing a few OSTs at either end.

It would also be non-destructive for your existing data.  One of the scary 
things about the original proposed process is that if something goes wrong 
partway through, the original data is already gone (or at least very hard to 
get).

Regards,
- Patrick

From: lustre-discuss  on behalf of 
Jongwoo Han 
Sent: Thursday, February 28, 2019 5:36:54 AM
To: Scott Wood
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Draining and replacing OSTs with larger volumes



On Thu, Feb 28, 2019 at 11:09 AM Scott Wood 
mailto:woodystr...@hotmail.com>> wrote:
Hi folks,

Big upgrade process in the works and I had some questions.  Our current 
infrastructure has 5 HA pairs of OSSs and arrays with an HA pair of management 
and metadata servers who also share an array, all running lustre 2.10.3.  
Pretty standard stuff.  Our upgrade plan is as follows:

1) Deploy a new HA pair of OSSs with arrays populated with OSTs that are twice 
the size of our originals.
2) Follow the process in section 14.9 of the lustre docs to drain all OSTs in 
one of existing the HA pairs' arrays
3) Repopulate the first old pair of deactivated and drained arrays with new 
larger drives
4) Upgrade the offline OSSs from 2.10.3 to 2.10.latest?
5) Return them to service
6) Repeat steps 2-4 for the other 4 old HA pairs of OSSs and OSTs

I'd expect this would be doable without downtime as we'd only be taking arrays 
offline that have no objects on them, and we've added new arrays and OSSs 
before with no issues.  I have a few questions before we begin the process:

1) My interpretation of the docs is that we are OK to install them with 2.10.6 (or 
2.10.7, if it's out), as rolling upgrades within X.Y are supported.  Is that 
correct?

In theory, a rolling upgrade should work, but the generally recommended upgrade 
procedure is to stop the filesystem and unmount all MDSs and OSSs, upgrade the 
packages and bring them back up. This will prevent human errors during the repeated 
per-server upgrades.
When it is done correctly, it will take not more than 2 hours.

2) Until the whole process is complete, we'll have imbalanced OSTs.  I know 
that's not ideal, but is it all that big an issue?

A rolling upgrade will cause imbalance, but over a long run the files will be 
evenly distributed. No need to worry about it in a one-shot 
upgrade scenario.

3) When draining the OSTs of files, section 14.9.3, point 2.a. states that the 
lfs find | lfs migrate pipeline can take multiple OSTs as args, but I thought it would be 
better to run one instance of that per OST and distribute them across multiple 
clients.  Is that reasonable (and faster)?

Parallel redistribution is generally faster than one-by-one. If the MDT can 
endure the scanning load, run multiple migrate processes, each against one OST.
4) When the drives are replaced with bigger ones, can the original OST 
configuration files be restored to them as described in Docs section 14.9.5, or 
due to the size mismatch, will that be bad?

Since this process treats objects as files, the configuration should carry over 
the same.

5) What questions should I be asking that I haven't thought of?


I do not know the size of the OSTs to deal with, but I think 
migrate(empty)-replace-migrate-replace is a really painful process as it will 
take a long time. If circumstances allow, I suggest adding all new OST arrays to the 
OSS with new OST numbers, migrating the OST objects, then deactivating and removing the old OSTs.

If that all goes well, and we did upgrade the OSSs to a newer 2.10.x, we'd 
follow it up with a migration of the MGT and MDT to one of the management 
servers, upgrade the other, fail them back, upgrade the second, and rebalance 
the MDT and MGT services back across the two.  We'd expect the usual pause in 
services as those migrate but other than that, fingers crossed, should all be 
good.  Are we missing anything?


If this plan is forced, the rolling migrate and upgrade should be planned 
carefully. It would be better to set up a correct procedure checklist by 
practicing in a virtual environment with identical versions.

Cheers
Scott
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
Jongwoo Han
+82-505-227-6108
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Which release to use?

2019-02-22 Thread Patrick Farrell
Please file bugs if anything does!

- Patrick

From: lustre-discuss  on behalf of 
Nathan R Crawford 
Sent: Friday, February 22, 2019 7:04:24 PM
To: Peter Jones
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Which release to use?

Thanks for the link! I'll spin up a test system with 2.12.0 and see what breaks.
-Nate

On Fri, Feb 22, 2019 at 12:20 PM Peter Jones 
mailto:pjo...@whamcloud.com>> wrote:

Nathan



Yes 2.12 is an LTS branch. We’re planning on putting out both 2.10.7 and 2.12.1 
this quarter but have been focusing on the former first to allow for more time 
to receive feedback from early adopters on 2.12.0. You can see the patches that 
will land starting to accumulate here - 
https://review.whamcloud.com/#/q/status:open+project:fs/lustre-release+branch:b2_12
 . I guess what I am trying to say is “be patient” ☺



Peter



From: Nathan R Crawford mailto:nrcra...@uci.edu>>
Reply-To: "nathan.crawf...@uci.edu" 
mailto:nathan.crawf...@uci.edu>>
Date: Friday, February 22, 2019 at 11:31 AM
To: Peter Jones mailto:pjo...@whamcloud.com>>
Cc: "lustre-discuss@lists.lustre.org" 
mailto:lustre-discuss@lists.lustre.org>>
Subject: Re: [lustre-discuss] Which release to use?



Hi Peter,



  Somewhat related: where should we be looking for the commits leading up to 
2.12.1? The b2_12 branch 
(https://git.whamcloud.com/?p=fs/lustre-release.git;a=shortlog;h=refs/heads/b2_12)
 has no activity since 2.12.0 was released. I assumed that if 2.12 is an LTS 
branch like 2.10, there would be something by now. Commits started appearing on 
b2_10 after a week or so.



  "Be patient" is an acceptable response :)



-Nate



On Fri, Feb 22, 2019 at 10:51 AM Peter Jones 
mailto:pjo...@whamcloud.com>> wrote:

2.12.0 is relatively new. It does have some improvements over 2.10.x (notably 
Data on MDT) but if those are not an immediate requirement then using 2.10.6 
would be a proven and more conservative option. 2.12.51 is an interim 
development build for 2.13 and should absolutely not be used for production 
purposes.

On 2019-02-22, 10:07 AM, "lustre-discuss on behalf of Bernd Melchers" 
mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of melch...@zedat.fu-berlin.de> 
wrote:

Hi all,
in the git repository I find v2.10.6, v2.12.0 and v2.12.51. Which version should
I compile and use for my production CentOS 7.6 system?

Kind regards
Bernd Melchers

--
Archiv- und Backup-Service | 
fab-serv...@zedat.fu-berlin.de
Freie Universität Berlin   | Tel. +49-30-838-55905
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org





--

Dr. Nathan Crawford  
nathan.crawf...@uci.edu
Modeling Facility Director
Department of Chemistry
1102 Natural Sciences II Office: 2101 Natural Sciences II
University of California, Irvine  Phone: 949-824-4508
Irvine, CA 92697-2025, USA
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] EINVAL error when writing to a PFL file (lustre 2.12.0)

2019-02-22 Thread Patrick Farrell
Thomas,


Yeah, that's possible, though I don't think people are regularly running into 
this.  FWIW, I think the intention was to allow extension of layouts component 
by component.  An incomplete layout can have components added to it.
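For example (using the file from Thomas' report), the missing tail component could be added later with something like:

lfs setstripe --component-add -E -1 -c 4 /mnt/lustre/m_fou3

after which the layout reaches EOF and append should work again.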


I can conceive of someone using an incomplete layout to enforce a file size 
limit, but that's about it.


- Patrick


From: LEIBOVICI Thomas 
Sent: Friday, February 22, 2019 11:09:03 AM
To: Patrick Farrell; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] EINVAL error when writing to a PFL file (lustre 
2.12.0)

Hello Patrick,

Thank you for the quick reply.
No, I have no particular use-case in mind, I'm just playing around with PFL.

If this is currently not properly supported, a quick fix could be to prevent 
the user from creating such incomplete layouts?

Regards,
Thomas

On 2/22/19 5:33 PM, Patrick Farrell wrote:

Thomas,


This is expected, but it's also something we'd like to fix - See LU-9341.


Basically, append tries to instantiate the layout from 0 to infinity, and it 
fails because your layout is incomplete (ie doesn't go to infinity).

May I ask why you're creating a file with an incomplete layout?  Do you have a 
use case in mind?


- Patrick


From: lustre-discuss 
<mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of LEIBOVICI Thomas 
<mailto:thomas.leibov...@cea.fr>
Sent: Friday, February 22, 2019 10:27:48 AM
To: lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] EINVAL error when writing to a PFL file (lustre 
2.12.0)

Hello,

Is it expected to get an error when appending a PFL file made of 2
regions [0 - 1M] and [1M to 6M]
even if writing in this range?

I get an error when appending it, even when writing in the very first
bytes:

[root@vm0]# lfs setstripe  -E 1M -c 1 -E 6M -c 2 /mnt/lustre/m_fou3

[root@vm0]# lfs getstripe /mnt/lustre/m_fou3
/mnt/lustre/m_fou3
   lcm_layout_gen:2
   lcm_mirror_count:  1
   lcm_entry_count:   2
 lcme_id: 1
 lcme_mirror_id:  0
 lcme_flags:  init
 lcme_extent.e_start: 0
 lcme_extent.e_end:   1048576
   lmm_stripe_count:  1
   lmm_stripe_size:   1048576
   lmm_pattern:   raid0
   lmm_layout_gen:0
   lmm_stripe_offset: 3
   lmm_objects:
   - 0: { l_ost_idx: 3, l_fid: [0x10003:0x9cf:0x0] }

 lcme_id: 2
 lcme_mirror_id:  0
 lcme_flags:  0
 lcme_extent.e_start: 1048576
 lcme_extent.e_end:   6291456
   lmm_stripe_count:  2
   lmm_stripe_size:   1048576
   lmm_pattern:   raid0
   lmm_layout_gen:0
   lmm_stripe_offset: -1

[root@vm0]# stat -c %s /mnt/lustre/m_fou3
14

* append fails:

[root@vm0]# echo qsdkjqslkdjkj >> /mnt/lustre/m_fou3
bash: echo: write error: Invalid argument

# strace indicates that write() gets the error:

write(1, "qsdkjqslkdjkj\n", 14) = -1 EINVAL (Invalid argument)

* no error in case of an open/truncate:

[root@vm0]# echo qsdkjqslkdjkj > /mnt/lustre/m_fou3

OK

Is it expected or should I open a ticket?

Thomas

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Upgrade without losing data from 1.8.6 to 2.5.2 and back if necessary

2019-02-04 Thread Patrick Farrell
Wow, 1.8 is pretty old these days, as is 2.5 (first release 6 years ago!).  I 
hope you're planning on upgrading past 2.5 once you've upgraded to it.  
(Honestly, this is all so old at this point you might consider letting your 
existing system reach EOL on 1.8.x and building a new file system (2.10 is a 
maintenance branch...) & migrating data.)


Upgrade is possible and relatively easy.  Just shut down all your servers, 
upgrade the Lustre version, then start them as normal.  It is a good idea to 
upgrade your clients at the same time, but you can use 1.8.x clients with 2.5.  
Not everything works - quota, notably, will not work right - but it is mostly 
fine.


That's the first step of upgrading, and at this point you can easily roll back 
to 1.8.x.


After that, it's more complicated.  I won't write it all up for you, but the 
manual has information on enabling dirdata and doing lfsck namespace, and also 
on enabling quotas, which is different in 2.5.  Once you enable dirdata (even 
without doing the namespace lfsck, which you should do), you cannot go back to 
1.8.x.  (It is probably not safe to upgrade beyond 2.5.x without enabling 
dirdata, nor to downgrade from > 2.5.x to 1.8.x.  Those *might* work, but I 
wouldn't bet my data on it.)


Section 17.2 in the manual has the relevant info.
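From memory (please double-check against the manual before relying on this), the key steps look roughly like:

tune2fs -O dirdata /dev/<mdt_device>                  # on the unmounted MDT, not reversible
lctl lfsck_start -M <fsname>-MDT0000 -t namespace     # after remounting, to rebuild the namespace info

where <mdt_device> and <fsname> are placeholders for your own setup.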


- Patrick


From: lustre-discuss  on behalf of 
Richard Chang 
Sent: Monday, February 4, 2019 6:38:57 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Upgrade without losing data from 1.8.6 to 2.5.2 and 
back if necessary

Hi,

I currently have Lustre 1.8.6 running without much problem. I would like to 
upgrade to version 2.5.2, but without losing any data.

Is it possible to upgrade from 1.8.6 => 2.5.2 and back to the older version, in 
case of any problems?

I am unable to find any documentation or any pointers for anyone having done 
any such thing. I mean, upgrade, and if facing any problem, revert back to 
older version.

Is it possible at all, without losing any data ?

Thanks & regards,
Rick.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LFS tuning hierarchy question

2019-01-24 Thread Patrick Farrell
Ah, I understand.  Yes, that's correct.  You can also set the value on the MGS 
for that file system with lctl set_param -P mdc.*.max_rpcs_in_flight=32; that 
will apply on the clients.
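Concretely, for the file system you want at 32 (fsname assumed to be "bar" here), either of these works:

lctl set_param -P mdc.bar-*.max_rpcs_in_flight=32            # on the MGS, persistent, pushed to all clients
lctl set_param mdc.bar-MDT0000-mdc-*.max_rpcs_in_flight=32   # on one client, lasts until remount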

How are you checking the value on the server?  There should be no MDC there.  
If it is instead associated with the MDT, then that is, I believe, a maximum 
and not a default.

From: Ms. Megan Larko 
Sent: Thursday, January 24, 2019 8:24:31 PM
To: Lustre User Discussion Mailing List; Patrick Farrell
Subject: [lustre-discuss] LFS tuning hierarchy question

Thank you for the information, Patrick.

On my current Lustre client all Lustre File Systems mounted (called /mnt/foo 
and /mnt/bar in my example) display a connection value for max_rpcs_in_flight = 
8 for both file systems--the /mnt/foo on which the server has 
max_rpcs_in_flight = 8 and also for /mnt/bar on which the Lustre server 
indicates max_rpcs_in_flight = 32.

So using the Lustre 2.7.2 client default behavior all of the Lustre mounts 
viewed on the client are max_rpcs_in_flight = 8.

I am assuming that I will need to set the value for max_rpcs_in_flight to 32 on 
the client and then the client will pick up the 32 where 32 is possible from 
the Lustre File System server and 8 on those Lustre File Systems where the 
servers have not increased the default value for that parameter.

Is this correct?

Cheers,
megan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LFS tuning hierarchy question

2019-01-24 Thread Patrick Farrell
It varies by value.  If the server has a value set (with lctl set_param -P on 
the MGS), it will override the client value.  Otherwise you'll get the default 
value.  (Max pages per RPC is a bit of an exception in that the client and 
server will negotiate to the "highest mutually supported" value for that.)


It's absolutely possible to have different settings for different file systems, 
and you're already doing it if those values are what you're getting from lctl 
get_param.
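E.g., checking them separately (assuming the fsnames match your mount point names foo and bar):

lctl get_param mdc.foo-*.max_rpcs_in_flight
lctl get_param mdc.bar-*.max_rpcs_in_flight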


From: lustre-discuss  on behalf of Ms. 
Megan Larko 
Sent: Thursday, January 24, 2019 1:53:19 PM
To: Lustre User Discussion Mailing List
Subject: [lustre-discuss] LFS tuning hierarchy question

Halloo---  People!

I am seeking confirmation of an observed behavior in Lustre.

I have a Lustre client.   This client is running Lustre 2.7.2.  Mounted onto 
this client I have /mnt/foo (Lustre server 2.7.2) and /mnt/bar (lustre 2.10.4).

Servers for /mnt/foo have max_rpcs_in_flight=8  (the default value)
Servers for /mnt/bar have max_rpcs_in_flight=32

On the Lustre client, the command "lctl get_param mdc.*.max_rpcs_in_flight" 
shows both file systems using max_rpcs_in_flight=8.

Is it correct that the client uses the lowest value for a Lustre tunable 
presented from a Lustre file system server?   OR...is it the case that the 
client needs to be tuned so that it may use "up to" the maximum value of the 
mounted file systems if the specific Lustre server supports that value?

Really I am wondering if it is possible to have, in this case, a 
"max_rpcs_in_flight" to be 32 for the /mnt/bar Lustre File System while still 
using a more-limited max_rpcs_in_flight of 8 for /mnt/foo.

TIA,
megan
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ldiskfs performance degradation due to kernel swap hugging cpu

2018-12-28 Thread Patrick Farrell


Abe,

You gave some general info, but unless I missed something, nothing specific to 
show any involvement by swap.  How did you determine that?  Can you share that 
data?  And what performance are you expecting here?

- Patrick

From: lustre-devel  on behalf of Abe 
Asraoui 
Sent: Friday, December 28, 2018 6:42:50 PM
To: Lustre Developement; lustre-discuss@lists.lustre.org; Abe Asraoui
Subject: Re: [lustre-devel] ldiskfs performance degradation due to kernel swap 
hugging cpu



+ lustre-discuss



Hi All,
We are seeing low performance with Lustre 2.11 in an ldiskfs configuration with 
obdfilter-survey; not sure if this is a known issue.

obdfilter-survey performance under ldiskfs is impacted by kernel swap 
hogging CPU usage; the current configuration is as follows:
2 osts: ost1,ost2
/dev/sdc on /mnt/mdt type lustre 
(ro,context=unconfined_u:object_r:user_tmp_t:s0,svname=tempAA-MDT,mgs,osd=osd-ldiskfs,user_xattr,errors=remount-ro)
/dev/sdb on /mnt/ost1 type lustre 
(ro,context=unconfined_u:object_r:user_tmp_t:s0,svname=tempAA-OST0001,mgsnode=10.10.10.168@o2ib,osd=osd-ldiskfs,errors=remount-ro)
/dev/sda on /mnt/ost2 type lustre 
(ro,context=unconfined_u:object_r:user_tmp_t:s0,svname=tempAA-OST0002,mgsnode=10.10.10.168@o2ib,osd=osd-ldiskfs,errors=remount-ro)
[root@oss100 htop-2.2.0]#
[root@oss100 htop-2.2.0]# dkms status
lustre-ldiskfs, 2.11.0, 3.10.0-693.21.1.el7_lustre.x86_64, x86_64: installed
spl, 0.7.6, 3.10.0-693.21.1.el7_lustre.x86_64, x86_64: installed
[root@oss100 htop-2.2.0]#
sh ./obdsurvey-script.sh
Mon Dec 10 17:19:52 PST 2018 Obdfilter-survey for case=disk from oss100
ost 2 sz 51200K rsz 1024K obj 2 thr 2 write 134.52 [ 49.99, 101.96] 
rewrite 132.09 [ 49.99, 78.99] read 2566.74 [ 258.96, 2068.71]
ost 2 sz 51200K rsz 1024K obj 2 thr 4 write 195.73 [ 76.99, 128.98] 
rewrite
root@oss100 htop-2.2.0]# lctl dl
0 UP osd-ldiskfs tempAA-MDT-osd tempAA-MDT-osd_UUID 9
1 UP mgs MGS MGS 4
2 UP mgc MGC10.10.10.168@o2ib 65f231a0-8fd8-001d-6b0f-3e986f914178 4
3 UP mds MDS MDS_uuid 2
4 UP lod tempAA-MDT-mdtlov tempAA-MDT-mdtlov_UUID 3
5 UP mdt tempAA-MDT tempAA-MDT_UUID 8
6 UP mdd tempAA-MDD tempAA-MDD_UUID 3
7 UP qmt tempAA-QMT tempAA-QMT_UUID 3
8 UP lwp tempAA-MDT-lwp-MDT tempAA-MDT-lwp-MDT_UUID 4
9 UP osd-ldiskfs tempAA-OST0001-osd tempAA-OST0001-osd_UUID 4
10 UP ost OSS OSS_uuid 2
11 UP obdfilter tempAA-OST0001 tempAA-OST0001_UUID 5
12 UP lwp tempAA-MDT-lwp-OST0001 tempAA-MDT-lwp-OST0001_UUID 4
13 UP osp tempAA-OST0001-osc-MDT tempAA-MDT-mdtlov_UUID 4
14 UP echo_client tempAA-OST0001_ecc tempAA-OST0001_ecc_UUID 2
15 UP osd-ldiskfs tempAA-OST0002-osd tempAA-OST0002-osd_UUID 4
16 UP obdfilter tempAA-OST0002 tempAA-OST0002_UUID 5
17 UP lwp tempAA-MDT-lwp-OST0002 tempAA-MDT-lwp-OST0002_UUID 4
18 UP osp tempAA-OST0002-osc-MDT tempAA-MDT-mdtlov_UUID 4
19 UP echo_client tempAA-OST0002_ecc tempAA-OST0002_ecc_UUID 2
[root@oss100 htop-2.2.0]#
root@oss100 htop-2.2.0]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 152.8T 0 disk /mnt/ost2
sdb 8:16 0 152.8T 0 disk /mnt/ost1
sdc 8:32 0 931.5G 0 disk /mnt/mdt
sdd 8:48 0 465.8G 0 disk
├─sdd1 8:49 0 200M 0 part /boot/efi
├─sdd2 8:50 0 1G 0 part /boot
└─sdd3 8:51 0 464.6G 0 part
  ├─centos-root 253:0 0 50G 0 lvm /
  ├─centos-swap 253:1 0 4G 0 lvm [SWAP]
  └─centos-home 253:2 0 410.6G 0 lvm /home
nvme0n1 259:2 0 372.6G 0 disk
└─md124 9:124 0 372.6G 0 raid1
nvme1n1 259:0 0 372.6G 0 disk
└─md124 9:124 0 372.6G 0 raid1
nvme2n1 259:3 0 372.6G 0 disk
└─md125 9:125 0 354G 0 raid1
nvme3n1 259:1 0 372.6G 0 disk
└─md125 9:125 0 354G 0 raid1

thanks,
Abe




___
lustre-devel mailing list
lustre-de...@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] LU-8964/pio feature usage

2018-12-23 Thread Patrick Farrell
Good afternoon,

There was a recent discussion on the lustre-devel mailing list in which I 
floated removing the 'pio' feature from Lustre.  This is a client side i/o 
parallelization feature (splitting i/o in kernel space & using multiple worker 
threads) which is off by default and must be enabled manually with a setting in 
proc.

It is not on by default because while it helps performance for some workloads, 
it harms performance for many other common workloads.  I studied making it fit 
to be on by default some time ago (item #1 would be making sure it did not slow 
down any common workloads), and found a fair bit of work to do to make that 
possible.

In the time since then, the work to extend the utility of this feature to reads 
was unsuccessful, and a credible alternate proposal for writes ("write 
containers", basically simplifying the write path dramatically for sequential 
i/o, which should be able to offer similar or larger benefits at a lower cost 
in CPU usage) has come about.

I could go on - But I am really wondering this:
Are there any active users of this feature?  If so, what's your use case and 
what's the benefit you see?  Keep in mind it must be turned on explicitly, so 
if you are not familiar with it, it's pretty unlikely you're using it.
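(For reference, and going from memory, enabling it is a per-client switch along the lines of:

lctl set_param llite.*.pio=1

so if you have never set that, you are almost certainly not using it.)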

For those who'd like to learn more, I'd refer you to the original bug:
https://jira.whamcloud.com/browse/LU-8964

My LAD talk from a few years ago:
https://www.eofs.eu/_media/events/devsummit17/patrick_farrell_laddevsummit_pio.pdf

And the recent mailing list discussion:
http://lists.lustre.org/pipermail/lustre-devel-lustre.org/2018-November/thread.html
(Look for "[PATCH 10/12] lustre: clio: Introduce parallel tasks framework", and for my comments specifically)

  *   Patrick
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] no more free slots in catalog

2018-12-17 Thread Patrick Farrell

Julien,

Could you share the details (LBUG plus full back trace, primarily) with the 
list?  It would be good to know if it’s a known problem or not.

Thanks!


From: lustre-discuss  on behalf of 
Julien Rey 
Sent: Monday, December 17, 2018 3:40:56 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] no more free slots in catalog

On 11/12/2018 15:47, quentin.bou...@cea.fr wrote:
On 11/12/2018 15:32, Julien Rey wrote:
On 11/12/2018 14:13, quentin.bou...@cea.fr wrote:
On 11/12/2018 10:28, Julien Rey wrote:
On 10/12/2018 13:33, quentin.bou...@cea.fr wrote:
On 10/12/2018 12:00, Julien Rey wrote:
Hello,

We are running lustre 2.8.0-RC5--PRISTINE-2.6.32-573.12.1.el6_lustre.x86_64.

Since Thursday we have been getting a "bad address" error when trying to write to the 
Lustre volume.

Looking at the logs on the MDS, we are getting this kind of message:

Dec 10 06:26:18 localhost kernel: Lustre: 
9593:0:(llog_cat.c:93:llog_cat_new_log()) lustre-MDD: there are no more 
free slots in catalog
Dec 10 06:26:18 localhost kernel: Lustre: 
9593:0:(llog_cat.c:93:llog_cat_new_log()) Skipped 45157 previous similar 
messages
Dec 10 06:26:18 localhost kernel: LustreError: 
9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) lustre-MDD: cannot store 
changelog record: type = 6, name = 
'PEPFOLD-00016_bestene1-mc-SC-min-grompp.log', t = [0x2a58f:0x858e:0x0], p 
= [0x2a57d:0x17fd9:0x0]: rc = -28
Dec 10 06:26:18 localhost kernel: LustreError: 
9593:0:(mdd_dir.c:887:mdd_changelog_ns_store()) Skipped 45157 previous similar 
messages


I saw here that this issue was supposed to be solved in 2.8.0:
https://jira.whamcloud.com/browse/LU-6556

Could someone help us unlock this situation?

Thanks.


Hello,

The log messages don't point at a "bad address" issue but rather at a "no space 
left on device" one ("rc = -28" --> -ENOSPC).

You most likely have, at some point, registered a changelog user on your mds 
and that user is not consuming changelogs.

You can check this by running:

[mds0]# lctl get_param mdd.*.changelog_users
mdd.lustre-MDT.changelog_users=
current index: 3
IDindex
cl1   0


The most important thing to look for is the distance between "current index" 
and the index for "cl1", "cl2", ...
I expect for at least one changelog user, that distance is 2^32 (the maximum 
number of changelog records).
Note that changelog indexes wrap around (0, 1, 2, ..., 4294967295, 0, 1, ...).

If I am right, then you can either deregister the changelog user:

[mds0]# lctl --device lustre-MDT changelog_deregister cl1


or acknowledge the records:

[client]# lfs changelog_clear lustre-MDT cl1 0


(clearing with index 0 is a shortcut for "acknowledge every changelog records")

Both those options may take a while.

There is a third one that might yield faster result, but it is also much more 
dangerous to use (you might want to check with your support first) :

[mds0]# umount /dev/mdt0
[mds0]# mount -t ldiskfs /dev/mdt0 /mnt/lustre-mdt0
[mds0]# rm /mnt/lustre-mdt0/changelog_catalog
[mds0]# rm /mnt/lustre-mdt0/changelog_users
[mds0]# umount /dev/mdt0
[mds0]# mount -t lustre /dev/mdt0 <...> # remount the mdt where it was


I cannot garantee this will not trash your filesystem. Use at your own risk.

---

In recent versions (2.12, maybe even 2.10), lustre comes with a builtin garbage 
collector for slow/inactive changelog users.

Regards,
Quentin Bouget

Hello Quentin,

Many thanks for your quick reply.

This is what I got when I issued the command you suggested:


[root@lustre-mds]# lctl get_param mdd.*.changelog_users

mdd.lustre-MDT.changelog_users=

current index: 4160462682

IDindex

cl1   21020582

I then issued the following command:

[root@lustre-mds]# lctl --device lustre-MDT changelog_deregister cl1

It's been running for almost 20 hours now. Do you have an estimation of the 
time it could take ?
When you deregister a changelog user: every changelog record has to be 
invalidated (maybe this is batched, but I don't know enough about the on-disk 
structure to say).

I do not recall ever waiting that long. Then again, I never personally 
deregistered a changelog users with that many pending changelog records.

The changelog_deregister command still hasn't finished yet. Is there any way to 
track the state of the purge of records ?
I believe there is an "llog_reader" implemented in the lustre sources, but I 
never really used it.

If you just want to make sure Lustre is doing something, you can have a look at 
your mdt0: invalidating changelog records should generate a high load of small 
random writes.
If the device is idle, something is probably wrong.

Hard to tell. iostat doesn't show much I/O.

Is your filesystem still unavailable?

The following command doesn't show any registered changelog user:

cat 

Re: [lustre-discuss] How to solve issue when OSS is turned off?

2018-11-11 Thread Patrick Farrell

Default Lustre striping is just straight RAID0, so the data on (say) OST0 is 
not anywhere else.  You can still access data and files on other OSTs, and you 
can create files that live on other OSTs, so I don’t think the MDS is useless.  
But this is the reason for failover - to ensure you can still access your data 
despite this sort of issue.

The FLR feature in Lustre 2.11 does allow mirroring of files, but requires 
manual resyncing of mirrors, so it’s powerful but limited.
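As a rough sketch of what that looks like (file name made up):

lfs mirror extend -N /mnt/lustre/important_file    # add a second copy of an existing file on other OSTs
lfs mirror resync /mnt/lustre/important_file       # bring the mirrors back in sync after the file is written

so it's useful for read-mostly data, but it isn't automatic replication.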

To be honest, if you’re in a situation where you have highly unreliable 
hardware/power, I would say there are other file systems (such as Ceph) that 
will serve you better.  Lustre has significant resiliency capabilities but it 
is designed first for performance and does require failover (and the extra 
setup and cabling it requires).  Systems like Ceph are designed specifically 
with reliability as the first priority, using things like erasure coding to 
provide data availability through disk target failure.  (They can’t match 
Lustre on scalability and high end performance.)


From: lustre-discuss  on behalf of 
shirshak bajgain 
Sent: Sunday, November 11, 2018 7:49:19 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] How to solve issue when OSS is turned off?

We frequently have power cuts and we are in the testing phase. Suppose one OSS is 
powered off. Does that mean we cannot mount anything on the Lustre client? And can 
Lustre not keep working with the other powered-on OSSes?

Like

OSS1 -> OST0 OST1 OST2
OSS2 -> OST3 OST4 OST5
OSS3 -> OST6 OST7 OST8

Is it due to striping, where a file is striped into parts and stored on multiple 
OSTs? So if one OSS fails (without a failover OSS, etc.), does that mean the 
MDT/MGS is useless?

Thanks.


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Usage for lfs setstripe -o ost_indices

2018-11-09 Thread Patrick Farrell
“I am not able to specify -o to an existing file.”
Yes, that’s expected - As with any other setstripe command, you cannot apply it 
to existing files which already have stripe information.  (The exception is 
files created with LOV_DELAY_CREATE or mknod(), which do not have striping 
information until they are written to.)
If you instead use lfs setstripe -o to create a file, that should work.
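
For example (the path and OST indices below are only illustrative):

[client]$ lfs setstripe -o 4,7 /mnt/lustre/newfile
[client]$ lfs getstripe /mnt/lustre/newfile    # should show objects on OSTs 4 and 7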

- Patrick

From: lustre-discuss  on behalf of 
"Ms. Megan Larko" 
Date: Friday, November 9, 2018 at 2:20 PM
To: Lustre User Discussion Mailing List 
Subject: Re: [lustre-discuss] Usage for lfs setstripe -o ost_indices

Responding to A. Dilger (orig e-mail copied below)
I am not sure what overall objective is being pursued in specifically 
identifying which OSTs to write to; it was a question from one of our user 
community.   I am not able to specify -o for an existing file.  I have not tried 
to use liblustreapi to specify the OST layout during the write.
I concur that LU-8417 points out a very significant disadvantage to having 
users employ the -o option to "lfs setstripe" and that using Lustre Pools is a 
better idea for the file system. (I'm speculating that perhaps the users 
themselves want to be able to create such Lustre Pool-like areas and currently 
only sysadmins may create Lustre Pools.  Avoid the middle-man/woman!  Smile!)
Let me get back to my users to better understand what it is that needs to be 
done causing them to wish to invoke the -o option to "lfs setstripe".
Thanks,
megan

A. Dilger wrote:

This is https://jira.whamcloud.com/browse/LU-8417 "setstripe -o does not work 
on directories", which has not been implemented yet.

That said, setting the default striping to specific OSTs on a directory is 
usually not the right thing to do. That will result in OST imbalance.

Equivalent mechanisms include OST pools (which also allow a subset of OSTs to 
be used, something -o currently does not on directories), with the benefit of labeling 
files with the pool name so they are easier to find in the future (e.g. for migrating 
out of the pool).
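
For example, a rough sketch of the pool approach (filesystem, pool, and mount 
point names are illustrative):

[mgs]# lctl pool_new testfs.flash
[mgs]# lctl pool_add testfs.flash testfs-OST[0-3]
[client]$ lfs setstripe -p flash /mnt/testfs/projdir   # new files under projdir allocate only from the pool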

What is the end goal that you are trying to achieve?

Cheers, Andreas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)

2018-10-30 Thread Patrick Farrell
Andreas,

An interesting thought on this, as the same limitation came up recently in 
discussions with a Cray customer.  Strictly honoring the direct I/O 
expectations around data copying is apparently optional.  GPFS is a notable 
example – It allows non page-aligned/page-size direct I/O, but it apparently 
(This is second hand from a GPFS knowledgeable person, so take with a grain of 
salt) uses the buffered path (data copy, page cache, etc) and flushes it, 
O_SYNC style.  My understanding from conversations is this is the general 
approach taken by file systems that support unaligned direct I/O – they cheat a 
little and do buffered I/O in those cases.

So rather than refusing to perform unaligned direct I/O, we could emulate the 
approach taken by (some) other file systems.  There’s no clear standard here, 
but this is an option others have taken that might improve the user experience. 
 (I believe we persuaded our particular user to switch their code away from 
direct I/O, since they had no real reason to be using it.)


- Patrick

From: lustre-discuss  on behalf of 김형근 

Date: Sunday, October 28, 2018 at 11:40 PM
To: Andreas Dilger 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)


The software I use is RedHat Virtualization. When using a POSIX-compatible FS, it 
seems to perform direct I/O with a block size of 256512 bytes.



If I can't resolve the issue with my storage configuration, I will contact 
RedHat.



Your answer was very helpful.

Thank you.

From: Andreas Dilger
To: 김형근
Cc: lustre-discuss@lists.lustre.org
Date: 2018-10-25 16:47:58
Subject: Re: [lustre-discuss] dd oflag=direct error (512 byte Direct I/O)

On Oct 25, 2018, at 15:05, 김형근 wrote:
>
> Hi.
> It's a pleasure to meet you, the lustre specialists.
> (I do not speak English well ... Thank you for your understanding!)

Your english is better than my Korean. :-)

> I used the dd command at the Lustre mount point (using the oflag=direct option).
>
> dd if=/dev/zero of=/mnt/testfile oflag=direct bs=512 count=1
>
>
> I need direct I / O with 512 byte block size.
> This is a required check function on the software I use.

What software is it? Is it possible to change the application to use
4096-byte alignment?

> But unfortunately, If the direct option is present,
> bs must be a multiple of 4K (4096) (for 8K, 12K, 256K, 1M, 8M, etc.) for 
> operation.
> For example, if you enter a value such as 512 or 4095, it will not work. The 
> error message is as follows.
>
> 'error message: dd: error writing [filename]: invalid argument'
>
> My test system is all up to date. (RHEL, lustre-server, client)
> I have used both ldiskfs and zfs for backfile systems. The result is same.
>
>
> My question is simply two.
>
> 1. Why does DirectIO work only in 4k multiples block size?

The client PAGE_SIZE on an x86 system is 4096 bytes. The Lustre client
cannot cache data smaller than PAGE_SIZE, so the current implementation
is limited to have O_DIRECT read/write being a multiple of PAGE_SIZE.

I think the same would happen if you try to use O_DIRECT on a disk with
a 4096-byte native sector size
(https://en.wikipedia.org/w/index.php?title=Advanced_Format&section=5#4K_native )?
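
As a quick illustration on a Lustre client (the path is made up), the 
page-aligned case works while the 512-byte case is rejected:

dd if=/dev/zero of=/mnt/lustre/testfile oflag=direct bs=4096 count=1   # succeeds
dd if=/dev/zero of=/mnt/lustre/testfile oflag=direct bs=512 count=1    # dd: error writing ...: Invalid argument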

> 2. Can I change the settings of the server and client to enable 512bytes of 
> DirectIO?

This would not be possible without changing the Lustre client code.
I don't know how easily this is possible to do and still ensure that
the 512-byte writes are handled correctly.

So far we have not had other requests to change this limitation, so
it is not a high priority to change on our side, especially since
applications will have to deal with 4096-byte sectors in any case.

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud








___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-23 Thread Patrick Farrell

Andreas,

Somewhat worse - no crash is required on the server.  The LDLM locks from that 
target held by the client are destroyed on eviction, which also destroys the 
pages under them.  So any data that is not synced/persistent at the time of 
eviction is lost.

(And, for others reading:)
But that’s required by the purpose of eviction, which is mostly to allow the 
file system to make forward progress in face of a misbehaving client (rather 
than just deadlock forever).  And if your FS and clients are healthy, you 
shouldn’t normally have any evictions.  Notice we’re only talking about this in 
the context of a bug.

- Patrick


From: Andreas Dilger 
Sent: Monday, October 22, 2018 8:55:57 PM
To: Marion Hakanson
Cc: Patrick Farrell; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

On Oct 23, 2018, at 09:25, Marion Hakanson  wrote:
>
> I think Patrick's warning of data loss on a local ZFS filesystem is not
> quite right.  It's a design feature of ZFS that it flushes caches upon
> committing writes before returning a "write complete" back to the
> application.  Data loss can still happen if the storage lies to ZFS
> about having sent the data to stable storage.

Just to clarify, even ZFS on a local node does not avoid data loss if
the file is written only to RAM, and is not sync'd to disk.  That is
true of any filesystem, unless your writes are all O_SYNC (which can
hurt performance significantly), or until NVRAM is used exclusively to
store data.

There is some time after the write() syscall returns to an application
before the filesystem will even _start_ to write to the disk, to allow
it to aggregate data from multiple write() syscalls for efficiency.
Once the data is sent from RAM to disk, the disk should not ack the write
until it is persistent.  If sync() (or variant) is called by userspace,
that should not return until the data is persistent, which is true with
Lustre as well.

What Patrick was referencing is if the server crashes after the client
write() has received the data, but before it is persistent on disk, and
*then* the client is evicted from the server, the data would be lost.
It would still return an error if fsync() is called on the file handle,
but this is often not done by applications.  The same is true if a local
disk disconnects from the node before the data is persistent (e.g. USB
device unplug, cable failure, external RAID enclosure power failure, etc).

Cheers, Andreas

> Anyway, thanks, Andreas and others, for clarifying about the use of
> abort_recovery.  Using it turns out to not have been helpful in our
> situation so far, but this has been a useful discussion about the
> risks of data loss, etc.
>
> Thanks and regards,
>
> Marion
>
>
>> From: Patrick Farrell 
>> To: "Mohr Jr, Richard Frank (Rick Mohr)" , Marion Hakanson
>>   
>> CC: "lustre-discuss@lists.lustre.org" 
>> Subject: Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
>> Date: Fri, 19 Oct 2018 17:36:56 +
>>
>> There is a somewhat hidden danger with eviction: You can get silent data 
>> loss.  The simplest example is buffered (ie, any that aren't direct I/O) 
>> writes - Lustre reports completion (ie your write() syscall completes) once 
>> the data is in the page cache on the client (like any modern file system, 
>> including local ones - you can get silent data loss on EXT4, XFS, ZFS, etc, 
>> if your disk becomes unavailable before data is written out of the page 
>> cache).
>>
>> So if that client with pending dirty data is evicted from the OST the data 
>> is destined for - which is essentially what abort recovery does - that data 
>> is lost, and the application doesn't get an error (because the write() call 
>> has already completed).
>>
>> A message is printed to the console on the client in this case, but you have 
>> to know to look for it.  The application will run to completion, but you may 
>> still experience data loss, and not know it.  It's just something to keep in 
>> mind - applications that run to completion despite evictions may not have 
>> completed *successfully*.
>>
>> - Patrick
>>
>> On 10/19/18, 11:42 AM, "lustre-discuss on behalf of Mohr Jr, Richard Frank 
>> (Rick Mohr)" > rm...@utk.edu> wrote:
>>
>>
>>> On Oct 19, 2018, at 10:42 AM, Marion Hakanson  wrote:
>>>
>>> Thanks for the feedback.  You're both confirming what we've learned so far, 
>>> that we had to unmount all the clients (which required rebooting most of 
>>> them), then reboot all the storage servers, to get things unstuck until the 
>>> problem recurred.
>>>
>>> I tried abort_recovery on the clients last night, before rebooting the MDS, but that did not help.

Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Patrick Farrell
There is a somewhat hidden danger with eviction: You can get silent data loss.  
The simplest example is buffered (ie, any that aren't direct I/O) writes - 
Lustre reports completion (ie your write() syscall completes) once the data is 
in the page cache on the client (like any modern file system, including local 
ones - you can get silent data loss on EXT4, XFS, ZFS, etc, if your disk 
becomes unavailable before data is written out of the page cache).

So if that client with pending dirty data is evicted from the OST the data is 
destined for - which is essentially what abort recovery does - that data is 
lost, and the application doesn't get an error (because the write() call has 
already completed).

A message is printed to the console on the client in this case, but you have to 
know to look for it.  The application will run to completion, but you may still 
experience data loss, and not know it.  It's just something to keep in mind - 
applications that run to completion despite evictions may not have completed 
*successfully*.

- Patrick

On 10/19/18, 11:42 AM, "lustre-discuss on behalf of Mohr Jr, Richard Frank 
(Rick Mohr)"  wrote:


> On Oct 19, 2018, at 10:42 AM, Marion Hakanson  wrote:
> 
> Thanks for the feedback.  You're both confirming what we've learned so 
far, that we had to unmount all the clients (which required rebooting most of 
them), then reboot all the storage servers, to get things unstuck until the 
problem recurred.
> 
> I tried abort_recovery on the clients last night, before rebooting the 
MDS, but that did not help.  Could well be I'm not using it right:

Aborting recovery is a server-side action, not something that is done on 
the client.  As mentioned by Peter, you can abort recovery on a single target 
after it is mounted by using "lctl --device  abort_recover".  But you can 
also just skip over the recovery step when the target is mounted by adding the 
“-o abort_recov” option to the mount command.  For example, 

mount -t lustre -o abort_recov /dev/my/mdt /mnt/lustre/mdt0

And similarly for OSTs.  So you should be able to just unmount your MDT/OST 
on the running file system and then remount with the abort_recov option.  From 
a client perspective, the lustre client will get evicted but should 
automatically reconnect.   

Some applications can ride through a client eviction without causing 
issues, some cannot.  I think it depends largely on how the application does 
its IO and if there is any IO in flight when the eviction occurs.  I have had 
to do this a few times on a running cluster, and in my experience, we have had 
good luck with most of the applications continuing without issues.  Sometimes 
there are a few jobs that abort, but overall this is better than having to stop 
all jobs and remount lustre on all the compute nodes.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Patrick Farrell

Marion,

You note the deadlock reoccurs on server reboot, so you’re really stuck.  This 
is most likely due to recovery where operations from the clients are replayed.

If you’re fine with letting any pending I/O fail in order to get the system 
back up, I would suggest a client side action: unmount (-f, and be patient) 
and/or shut down all of your clients.  That will discard things the clients are 
trying to replay (causing pending I/O to fail).  Then shut down your servers 
and start them up again.  With no clients, there’s (almost) nothing to replay, 
and you probably won’t hit the issue on startup.  (There’s also the 
abort_recovery option covered in the manual, but I personally think this is 
easier.)
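
In rough outline (mount points and device names below are only illustrative):

[client]# umount -f /mnt/lustre                          # on every client; be patient
[mds]# umount /mnt/lustre/mdt0                           # stop all targets once clients are gone
[oss]# umount /mnt/lustre/ost0
[mds]# mount -t lustre /dev/mdt_dev /mnt/lustre/mdt0     # bring the MDT back first
[oss]# mount -t lustre /dev/ost_dev /mnt/lustre/ost0     # then the OSTs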

There’s no guarantee this avoids your deadlock happening again, but it’s highly 
likely it’ll at least get you running.

If you need to save your pending I/O, you’ll have to install patched software 
with a fix for this (sounds like WC has identified the bug) and then reboot.

Good luck!
- Patrick

From: lustre-discuss  on behalf of 
Marion Hakanson 
Sent: Friday, October 19, 2018 1:32:10 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

This issue is really kicking our behinds:
https://jira.whamcloud.com/browse/LU-11465

While we're waiting for the issue to get some attention from Lustre developers, 
are there suggestions on how we can recover our cluster from this kind of 
deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  Rebooting the storage 
servers does not clear the hang-up, as upon reboot the MDS quickly ends up with 
the same number of D-state threads (around the same number as we have clients). 
 It seems to me like there is some state stashed away in the filesystem which 
restores the deadlock as soon as the MDS comes up.

Thanks and regards,

Marion

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] limit on number of oss/ost's?

2018-10-11 Thread Patrick Farrell
The 160 limit has been raised.  I don't know what the new one is, but it is 
*quite* large.  I'm pretty sure it's beyond practical interest today.

There are a few issues with having extremely large numbers of OSTs, especially 
if you are explicitly trading off 1 vs many OSTs.

There are no particular scaling issues with number of OSTs of an OSS, so if you 
took the same storage and subdivided it to create more OSTs, there's no 
particular concern there.  But that assumes you're taking the same storage and 
deciding how to subdivide it - Obviously, a given amount of CPU/RAM/network on 
the OSS can only "feed" so much storage, so if you're just *adding* storage, 
you will quickly exhaust your OSS resources.  (Generally speaking one tries to 
match the two, so one does not have too much CPU/RAM/network bandwidth OR too 
much storage.)

The two main problems I see with "many" OSTs are:
1. They can get rather small, and so they can fill up relatively easily.  If 
your OSTs are really small and a few of the files assigned to that OST become 
large (so, they're assigned there when the OST is mostly empty, and then grow 
large), you'll run out of space on that OST and will no longer be able to write 
to files striped there.
2. As file stripe counts go up, the file layout - basically, the mapping from 
the byte range as seen in userspace to the actual objects on the OSTs - can 
become large enough that sending it around is a performance bottleneck.  
Opening a single file with hundreds of stripes from thousands of clients - like 
a large supercomputer center might do - can take a significant period of time.

That second is the only scaling issue with OST *count* that I'm aware of, other 
than that there is a bit of memory overhead for tracking each OST - so 10 OSTs 
instead of 1 OST will use marginally more memory on servers and clients.  This 
is pretty small, though.

So in general, I would say you'd be happier with fewer & faster, rather than 
more & slower, especially when talking about very large OST counts.  There are 
some performance issues with multiple writers to single files with low stripe 
counts, so it doesn't hold in extremis.  This is all to say you'd be much 
better served with 10 OSTs than with *1*, but 100 is probably not a better idea 
than 10.

- Patrick

On 10/11/18, 1:07 PM, "lustre-discuss on behalf of Michael Di Domenico" 
 
wrote:

Is there a limit on the number of oss servers you can have in a single
filesystem?  is there one for ost's?

I'm curious of the performance implications between two different
configurations (this is just theory mind you)...

1000 oss with 1 ost each
vs
100 oss with 10 ost each

one could scale this up further 2000, 5000, 1 oss's with 1 single ost 
each

i did note two references one from 2012 by Oleg Drokin that tested
1300 OST's at ornl which "mostly worked" and a note from Andreas last
year that quoted 2000 OST's before scaling issues.

I'm curious if there's a fundamental issue with scaling lustre, which
might be based on the presumption that oss's are typically fatter
nodes (getting fatter everyday) rather than large quantities of skinny
nodes

I'm also curious if there is still a 160 OST limit for file striping as 
well.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




Re: [lustre-discuss] lfs mirror create directory

2018-10-01 Thread Patrick Farrell

George,

Your mirror is stale - look at the output.  Mirroring in Lustre is currently a 
manual process, you have to manually resync a file after writing to it.  lfs 
mirror resync is the lfs command.

If your mirror is in sync, you should get the behavior you’re looking for.
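
For example (the mount point is illustrative; "1" is the file from your getstripe output):

[client]$ lfs mirror resync /mnt/lustre/1
[client]$ lfs getstripe /mnt/lustre/1 | grep lcme_flags   # the "stale" flag should be gone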

- Patrick


From: lustre-discuss  on behalf of 
George Melikov 
Sent: Monday, October 1, 2018 6:17:44 AM
To: Jian Yu; lustre-discuss
Subject: Re: [lustre-discuss] lfs mirror create directory

Thank you, Jian, it works this way.

I want to read/write to Lustre even if one of the OSTs holding the file is unavailable.

But now if I unmount one of the OSTs holding the file, I can't read from or write to 
this file.

Lustre v2.11.52

Does this use case work in Lustre?

# lfs getstripe ./1
./1
  lcm_layout_gen:3
  lcm_mirror_count:  2
  lcm_entry_count:   2
lcme_id: 65537
lcme_mirror_id:  1
lcme_flags:  init
lcme_extent.e_start: 0
lcme_extent.e_end:   EOF
  lmm_stripe_count:  1
  lmm_stripe_size:   1048576
  lmm_pattern:   raid0
  lmm_layout_gen:0
  lmm_stripe_offset: 2
  lmm_objects:
  - 0: { l_ost_idx: 2, l_fid: [0x10002:0x5:0x0] }

lcme_id: 131074
lcme_mirror_id:  2
lcme_flags:  init,stale
lcme_extent.e_start: 0
lcme_extent.e_end:   EOF
  lmm_stripe_count:  1
  lmm_stripe_size:   1048576
  lmm_pattern:   raid0
  lmm_layout_gen:0
  lmm_stripe_offset: 0
  lmm_objects:
  - 0: { l_ost_idx: 0, l_fid: [0x1:0x44:0x0] }
# umount /mnt/lustre/ost2
# cat ./1
...infinite wait...

So I've unmounted OST with index 2 and can't even `cat` the file.



28.09.2018, 20:17, "Jian Yu" :
> Hi George,
>
> Please run "mkdir ./mirrored/" first, then run "lfs mirror create":
>
> # mkdir ./mirrored/
>
> # lfs mirror create -N2 ./mirrored/
>
> # lfs getstripe ./mirrored/
> ./mirrored/
>   lcm_layout_gen: 0
>   lcm_mirror_count: 2
>   lcm_entry_count: 2
> lcme_id: N/A
> lcme_mirror_id: N/A
> lcme_flags: 0
> lcme_extent.e_start: 0
> lcme_extent.e_end: EOF
>   stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
>
> lcme_id: N/A
> lcme_mirror_id: N/A
> lcme_flags: 0
> lcme_extent.e_start: 0
> lcme_extent.e_end: EOF
>   stripe_count: 1 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
>
> --
> Best regards,
> Jian Yu
>
> -Original Message-
> From: lustre-discuss  on behalf of 
> George Melikov 
> Date: Friday, September 28, 2018 at 8:05 AM
> To: lustre-discuss 
> Subject: [lustre-discuss] lfs mirror create directory
>
> Does `lfs mirror create` work for directories in 2.11?
> Tried it, nope :(
>
> # lfs mirror create -N2 ./mirrored/
> lfs mirror create: cannot create composite file './mirrored/': Is a 
> directory
>
> But documentation says it's good to go...
> > lfs mirror create <--mirror-count|-N[mirror_count]
> > [setstripe_options|[--flags<=flags>]]> ... 
>
> 
> Sincerely,
> George Melikov,
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Experience with resizing MDT

2018-09-27 Thread Patrick Farrell
Andreas,

Take a closer look.  It doesn't look to be connected to anything (this is 
current master).  This is all the instances of it I see:

C symbol: mdt_enable_remote_dir

  File   Function  Line
0 mdt_internal.h251 mdt_enable_remote_dir:1,
1 mdt_lproc.c   627 LPROC_SEQ_FOPS(mdt_enable_remote
_dir);
2 mdt_handler.c  mdt_init0 5057 m->mdt_enable_remote_dir = 0;
3 mdt_lproc.cmdt_enable_remote_dir_seq  606 seq_printf(m, "%u\n",
mdt->mdt_enable_remote_dir);
4 mdt_lproc.cmdt_enable_remote_dir_seq  624 mdt->mdt_enable_remote_dir =
val;

It's there.  It's set at init, and it can be read out and set in proc...  But 
it's not connected to anything any more, unless there's an obscure macro I 
missed.  The actual checking of it was removed in the patch Cory mentioned:
https://review.whamcloud.com/#/c/12282/48/lustre/mdt/mdt_reint.c

mdt_enable_remote_dir_gid still looks to be working as expected.

- Patrick

On 9/27/18, 4:52 AM, "lustre-discuss on behalf of Andreas Dilger" 
 
wrote:

On Sep 27, 2018, at 04:13, Cory Spitz  wrote:
> 
> Hello, all.
> 
>>  If you set mdt.*.enable_remote_dir=1 then you can create directories 
that point back and forth across MDTs
> 
> I thought enable_remote_dir would be useful too, but it turns out that it 
has changed.  Patrick F. pointed out to me that it was gutted when LU-3537 was 
landed for L2.8.0.  Setting the option does nothing to change the behavior, 
which defaults to the behavior formerly made possible with enable_remote_dir=1.
> 
> Please take a look at LU-11429, which I filed to have the parameter 
removed.  The assessment may be wrong, please let us know.

The LU-3537 patch is only removing the restriction on remote rename and 
hard links, which were not allowed with DNE1 due to lack of recovery, but are 
handled correctly with DNE2 distributed transactions.

As far as I can see, "mdt_remote_dir" checks still exist in the master code.

Cheers, Andreas

> On 9/21/18, 11:28 PM, "Andreas Dilger"  wrote:
> 
>On Sep 20, 2018, at 16:38, Mohr Jr, Richard Frank (Rick Mohr) 
 wrote:
>> 
>> 
>>> On Sep 19, 2018, at 8:09 PM, Colin Faber  wrote:
>>> 
>>> Why wouldn't you use DNE?
>> 
>> I am considering it as an option, but there appear to be some potential 
drawbacks.
>> 
>> If I use DNE1, then I have to manually create directories on specific 
MDTs.  I will need to monitor MDT usage and make adjustments as necessary 
(which is not the end of the world, but still involves some additional work).  
This might be fine when I am creating new top-level directories for new 
users/projects, but any existing directories created before we add a new MDT 
will still only use MDT0.  Since the bulk of our user/project directories will 
be created early on, we still have the potential issue of running out of inodes 
on MDT0.
> 
>Note that it is possible to create remote directories at any point in 
the filesystem.  If you set mdt.*.enable_remote_dir=1 then you can create 
directories that point back and forth across MDTs.  If you also set
>mdt.*.enable_remote_dir_gid=-1 then all users can create remote 
directories.
> 
>> Based on that, I think DNE2 would be the better alternative, but it 
still has similar limitations.  The directories created initially will still be 
only striped over a single MDT.  When another MDT is added, I would need to 
recursively adjust all the existing directories to have a stripe count of 2 (or 
risk having MDT0 run out of inodes).  Based on my understanding of how the 
striped directories work, all the files in a striped directory are about evenly 
split across all the MDTs that the directory is striped across (which doesn’t 
work very well if MDT0 is mostly full and MDT1 is mostly empty).  Most likely 
we would want to have every directory striped across all MDTs, but there is a 
note in the lustre manual explicitly mentioning that it’s not a good idea to do 
this.
> 
>Yes, since remote and particularly striped directory creation has a 
non-zero overhead due to distributed transactions and ongoing extra RPC counts 
to access, it is better to limit remote and striped directories to ones that 
need it.
> 
>We're working on automating the use of DNE remote/striped directories. 
 In 2.12 it is possible to use "lfs mkdir -i -1" and "lfs mkdir -c N" to 
automatically select one or more "good" MDT(s) (where "good" == least full 
right now), or "lfs mkdir -i m,n,p,q" to select a disjoint list of MDTs.
> 
>> So that is why I was thinking that resizing the MDT might be the 
simplest approach.   Of course, I might be misunderstanding something about 
DNE2, and if that is 

Re: [lustre-discuss] Second read or write performance

2018-09-21 Thread Patrick Farrell
Firat,

I strongly suspect that careful remeasurement of flock on/off will show that 
removing the flock option had no effect at all.  It simply doesn’t DO anything 
like that - it controls a single flag that says, if you use flock operations, 
they work one way, or if it is not set, they work another way.  It does nothing 
else, and has no impact on any part of file system operation except when flocks 
are used, and dd does not use flocks. It is simply impossible for the setting 
of the flock option to affect dd or performance level or variation, unless 
something using flocks is running at the same time.  (And even then, it would 
be affecting it indirectly)

I’m pushing back strongly because I’ve repeatedly seen people on the mailing 
list speculate about turning flock off as a way to increase performance, and it 
simply isn’t.

- Patrick



From: fırat yılmaz 
Sent: Friday, September 21, 2018 7:50:51 PM
To: Patrick Farrell
Cc: adil...@whamcloud.com; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Second read or write performance

The problem was solved by adding a Lustre tuning parameter on the OSS servers:
lctl set_param obdfilter.lı-lustrefolder-OST*.brw_size=16

The flock option is required by the application running on the filesystem, so it 
is enabled.

Removing flock decreased the divergence of the fluctuations and gave about a 5% 
performance gain according to the IML dashboard.

Best Regards.

On Sat, Sep 22, 2018 at 12:56 AM Patrick Farrell 
mailto:p...@cray.com>> wrote:
Just 300 GiB, actually.  But that's still rather large and could skew things 
depending on OST size.

- Patrick

On 9/21/18, 4:43 PM, "lustre-discuss on behalf of Andreas Dilger" 
mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of adil...@whamcloud.com<mailto:adil...@whamcloud.com>> wrote:

On Sep 21, 2018, at 00:43, fırat yılmaz 
mailto:firatyilm...@gmail.com>> wrote:
>
> Hi Andreas,
> Tests are made with dd. The test folder was created by the related 
application company; I will check that when I have a connection. The OSTs have 
85-86% free space and the filesystem is mounted with the flock option; I will ask 
for it to be removed and test again.

The "flock" option shouldn't make any difference, unless the application is 
actually doing userspace file locking in the code.  Definitely "dd" will not be 
using it.

What does "lfs getstripe" on the first and second file as well as the 
parent directory show, and "lfs df" for the filesystem?

> Read test dd if=/vol1/test_read/dd.test.`hostname` of=/dev/null bs=1M 
count=30
>
> Write test dd if=/dev/zero of=/vol1/test_read/dd.test.2.`hostname` bs=1M 
count=30

This is creating a single file of 300TB in size, so that is definitely 
going to skew the space allocation.

Cheers, Andreas

>
> On Thu, Sep 20, 2018 at 10:57 PM Andreas Dilger 
mailto:adil...@whamcloud.com>> wrote:
> On Sep 20, 2018, at 03:07, fırat yılmaz 
mailto:firatyilm...@gmail.com>> wrote:
> >
> > Hi all,
> >
> > OS=Redhat 7.4
> > Lustre Version: Intel® Manager for Lustre* software 4.0.3.0
> > İnterconnect: Mellanox OFED, ConnectX-5
> > 72 OST over 6 OSS with HA
> > 1mdt and 1 mgt on 2 MDS with HA
> >
> > Lustre servers fine tuning parameters:
> > lctl set_param timeout=600
> > lctl set_param ldlm_timeout=200
> > lctl set_param at_min=250
> > lctl set_param at_max=600
> > lctl set_param obdfilter.*.read_cache_enable=1
> > lctl set_param obdfilter.*.writethrough_cache_enable=1
> > lctl set_param obdfilter.lfs3test-OST*.brw_size=16
> >
> > Lustre clients fine tuning parameters:
> > lctl set_param osc.*.checksums=0
> > lctl set_param timeout=600
> > lctl set_param at_min=250
> > lctl set_param at_max=600
> > lctl set_param ldlm.namespaces.*.lru_size=2000
> > lctl set_param osc.*OST*.max_rpcs_in_flight=256
> > lctl set_param osc.*OST*.max_dirty_mb=1024
> > lctl set_param osc.*.max_pages_per_rpc=1024
> > lctl set_param llite.*.max_read_ahead_mb=1024
> > lctl set_param llite.*.max_read_ahead_per_file_mb=1024
> >
> > Mountpoint stripe count:72 stripesize:1M
> >
> > I have a 2 PB Lustre filesystem. In the benchmark tests I get the 
optimum values for read and write, but when I start a concurrent I/O operation, 
the second job's throughput stays around 100-200 MB/s. I have tried lowering the 
stripe count to 36, but since the concurrent operations will not occur in a way 
that keeps the OST volumes in balance, I think that it's not a good way to move on; 
secondly I saw some discussion about turning off flock 

Re: [lustre-discuss] separate SSD only filesystem including HDD

2018-08-28 Thread Patrick Farrell
Hmm – It’s possible you’ve got an issue, but I think more likely is that your 
chosen benchmarks aren’t capable of showing the higher speed.

I’m not really sure about your fio test - writing 4K random blocks will be 
relatively slow and might not speed up with more disks, but I can’t speak to it 
in detail for fio.  I would try a much larger size and possibly more processes 
(is numjobs the number of concurrent processes?)…

But I am sure about your other two:
Both of those tests (dd and cp) are single threaded, and if they’re running to 
Lustre (rather than to the ZFS volume directly), 1.3 GB/s is around the maximum 
expected speed.  On a recent Xeon, one process can write a maximum of about 
1-1.5 GB/s to Lustre, depending on various details.  Improving disk speed won’t 
affect that limit for one process, it’s a client side thing.  Try several 
processes at once, ideally from multiple clients (and definitely writing to 
multiple files), if you really want to see your OST bandwidth limit.
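
As a rough sketch (paths are illustrative), several concurrent streams to 
separate files will get much closer to the aggregate limit than a single dd:

for i in $(seq 1 8); do
    dd if=/dev/zero of=/mnt/lustre/test.$i bs=16M count=2048 &
done
wait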

Also, a block size of 10GB is *way* too big for DD and will harm performance.  
It’s going to cause slowdown vs a smaller block size, like 16M or something.

There’s also limit on how fast /dev/zero can be read, especially with really 
large block sizes [it cannot provide 10 GiB of zeroes at a time, that’s why you 
had to add the “fullblock” flag, which is doing multiple reads (and writes)].  
Here’s a quick sample on a system here, writing to /dev/null (so there is no 
real limit on the write bandwidth of the destination):
dd if=/dev/zero bs=10G of=/dev/null count=1
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB, 2.0 GiB) copied, 1.66292 s, 1.3 GB/s

Notice that 1.3 GB/s, the same as your result.

Try 16M instead:
saturn-p2:/lus # dd if=/dev/zero bs=16M of=/dev/null count=1024
1024+0 records in
1024+0 records out
17179869184 bytes (17 GB, 16 GiB) copied, 7.38402 s, 2.3 GB/s

Also note that multiple dds reading from /dev/zero will run in to issues with 
the bandwidth of /dev/zero.  /dev/zero is different than most people assume – 
One would think it just magically spews zeroes at any rate needed, but it’s not 
really designed to be read at high speed and actually isn’t that fast.  If you 
really want to test high speed storage, you may need a tool that allocates 
memory and writes that out, not just dd.  (ior is one example)

From: Zeeshan Ali Shah 
Date: Tuesday, August 28, 2018 at 9:52 AM
To: Patrick Farrell 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] separate SSD only filesystem including HDD

1) fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k 
--direct=0 --size=20G --numjobs=4 --runtime=240 --group_reporting

2) time cp x x2

3) and dd if=/dev/zero of=/ssd/d.data bs=10G count=4 iflag=fullblock

If there is any other way to test this, please let me know.

/Zee



On Tue, Aug 28, 2018 at 3:54 PM Patrick Farrell 
mailto:p...@cray.com>> wrote:
How are you measuring write speed?


From: lustre-discuss 
mailto:lustre-discuss-boun...@lists.lustre.org>>
 on behalf of Zeeshan Ali Shah 
mailto:javacli...@gmail.com>>
Sent: Tuesday, August 28, 2018 1:30:03 AM
To: lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] separate SSD only filesystem including HDD

Dear All, I recently deployed a 10PB+ Lustre solution which is working fine. 
Recently, for a genomic pipeline, we acquired additional racks with dedicated compute 
nodes and a single 24-NVMe SSD server per rack.  Each SSD server is connected to a 
compute node via 100 G Omni-Path.

Issue 1: when I combine the SSDs in stripe mode using ZFS, we are not 
scaling linearly in terms of performance. For example, a single SSD's write speed is 
1.3 GB/sec, so adding 5 of those in stripe mode should give us close to 1.3x5 (or a 
bit less), but we still get 1.3 GB/sec out of those 5 SSDs.

Issue 2: if we resolve issue #1, the second challenge is to expose the 24 NVMes to the 
compute nodes in a distributed, parallel way. NFS is not an option; we tried GlusterFS, 
but due to its DHT it is slow.

I am thinking of adding another filesystem to our existing MDT and installing 
OSTs/OSS on the NVMe servers, mounting this specific SSD filesystem where needed. So 
basically we will end up having two filesystems (one with the normal 10PB+ storage and a 
second with SSD).

Does this sound correct?

Any other advice is welcome.


/Zeeshan


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] separate SSD only filesystem including HDD

2018-08-28 Thread Patrick Farrell
How are you measuring write speed?



From: lustre-discuss  on behalf of 
Zeeshan Ali Shah 
Sent: Tuesday, August 28, 2018 1:30:03 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] separate SSD only filesystem including HDD

Dear All, I recently deployed a 10PB+ Lustre solution which is working fine. 
Recently, for a genomic pipeline, we acquired additional racks with dedicated compute 
nodes and a single 24-NVMe SSD server per rack.  Each SSD server is connected to a 
compute node via 100 G Omni-Path.

Issue 1: when I combine the SSDs in stripe mode using ZFS, we are not 
scaling linearly in terms of performance. For example, a single SSD's write speed is 
1.3 GB/sec, so adding 5 of those in stripe mode should give us close to 1.3x5 (or a 
bit less), but we still get 1.3 GB/sec out of those 5 SSDs.

Issue 2: if we resolve issue #1, the second challenge is to expose the 24 NVMes to the 
compute nodes in a distributed, parallel way. NFS is not an option; we tried GlusterFS, 
but due to its DHT it is slow.

I am thinking of adding another filesystem to our existing MDT and installing 
OSTs/OSS on the NVMe servers, mounting this specific SSD filesystem where needed. So 
basically we will end up having two filesystems (one with the normal 10PB+ storage and a 
second with SSD).

Does this sound correct?

Any other advice is welcome.


/Zeeshan


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] oldest lustre deployment?

2018-08-15 Thread Patrick Farrell
No, Peter - I just meant assuming 2.7 or newer is everywhere is not a safe 
assumption!  No comment intended on what versions are safe to run.  If asked, I 
would definitely recommend something newer than 2.5.



From: Peter Jones 
Sent: Wednesday, August 15, 2018 10:25:54 AM
To: Patrick Farrell; Latham, Robert J.; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] oldest lustre deployment?


I agree that there are still sites running Lustre 1.8.x in production, but I 
don’t think that it is a reasonable assumption that 2.7 or newer isn’t safe yet 
– I think that most vendors (including your employer  ) are shipping something 
based on 2.7 or newer. My gut feeling is that if the usage survey was conducted 
today rather than six months ago that 2.10.x would come out as the clear leader 
(rather than being tied with 2.5.x). It’s also still possible to get support 
for older releases, but the matrix was “cleaned up” recently because the volume 
of information there made it hard for people to find the most current info, 
which is what most visitors were looking for. I’m fine to alter it again if 
there is demand to do so.



From: lustre-discuss  on behalf of 
Patrick Farrell 
Date: Wednesday, August 15, 2018 at 8:07 AM
To: "Latham, Robert J." , "lustre-discuss@lists.lustre.org" 

Subject: Re: [lustre-discuss] oldest lustre deployment?



Oh, yes.  Absolutely.  Many sites are running 2.5, a few are even running 
1.8.



It's not "officially supported", but that's all those matrices indicate.



Sorry, assuming 2.7 or newer isn't safe yet.  2.5 may still be the largest 
single release by usage.



Check these slides for an update from this year on what's being run:
http://cdn.opensfs.org/wp-content/uploads/2018/04/Jones-Community_Release_Update_LUG_2018.pdf



- Patrick



From: lustre-discuss  on behalf of 
Latham, Robert J. 
Sent: Wednesday, August 15, 2018 9:40:31 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] oldest lustre deployment?



I am looking at the patch to ROMIO to support the new Progressive File

Layout feature:



https://jira.whamcloud.com/browse/LU-9657

and

https://review.whamcloud.com/#/c/27869/10



This change greatly reworks ROMIO's support for determining striping count and 
stripe size.  I'm all for using the 'llapi_layout' routines instead of bare 
ioctl() calls, but llapi_layout did not show up until 2.7 I think.



Is there any chance that some environment out there is still running a

lustre from 2014 or earlier?



Nothing before 2.10 shows up in the "community matrix"



https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix



but I do see some fairly old versions in the "intel releases"



https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix+-+Intel+Releases



==rob


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] oldest lustre deployment?

2018-08-15 Thread Patrick Farrell
Oh, yes.  Absolutely.  Many sites are running 2.5, a few are even running 
1.8.


It's not "officially supported", but that's all those matrices indicate.


Sorry, assuming 2.7 or newer isn't safe yet.  2.5 may still be the largest 
single release by usage.


Check these slides for an update from this year on what's being run:
http://cdn.opensfs.org/wp-content/uploads/2018/04/Jones-Community_Release_Update_LUG_2018.pdf



- Patrick


From: lustre-discuss  on behalf of 
Latham, Robert J. 
Sent: Wednesday, August 15, 2018 9:40:31 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] oldest lustre deployment?

I am looking at the patch to ROMIO to support the new Progressive File
Layout feature:

https://jira.whamcloud.com/browse/LU-9657
and
https://review.whamcloud.com/#/c/27869/10

This change greatly reworks ROMIO's support for determining striping count and 
stripe size.  I'm all for using the 'llapi_layout' routines instead of bare 
ioctl() calls, but llapi_layout did not show up until 2.7 I think.

Is there any chance that some environment out there is still running a
lustre from 2014 or earlier?

Nothing before 2.10 shows up in the "community matrix"

https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix

but I do see some fairly old versions in the "intel releases"

https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix+-+Intel+Releases

==rob

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [lustre-devel] MDT test in rel2.11

2018-07-18 Thread Patrick Farrell

Yes, there is an intention to add it to lfs find.  Whether or not it should 
disqualify results is up to you at I/O 500 - it seems like if most users would 
think it acceptable for find most of the time (and it should be), then it 
should probably be allowed.  But at the same time, its (theoretical - couldn’t 
today) use for mdtest would very much be “writing to the benchmark” and 
defeating the intent.


From: John Bent 
Sent: Tuesday, July 17, 2018 11:54:32 PM
To: Patrick Farrell
Cc: Abe Asraoui; lustre-de...@lists.lustre.org; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-devel] MDT test in rel2.11

Thanks Patrick.  That's interesting.  However, the exact motivation why IO500 
has the 'find' command is this same intended use case; stale results therefore 
actually present an interesting dilemma to IO500.  They are not POSIX compliant 
but that loss of compliance shouldn't necessarily disqualify this result...

On Wed, Jul 18, 2018 at 12:49 AM, Patrick Farrell 
mailto:p...@cray.com>> wrote:
Lazy SoM is not landed yet, and it won’t be improving benchmark scores - it’s 
never “known 100% correct”, so it can’t be used for actual POSIX ops - if a 
file size read out is used for a write offset, then you’ve got data corruption.

So for now it’s strictly limited to tools that know about it (accessed via an 
ioctl) and can accept information that may be stale.  The intended use case is 
scanning the FS for policy application.


From: John Bent mailto:johnb...@gmail.com>>
Sent: Tuesday, July 17, 2018 10:55:24 PM
To: Patrick Farrell
Cc: Abe Asraoui; 
lustre-de...@lists.lustre.org<mailto:lustre-de...@lists.lustre.org>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-devel] MDT test in rel2.11

I'm curious about how DOM improves IO500 scores.  :)
Also LSOM but I don't know actually whether that's in 2.11 or where.

On Tue, Jul 17, 2018 at 11:33 PM, Patrick Farrell 
mailto:p...@cray.com>> wrote:

Abe,

Any benchmarking would be highly dependent on hardware, both client and server. 
 Is there a particular comparison (say, between versions) you’re interested in 
or something you’re concerned about?

- Patrick


From: lustre-devel 
mailto:lustre-devel-boun...@lists.lustre.org>>
 on behalf of Abe Asraoui mailto:a...@supermicro.com>>
Sent: Tuesday, July 17, 2018 9:23:10 PM
To: lustre-de...@lists.lustre.org<mailto:lustre-de...@lists.lustre.org>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>; Abe 
Asraoui
Subject: [lustre-devel] MDT test in rel2.11

Hi All,


Has anyone done any MDT testing under the latest rel2.11 and have benchmark 
data to share?


Thanks,
Abe


___
lustre-devel mailing list
lustre-de...@lists.lustre.org<mailto:lustre-de...@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org




___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [lustre-devel] MDT test in rel2.11

2018-07-17 Thread Patrick Farrell
To be clear in case I sound too down on it - Lazy SoM is a very nice feature 
that will speed up important use cases.  It’s just not going to jazz up mdtest 
#s.



From: Patrick Farrell
Sent: Tuesday, July 17, 2018 11:49:48 PM
To: John Bent
Cc: Abe Asraoui; lustre-de...@lists.lustre.org; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-devel] MDT test in rel2.11

Lazy SoM is not landed yet, and it won’t be improving benchmark scores - it’s 
never “known 100% correct”, so it can’t be used for actual POSIX ops - if a 
file size read out is used for a write offset, then you’ve got data corruption.

So for now it’s strictly limited to tools that know about it (accessed via an 
ioctl) and can accept information that may be stale.  The intended use case is 
scanning the FS for policy application.


From: John Bent 
Sent: Tuesday, July 17, 2018 10:55:24 PM
To: Patrick Farrell
Cc: Abe Asraoui; lustre-de...@lists.lustre.org; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-devel] MDT test in rel2.11

I'm curious about how DOM improves IO500 scores.  :)
Also LSOM but I don't know actually whether that's in 2.11 or where.

On Tue, Jul 17, 2018 at 11:33 PM, Patrick Farrell 
mailto:p...@cray.com>> wrote:

Abe,

Any benchmarking would be highly dependent on hardware, both client and server. 
 Is there a particular comparison (say, between versions) you’re interested in 
or something you’re concerned about?

- Patrick


From: lustre-devel 
mailto:lustre-devel-boun...@lists.lustre.org>>
 on behalf of Abe Asraoui mailto:a...@supermicro.com>>
Sent: Tuesday, July 17, 2018 9:23:10 PM
To: lustre-de...@lists.lustre.org<mailto:lustre-de...@lists.lustre.org>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>; Abe 
Asraoui
Subject: [lustre-devel] MDT test in rel2.11

Hi All,


Has anyone done any MDT testing under the latest rel2.11 and have benchmark 
data to share?


Thanks,
Abe


___
lustre-devel mailing list
lustre-de...@lists.lustre.org<mailto:lustre-de...@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [lustre-devel] MDT test in rel2.11

2018-07-17 Thread Patrick Farrell
Lazy SoM is not landed yet, and it won’t be improving benchmark scores - it’s 
never “known 100% correct”, so it can’t be used for actual POSIX ops - if a 
file size read out is used for a write offset, then you’ve got data corruption.

So for now it’s strictly limited to tools that know about it (accessed via an 
ioctl) and can accept information that may be stale.  The intended use case is 
scanning the FS for policy application.


From: John Bent 
Sent: Tuesday, July 17, 2018 10:55:24 PM
To: Patrick Farrell
Cc: Abe Asraoui; lustre-de...@lists.lustre.org; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-devel] MDT test in rel2.11

I'm curious about how DOM improves IO500 scores.  :)
Also LSOM but I don't know actually whether that's in 2.11 or where.

On Tue, Jul 17, 2018 at 11:33 PM, Patrick Farrell 
mailto:p...@cray.com>> wrote:

Abe,

Any benchmarking would be highly dependent on hardware, both client and server. 
 Is there a particular comparison (say, between versions) you’re interested in 
or something you’re concerned about?

- Patrick


From: lustre-devel 
mailto:lustre-devel-boun...@lists.lustre.org>>
 on behalf of Abe Asraoui mailto:a...@supermicro.com>>
Sent: Tuesday, July 17, 2018 9:23:10 PM
To: lustre-de...@lists.lustre.org<mailto:lustre-de...@lists.lustre.org>; 
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>; Abe 
Asraoui
Subject: [lustre-devel] MDT test in rel2.11

Hi All,


Has anyone done any MDT testing under the latest rel2.11 and have benchmark 
data to share?


Thanks,
Abe


___
lustre-devel mailing list
lustre-de...@lists.lustre.org<mailto:lustre-de...@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] MDT test in rel2.11

2018-07-17 Thread Patrick Farrell

Abe,

Any benchmarking would be highly dependent on hardware, both client and server. 
 Is there a particular comparison (say, between versions) you’re interested in 
or something you’re concerned about?

- Patrick


From: lustre-devel  on behalf of Abe 
Asraoui 
Sent: Tuesday, July 17, 2018 9:23:10 PM
To: lustre-de...@lists.lustre.org; lustre-discuss@lists.lustre.org; Abe Asraoui
Subject: [lustre-devel] MDT test in rel2.11

Hi All,


Has anyone done any MDT testing under the latest rel2.11 and have benchmark 
data to share?


Thanks,
Abe


___
lustre-devel mailing list
lustre-de...@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-devel-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Not able to load lustre modules on Luster client

2018-06-29 Thread Patrick Farrell
I am not certain, but I believe insmod does not attempt to fulfill 
dependencies.  What happens when you try modprobe and what are the errors in 
dmesg then?
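
For example (run on the client as root):

[client]# modprobe -v lustre     # pulls in lnet, libcfs, and the rest via module dependencies
[client]# dmesg | tail -n 20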



From: lustre-discuss  on behalf of 
vaibhav pol 
Sent: Thursday, June 28, 2018 11:44:10 PM
To: Andreas Dilger
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Not able to load lustre modules on Luster client

Hi,

I am using the precompiled rpms from site 
(https://downloads.whamcloud.com/public/lustre/lustre-2.11.0/el7.4.1708/).
Modprobe is not able to find the lnet module. I tried to insert it using insmod.
The following is the error message:



cfs_array_alloc (err 0)
cfs_get_random_bytes (err 0)
cfs_expr_list_free_list (err 0)
libcfs_register_ioctl (err 0)
cfs_percpt_lock_create (err 0)
cfs_restore_sigs (err 0)
lbug_with_loc (err 0)
libcfs_log_goto (err 0)
libcfs_debug_msg (err 0)
cfs_cpt_table (err 0)
cfs_expr_list_print (err 0)
cfs_trace_copyout_string (err 0)
cfs_cpt_current (err 0)
__x86_indirect_thunk_rax (err 0)
cfs_percpt_free (err 0)
cfs_percpt_lock_free (err 0)
cfs_rand (err 0)
cfs_percpt_unlock (err 0)
cfs_percpt_alloc (err 0)
libcfs_log_return (err 0)
lnet_insert_debugfs (err 0)
cfs_percpt_number (err 0)
cfs_expr_list_match (err 0)
cfs_trimwhite (err 0)
cfs_array_free (err 0)
libcfs_kmemory (err 0)
cfs_trace_copyin_string (err 0)
libcfs_deregister_ioctl (err 0)
cfs_srand (err 0)
cfs_block_allsigs (err 0)
cfs_str2num_check (err 0)
ktime_get_real_seconds (err 0)
__x86_indirect_thunk_rcx (err 0)
cfs_cpt_spread_node (err 0)
cfs_expr_list_values_free (err 0)
libcfs_subsystem_debug (err 0)
cfs_expr_list_free (err 0)
cfs_percpt_lock (err 0)
cfs_gettok (err 0)
cfs_expr_list_parse (err 0)
cfs_cpt_of_node (err 0)
libcfs_debug (err 0)
cfs_cpt_weight (err 0)
lprocfs_call_handler (err 0)
cfs_cpt_distance (err 0)
ktime_get_seconds (err 0)
cfs_cpt_number (err 0)
cfs_expr_list_values (err 0)



Thanks and regards,
Vaibhav Pol
HPC I
Centre for Development of Advanced Computing
Ganeshkhind Road
Pune University Campus
PUNE-Maharashtra
Phone +91-20-25704183 ext: 183
Cell Phone : +919850466409


On June 29, 2018 at 9:36 AM Andreas Dilger  wrote:
> It would be useful to include the actual error messages, in particular which 
> module symbols it is complaining about.
>
> Cheers, Andreas
>
> On Jun 28, 2018, at 22:01, vaibhav pol  wrote:
> >
> > Hi,
> > I have installed the Lustre client RPMS (Version 2.11.0) on CentOS 7.4
> > Whenever I tried to insert lnet module it give the unknown symbol message 
> > and not able to load modules. Tried to insert forcefully but that is also 
> > not working.
>

---
[ C-DAC is on Social-Media too. Kindly follow us at:
Facebook: https://www.facebook.com/CDACINDIA & Twitter: @cdacindia ]

This e-mail is for the sole use of the intended recipient(s) and may
contain confidential and privileged information. If you are not the
intended recipient, please contact the sender by reply e-mail and destroy
all copies and the original message. Any unauthorized review, use,
disclosure, dissemination, forwarding, printing or copying of this email
is strictly prohibited and appropriate legal action will be taken.
---
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] lctl ping node28@o2ib report Input/output error

2018-06-28 Thread Patrick Farrell

It seems expensive (straight mirroring rather than parity) and it's 
asynchronous from Lustre, so if you’re really just syncing the block devices, 
that can’t guarantee safety on failure.  If I understand what you’re doing, 
when a failure occurs, drbd may be in the middle of syncing the block device.  
That would likely lead to losing data you had already written and possibly to 
corrupting the on disk file system in the mirror.  (Specifically, you’d end up 
copying part of something important before the failure occurred)


From: yu sun 
Sent: Wednesday, June 27, 2018 11:26:52 PM
To: Patrick Farrell
Cc: adil...@whamcloud.com; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lctl ping node28@o2ib report Input/output error

Yes, drbd will mirror the content of block devices between hosts synchronously 
or asynchronously. This will provide us with data redundancy between hosts.
Perhaps we should use ZFS + drbd for the MDT and OSTs?

Thanks
Yu

Patrick Farrell <p...@cray.com> wrote on Wednesday, June 27, 2018 at 9:28 PM:

I’m a little puzzled - it can switch, but isn’t the data on the failed disk 
lost...?  That’s why Andreas is suggesting RAID.  Or is drbd doing syncing of 
the disk?  That seems like a really expensive way to get redundancy, since it 
would have to be full online mirroring with all the costs in hardware and 
resource usage that implies...?

ZFS is not a requirement, it generally performs a bit worse than ldiskfs but 
makes it up with impressive features to improve data integrity and related 
things.  Since it sounds like that’s not a huge concern for you, I would stick 
with ldiskfs.  It will likely be a little faster and is easier to set up.


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of yu sun <sunyu1...@gmail.com>
Sent: Wednesday, June 27, 2018 8:21:43 AM
To: adil...@whamcloud.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lctl ping node28@o2ib report Input/output error

Yes, you are right, thanks for your great suggestions.

We are currently using glusterfs to store training data for ML, and we have begun 
investigating Lustre to replace glusterfs for performance reasons.

Firstly, yes, we do want to get maximum performance. You mean we should use ZFS, 
for example, and not put each OST/MDT on separate partitions, for better 
performance?

Secondly, we don't use any underlying RAID devices, and we do configure each 
OST on a separate disk. Considering that Lustre does not provide disk data 
redundancy, we use drbd + pacemaker + corosync for data redundancy and HA; 
you can see we have configured --servicenode when running mkfs.lustre. I don't know how 
reliable this solution is. It seems OK in our current tests: when one disk 
fails, pacemaker can switch the OST over to the other machine automatically.
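For context, a failover setup like that usually just treats each OST as a filesystem resource managed by Pacemaker; a minimal sketch using the generic Filesystem agent, where the resource name, device and mount point are made up for illustration:

  pcs resource create ost12 ocf:heartbeat:Filesystem \
      device=/dev/drbd12 directory=/mnt/ost12 fstype=lustre \
      op monitor interval=120s timeout=60s
  pcs constraint location ost12 prefers node22=100
  pcs constraint location ost12 prefers node23=50

The DRBD device itself would also need to be a managed (master/slave) resource so the OST is only ever mounted where DRBD is primary.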

We also wanted to use ZFS, and I have tested ZFS with mirroring. However, if the physical 
machine goes down, the data on that machine will be lost, so we decided to use the solution 
listed above.

Now we are testing, and any suggestions are appreciated.
Thanks, Andreas.

Yours,
Yu



Andreas Dilger <adil...@whamcloud.com> wrote on Wednesday, June 27, 2018 at 7:07 PM:
On Jun 27, 2018, at 09:12, yu sun <sunyu1...@gmail.com> wrote:
>
> client:
> root@ml-gpu-ser200.nmg01:~$ mount -t lustre 
> node28@o2ib1:node29@o2ib1:/project /mnt/lustre_data
> mount.lustre: mount node28@o2ib1:node29@o2ib1:/project at /mnt/lustre_data 
> failed: Input/output error
> Is the MGS running?
> root@ml-gpu-ser200.nmg01:~$ lctl ping node28@o2ib1
> failed to ping 10.82.143.202@o2ib1: Input/output error
> root@ml-gpu-ser200.nmg01:~$
>
>
> mgs and mds:
> mkfs.lustre --mgs --reformat --servicenode=node28@o2ib1 
> --servicenode=node29@o2ib1 /dev/sdb1
> mkfs.lustre --fsname=project --mdt --index=0 --mgsnode=node28@o2ib1 
> --mgsnode=node29@o2ib1 --servicenode node28@o2ib1 --servicenode node29@o2ib1 
> --reformat --backfstype=ldiskfs /dev/sdc1

Separate from the LNet issues, it is probably worthwhile to point out some 
issues
with your configuration.  You shouldn't use partitions on the OST and MDT 
devices
if you want to get maximum performance.  That can offset all of the filesystem 
IO
from the RAID/sector alignment and hurt performance.
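In other words, the difference is just formatting the whole block device instead of a partition on it; a sketch using the device names from the commands above (other options from the original commands omitted):

  # partition-based (can push all filesystem IO off the RAID/sector alignment):
  mkfs.lustre --fsname=project --ost --index=12 --mgsnode=node28@o2ib1 /dev/sdc1
  # whole-device (keeps the alignment chosen at mkfs time):
  mkfs.lustre --fsname=project --ost --index=12 --mgsnode=node28@o2ib1 /dev/sdc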

Secondly, it isn't clear if you are using underlying RAID devices, or if you are
configuring each OST on a separate disk?  It looks like the latter - that you 
are
making each disk a separate OST.  That isn't a good idea for Lustre, since it 
does
not (yet) have any redundancy at higher layers, and any disk failure would 
result
in data loss.  You currently need to have RAID-5/6 or ZFS for each OST/MDT, 
unless
this is a really "scratch" filesystem where you don't care if the data is lost and
reformatting the filesystem is OK (i.e. low cost is the primary goal, which is fine
also, but not very common).

Re: [lustre-discuss] lctl ping node28@o2ib report Input/output error

2018-06-27 Thread Patrick Farrell

I’m a little puzzled - it can switch, but isn’t the data on the failed disk 
lost...?  That’s why Andreas is suggesting RAID.  Or is drbd doing syncing of 
the disk?  That seems like a really expensive way to get redundancy, since it 
would have to be full online mirroring with all the costs in hardware and 
resource usage that implies...?

ZFS is not a requirement, it generally performs a bit worse than ldiskfs but 
makes it up with impressive features to improve data integrity and related 
things.  Since it sounds like that’s not a huge concern for you, I would stick 
with ldiskfs.  It will likely be a little faster and is easier to set up.


From: lustre-discuss  on behalf of yu 
sun 
Sent: Wednesday, June 27, 2018 8:21:43 AM
To: adil...@whamcloud.com
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] lctl ping node28@o2ib report Input/output error

Yes, you are right, thanks for your great suggestions.

We are currently using glusterfs to store training data for ML, and we have begun 
investigating Lustre to replace glusterfs for performance reasons.

Firstly, yes, we do want to get maximum performance. You mean we should use ZFS, 
for example, and not put each OST/MDT on separate partitions, for better 
performance?

Secondly, we don't use any underlying RAID devices, and we do configure each 
OST on a separate disk. Considering that Lustre does not provide disk data 
redundancy, we use drbd + pacemaker + corosync for data redundancy and HA; 
you can see we have configured --servicenode when running mkfs.lustre. I don't know how 
reliable this solution is. It seems OK in our current tests: when one disk 
fails, pacemaker can switch the OST over to the other machine automatically.

We also wanted to use ZFS, and I have tested ZFS with mirroring. However, if the physical 
machine goes down, the data on that machine will be lost, so we decided to use the solution 
listed above.

Now we are testing, and any suggestions are appreciated.
Thanks, Andreas.

Yours,
Yu



Andreas Dilger <adil...@whamcloud.com> wrote on Wednesday, June 27, 2018 at 7:07 PM:
On Jun 27, 2018, at 09:12, yu sun <sunyu1...@gmail.com> wrote:
>
> client:
> root@ml-gpu-ser200.nmg01:~$ mount -t lustre 
> node28@o2ib1:node29@o2ib1:/project /mnt/lustre_data
> mount.lustre: mount node28@o2ib1:node29@o2ib1:/project at /mnt/lustre_data 
> failed: Input/output error
> Is the MGS running?
> root@ml-gpu-ser200.nmg01:~$ lctl ping node28@o2ib1
> failed to ping 10.82.143.202@o2ib1: Input/output error
> root@ml-gpu-ser200.nmg01:~$
>
>
> mgs and mds:
> mkfs.lustre --mgs --reformat --servicenode=node28@o2ib1 
> --servicenode=node29@o2ib1 /dev/sdb1
> mkfs.lustre --fsname=project --mdt --index=0 --mgsnode=node28@o2ib1 
> --mgsnode=node29@o2ib1 --servicenode node28@o2ib1 --servicenode node29@o2ib1 
> --reformat --backfstype=ldiskfs /dev/sdc1

Separate from the LNet issues, it is probably worthwhile to point out some 
issues
with your configuration.  You shouldn't use partitions on the OST and MDT 
devices
if you want to get maximum performance.  That can offset all of the filesystem 
IO
from the RAID/sector alignment and hurt performance.

Secondly, it isn't clear if you are using underlying RAID devices, or if you are
configuring each OST on a separate disk?  It looks like the latter - that you 
are
making each disk a separate OST.  That isn't a good idea for Lustre, since it 
does
not (yet) have any redundancy at higher layers, and any disk failure would 
result
in data loss.  You currently need to have RAID-5/6 or ZFS for each OST/MDT, 
unless
this is a really "scratch" filesystem where you don't care if the data is lost 
and
reformatting the filesystem is OK (i.e. low cost is the primary goal, which is 
fine
also, but not very common).

We are working at Lustre-level data redundancy, and there is some support for 
this
in the 2.11 release, but it is not yet in a state where you could reliably use 
it
to mirror all of the files in the filesystem.

Cheers, Andreas

>
> ost:
> ml-storage-ser22.nmg01:
> mkfs.lustre --fsname=project --reformat --mgsnode=node28@o2ib1 
> --mgsnode=node29@o2ib1  --servicenode=node22@o2ib1 --servicenode=node23@o2ib1 
> --ost --index=12 /dev/sdc1
> mkfs.lustre --fsname=project --reformat --mgsnode=node28@o2ib1 
> --mgsnode=node29@o2ib1  --servicenode=node22@o2ib1 --servicenode=node23@o2ib1 
> --ost --index=13 /dev/sdd1
> mkfs.lustre --fsname=project --reformat --mgsnode=node28@o2ib1 
> --mgsnode=node29@o2ib1  --servicenode=node22@o2ib1 --servicenode=node23@o2ib1 
> --ost --index=14 /dev/sde1
> mkfs.lustre --fsname=project --reformat --mgsnode=node28@o2ib1 
> --mgsnode=node29@o2ib1  --servicenode=node22@o2ib1 --servicenode=node23@o2ib1 
> --ost --index=15 /dev/sdf1
> mkfs.lustre --fsname=project --reformat --mgsnode=node28@o2ib1 
> --mgsnode=node29@o2ib1  --servicenode=node22@o2ib1 --servicenode=node23@o2ib1 
> --ost --index=16 /dev/sdg1
> mkfs.lustre --fsname=project --reformat --mgsnode=node28@o2ib1 
> 

Re: [lustre-discuss] Lustre 2.11 File Level Replication

2018-06-22 Thread Patrick Farrell
Mark,

Hmm.
I’m adding the list back on here, because that *seems* like it’s wrong.  Don’t 
have time to check right now, but I’m curious if others can weigh in.


- Patrick

From: Mark Roper 
Date: Friday, June 22, 2018 at 2:29 PM
To: Patrick Farrell 
Subject: Re: [lustre-discuss] Lustre 2.11 File Level Replication

Thanks Patrick!  It looks like you can't set default mirroring with setstripe.  
I appreciate the lead and the response!

On Thu, Jun 21, 2018 at 10:09 AM Patrick Farrell <p...@cray.com> wrote:

Mark,

I haven’t played specifically with FLR and inheritance/templates, but if you 
want to set a default layout on a directory, you’ll want to look at lfs 
setstripe.  Mirror extend is specifically for modifying individual, existing 
files.
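As a rough illustration of that split (paths, counts and file names are arbitrary, and whether a directory default layout can carry mirror information is exactly what this thread is questioning):

  # default layout inherited by files later created in the directory:
  lfs setstripe -c 4 -S 4M /mnt/lustre/mydir
  # add one extra replica to a single existing file:
  lfs mirror extend -N1 /mnt/lustre/mydir/output.dat
  # inspect the resulting layout:
  lfs getstripe /mnt/lustre/mydir/output.dat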

- Patrick

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Mark Roper <markro...@gmail.com>
Sent: Thursday, June 21, 2018 8:36:43 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre 2.11 File Level Replication

Hi Lustre Users,

I have set up a Lustre 2.11 cluster with multiple OSS's in order to experiment 
with the File Level Replication feature in Lustre 2.11, which I'm excited about 
since it raises the bar on availability for cloud deployment of Lustre. I 
cannot seem to get files created within a directory to inherit the mirror 
settings from the parent directory, which I thought might be possible given the 
documentation at 
http://wiki.lustre.org/File_Level_Replication_High_Level_Design :

'lfs mirror extend [--no-verify] <-N[mirror_count]> [other setstripe options|-f <victim_file>] <file_name>
This command will append a replica indicated by setstripe options or just take 
the layout from the existing file victim_file into the file file_name. The 
file_name must be an existing file but it can be a mirrored or normal file. 
This command will create a new volatile file with any optional setstripe options 
that are specified, or using the defaults inherited from the parent directory 
or filesystem.'

I was hoping to be able to make all files written to particular directories or 
within a filesystem inherit a mirror count transparently to the client writing 
or reading files.  Does anyone know if this is possible?

Cheers!

Mark Roper
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.11 File Level Replication

2018-06-21 Thread Patrick Farrell

Mark,

I haven’t played specifically with FLR and inheritance/templates, but if you 
want to set a default layout on a directory, you’ll want to look at lfs 
setstripe.  Mirror extend is specifically for modifying individual, existing 
files.

- Patrick


From: lustre-discuss  on behalf of 
Mark Roper 
Sent: Thursday, June 21, 2018 8:36:43 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre 2.11 File Level Replication

Hi Lustre Users,

I have set up a Lustre 2.11 cluster with multiple OSS's in order to experiment 
with the File Level Replication feature in Lustre 2.11, which I'm excited about 
since it raises the bar on availability for cloud deployment of Lustre. I 
cannot seem to get files created within a directory to inherit the mirror 
settings from the parent directory, which I thought might be possible given the 
documentation at 
http://wiki.lustre.org/File_Level_Replication_High_Level_Design :

'lfs mirror extend [--no-verify] <-N[mirror_count]> [other setstripe options|-f <victim_file>] <file_name>
This command will append a replica indicated by setstripe options or just take 
the layout from the existing file victim_file into the file file_name. The 
file_name must be an existing file but it can be a mirrored or normal file. 
This command will create a new volatile file with any optional setstripe options 
that are specified, or using the defaults inherited from the parent directory 
or filesystem.'

I was hoping to be able to make all files written to particular directories or 
within a filesystem inherit a mirror count transparently to the client writing 
or reading files.  Does anyone know if this is possible?

Cheers!

Mark Roper
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Do I need Lustre?

2018-04-27 Thread Patrick Farrell
One factor is probably budget - Lustre is probably a higher budget option, in 
terms of hardware and time investment.  I would guess at the 6-8 node range you 
probably don't need its speed, though you might need at least one other trick 
it has:

One thing Lustre gives that NFS does not is the ability for multiple nodes to 
write to the same file in parallel while maintaining consistency.  It's a 
clustered/parallel file system, not just a network file system.  Some codes 
require this if you want to run them across multiple nodes.

You might start by setting up whatever seems "easy" to you, probably an NFS 
share of a storage appliance you've already got, and then see what happens.  If 
users are happy and you don't seem to be spending a lot of time doing I/O, then 
you're probably OK.  If not, Lustre is more work, but you do get something for 
your labors. :)


From: lustre-discuss  on behalf of 
Brett Lee 
Sent: Friday, April 27, 2018 8:11:21 PM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Do I need Lustre?

Hi Neil,

One of the considerations in using Lustre should be the I/O patterns of your 
applications.  Lustre excels with large, sequential reads and writes.

Another are the costs, to include hardware, software, support, and coming up to 
speed with Lustre.  These components interact.  For example, having 
professional support helps with coming up to speed on Lustre. :)

Hey Michael!


On Fri, Apr 27, 2018, 12:22 PM Hebenstreit, Michael 
> wrote:

You can do a simple test. Run a small sample of your application directly out of 
/dev/shm (the ram-disk). Then run it from the NFS file server. If you measure 
significant speedups your application is I/O sensitive and a Lustre configured 
with OPA or other InfiniBand solution will help.
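A minimal sketch of that comparison (paths and the application command are placeholders):

  cp -r /shared/mycase /dev/shm/mycase
  ( cd /dev/shm/mycase && time ./run_app )    # memory-backed, no filesystem in the way
  ( cd /nfs/mycase     && time ./run_app )    # the same case from the NFS server

If the two wall times are close, the workload is not I/O-bound and a parallel filesystem is unlikely to change much.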



From: lustre-discuss 
[mailto:lustre-discuss-boun...@lists.lustre.org]
 On Behalf Of Thackeray, Neil L
Sent: Friday, April 27, 2018 11:08 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Do I need Lustre?



I’m new to the cluster realm, so I’m hoping for some good advice. We are 
starting up a new cluster, and I’ve noticed that lustre seems to be used widely 
in datacenters. The thing is I’m not sure the scale of our cluster will need it.



We are planning a small cluster, starting with 6 -8 nodes with 2 GPUs per node. 
They will be used for Deep Learning, MRI data processing, and Matlab among 
other things. With the size of the cluster we figure that 10Gb networking will 
be sufficient. We aren’t going to allow persistent storage on the cluster. 
Users will just upload and download data. I’m mostly concerned about I/O 
speeds. I don’t know if NFS would be fast enough to handle the data.



We are hoping that the cluster will grow over time. We are already talking 
about buying more nodes next fiscal year.



Thanks.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Upgrade to 2.11: unrecognized mount option

2018-04-11 Thread Patrick Farrell
I think you missed it – It came out a few days ago, and Peter Jones announced 
it in what I assume was the usual manner.  Maybe there’s a “which lists it was 
sent to” issue?


- Patrick

From: lustre-discuss  on behalf of 
"E.S. Rosenberg" 
Date: Wednesday, April 11, 2018 at 10:58 AM
To: "Dilger, Andreas" 
Cc: "lustre-discuss@lists.lustre.org" 
Subject: Re: [lustre-discuss] Upgrade to 2.11: unrecognized mount option

OT: Did I miss the mail announcing Lustre 2.11, or are all the recent 2.11 mails 
from early adopters?
Thanks,
Eli

On Sat, Apr 7, 2018 at 2:58 AM, Dilger, Andreas 
> wrote:
On Apr 6, 2018, at 06:27, Thomas Roth > 
wrote:
>
> Hi all,
>
> (don't know if it isn't a bit early to complain yet, but)
> I have upgraded an OSS and MDS von 2.10.2 to 2.11.0, just installing the 
> downloaded rpms - no issues here, except when mounting the MDS:
>
> > LDISKFS-fs (drbd0): Unrecognized mount option 
> > "context="unconfined_u:object_r:user_tmp_t:s0"" or missing value
>
> This mount option is visible also by 'tunefs.lustre --dryrun', so I followed 
> a tip on this list from last May and did
>
> > tunefs.lustre --mountfsoptions="user_xattr,errors=remount-ro" /dev/drbd0
>
> = keeping the rest of the mount options. Afterwards the mount worked.
>
>
> I checked, I formatted this MDS with
>
> > mkfs.lustre --reformat --mgs --mdt --fsname=hebetest --index=0
> --servicenode=10.20.1.198@o2ib5 --servicenode=10.20.1.199@o2ib5
> --mgsnode=10.20.1.198@o2ib5 --mgsnode=10.20.1.199@o2ib5
> --mkfsoptions="-E stride=4,stripe-width=20 -O flex_bg,mmp,uninit_bg" 
> /dev/drbd0
>
>
> Just the defaults here?
> Where did the unknown mount option come from, and what does it mean anyway?

I suspect it's automatically added by SELinux, but I couldn't tell you
where or why.  Hopefully one of the people more familiar with SELinux
can answer, and it can be handled properly in the future.
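A quick way to test that theory on the affected server (a sketch; the device name is the one from the message above):

  getenforce                           # Enforcing / Permissive / Disabled
  tunefs.lustre --dryrun /dev/drbd0    # shows the persistent mount options recorded on the target
  grep lustre /etc/fstab               # check whether a context= option crept into the mount entry

If the context= option comes back after a clean tunefs, that would point at something on the mount path adding it rather than at the on-disk configuration.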

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] latest kernel version supported by Lustre ?

2018-04-09 Thread Patrick Farrell
Peter,

Unfortunately, Riccardo was asking about server support.

- Patrick


From: lustre-discuss  on behalf of 
Jones, Peter A 
Sent: Monday, April 9, 2018 7:24:52 AM
To: Dilger, Andreas; Riccardo Veraldi
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] latest kernel version supported by Lustre ?

We’re more up to date than that - we have 4.12 client support in 2.10.3 and 
2.11 (see LU-9558). We’re tracking 4.14 client support under LU-10560 and the 
last couple of patches just missed out on 2.11 but should land to master in the 
coming days. Work to track 4.15 is underway under LU-10805. James Simmons may 
well elaborate.




On 2018-04-08, 12:17 PM, "lustre-discuss on behalf of Dilger, Andreas" 
 
wrote:

>What version of Lustre?  I think 2.11 clients work with something like 4.8? 
>kernels, while 2.10 works with 4.4?  Sorry, I can't check the specifics right 
>now.
>
>If you need a specific kernel, the best thing to do is try the configure/build 
>step for Lustre with that kernel, and then check Jira/Gerrit for tickets for 
>each build failure you hit.
>
>It may be that there are some unlanded patches that can get you a running 
>client.
>
>Cheers, Andreas
>
>> On Apr 7, 2018, at 09:48, Riccardo Veraldi  
>> wrote:
>>
>> Hello,
>>
>> if I would like to use kernel 4.* from elrepo on RHEL74 for the lustre
>> OSSes what is the latest supported kernel 4 version  by Lustre ?
>>
>> thank you
>>
>>
>> Rick
>>
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>___
>lustre-discuss mailing list
>lustre-discuss@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.3 client fails to compile on centos 6.5

2018-04-08 Thread Patrick Farrell
Are you not able to move to a newer version even of CentOS6?  6.5 is no longer 
supported and it looks like you would have to revert some Lustre patches to get 
the newest client to build.
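If staying on that exact 2.6.32-431 kernel is unavoidable, one option is to build an older client release from source against it rather than using the 2.10.3 packages; a sketch, where the release number is only an example and may itself need patches on a kernel that old:

  tar xzf lustre-2.9.0.tar.gz && cd lustre-2.9.0
  ./configure --disable-server --with-linux=/usr/src/kernels/2.6.32-431.el6.x86_64
  make rpms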



From: lustre-discuss  on behalf of 
Alex Vodeyko 
Sent: Sunday, April 8, 2018 4:32:36 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Lustre 2.10.3 client fails to compile on centos 6.5

Hi,

I'm trying to build lustre 2.10.3 client on centos 6.5 (kernel:
2.6.32-431.el6.x86_64) with "rpmbuild  --rebuild --without servers
lustre-2.10.3-1.src.rpm" and it fails with:

/root/rpmbuild/BUILD/lustre-2.10.3/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c:
In function 'ldlm_lock_remove_from_lru_check':
/root/rpmbuild/BUILD/lustre-2.10.3/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.c:267:
error: implicit declaration of function 'ktime_compare'
make[6]: *** 
[/root/rpmbuild/BUILD/lustre-2.10.3/lustre/ptlrpc/../../lustre/ldlm/ldlm_lock.o]
Error 1
make[5]: *** [/root/rpmbuild/BUILD/lustre-2.10.3/lustre/ptlrpc] Error 2
make[5]: *** Waiting for unfinished jobs
make[4]: *** [/root/rpmbuild/BUILD/lustre-2.10.3/lustre] Error 2
make[3]: *** [_module_/root/rpmbuild/BUILD/lustre-2.10.3] Error 2
make[2]: *** [modules] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Have to keep centos-6.5 for the app compatibility, so could you please
help with it?

Many thanks,
Alex
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] varying sequential read performance.

2018-04-03 Thread Patrick Farrell
John,

There’s a simple explanation for that lack of top line performance benefit - 
you’re not reading 16 GB then 16 GB then 16 GB etc.  It’s interleaved.

Read ahead will do large reads, much larger than your 1 MiB i/o size, so it’s 
all interleaved from four sources on every actual read operation.

So you’re effectively pulling from all four sources at the same time 
throughout, so one of them completing faster just means you wait for the others 
to get their work done.  A similar effect would be more obvious if you had four 
independent files going in parallel, as you’d see the fastest file complete first. This 
is subtler but it’s the same effect.
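For anyone who wants to see how aggressively the client is reading ahead, the usual llite tunables and counters can be inspected without changing anything (a sketch; these parameters are generally readable by non-root users):

  lctl get_param llite.*.max_read_ahead_mb
  lctl get_param llite.*.max_read_ahead_per_file_mb
  lctl get_param llite.*.read_ahead_stats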

- Patrick



From: lustre-discuss  on behalf of 
John Bauer 
Sent: Tuesday, April 3, 2018 1:23:30 AM
To: Colin Faber
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] varying sequential read performance.

Colin

Since I do not have root privileges on the system, I do not have access to 
dropcache.  So, no, I do not flush cache between the dd runs.  The 10 dd runs 
were done in a single
job submission and the scheduler does dropcache between jobs, so the first of 
the dd passes does start with a virgin cache.  What strikes me as odd about this 
is the first dd
run is the slowest and obviously must read all the data from the OSSs, which is 
confirmed by the plot I have added to the top, which indicates the total amount 
of data moved
via lnet during the life of each dd process.  Notice that the second dd run, 
which lnetstats indicates also moves the entire 64 GB file from the OSSs, is 3 
times faster, and has
to work with a non-virgin cache.  Runs 4 through 10 all move only 48GB via lnet 
because one of the OSCs keeps its entire 16GB that is needed in cache across 
all the runs.
Even with the significant advantage that runs 4-10 have, you could never tell 
in the dd results.  Run 5 is slightly faster than run 2, and run 7 is as slow 
as run 0.

John


[inline image: plot of per-OSC cached memory and total LNet data moved during each dd run]

On 4/3/2018 12:20 AM, Colin Faber wrote:
Are you flushing cache between test runs?

On Mon, Apr 2, 2018, 6:06 PM John Bauer 
> wrote:
I am running dd 10 times consecutively to  read a 64GB file ( stripeCount=4 
stripeSize=4M ) on a Lustre client(version 2.10.3) that has 64GB of memory.
The client node was dedicated.

for pass in 1 2 3 4 5 6 7 8 9 10
do
   dd of=/dev/null if=${file} count=128000 bs=512K
done

Instrumentation of the I/O from dd reveals varying performance.  In the plot 
below, the bottom frame has wall time
on the X axis, and file position of the dd reads on the Y axis, with a dot 
plotted at the wall time and starting file position of every read.
The slopes of the lines indicate the data transfer rate, which vary from 
475MB/s to 1.5GB/s.  The last 2 passes have sharp breaks
in the performance, one with increasing performance, and one with decreasing 
performance.

The top frame indicates the amount of memory used by each of the file's 4 OSCs 
over the course of the 10 dd runs.  Nothing terribly odd here except that
one of the OSCs eventually has its entire stripe ( 16GB ) cached and then 
never gives any up.

I should mention that the file system has 320 OSTs.  I found LU-6370 which 
eventually started discussing LRU management issues on systems with high
numbers of OST's leading to reduced RPC sizes.

Any explanations for the varying performance?
Thanks,
John

[inline image: plot of dd read file positions vs. wall time, with per-OSC cached memory on top]

--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
I/O Doctors, LLC
507-766-0378
bau...@iodoctors.com
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Static lfs?

2018-03-23 Thread Patrick Farrell
Another off list note pointing out that lfs is likely a script now.  So here's 
the bitter end:


Ah, it looks like you're correct.  There's still an lfs.c but it no longer 
generates the "lfs" executable as it previously - Instead there's a lengthy and 
complex script named "lfs" which is not invoked by "make", but only during the 
install process.  That generates the lfs binary that is actually installed...

Uck.  Well, I found where it squirrels away the real binary when executed.

Run the script lustre/utils/lfs in your build dir, and it will start lfs.  Quit 
it, and you will find the actual lfs binary in lustre/utils/.libs/lt-lfs
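A sketch of that workaround, grounded in the steps above (the destination host is just a placeholder):

  cd lustre/utils
  ./lfs              # the libtool wrapper links the real binary; type "quit" at the lfs > prompt
  ldd .libs/lt-lfs   # check which shared libraries it still expects before copying it anywhere
  scp .libs/lt-lfs testnode:/tmp/lfs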

Maybe this particular bit of build tooling would be clearer if it didn't try to 
pretend it didn't exist by aping the binary without actually being it?

Thanks to John Bauer for help with this.



From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Patrick Farrell <p...@cray.com>
Sent: Friday, March 23, 2018 3:17:14 PM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Static lfs?


Ah, interesting – I got a question off list about this, but I thought I’d reply 
here.



‘ldd’ on the lfs binary says “not a dynamic executable”.



So it seems I’m confused (never was much for compilers and linkers).  Here are 
the errors I get trying to run it on another node:
./lfs: line 202: cd: /home/build/paf/[……..]/lustre/utils: No such file or 
directory

gcc: error: lfs.o: No such file or directory

gcc: error: lfs_project.o: No such file or directory

gcc: error: ./.libs/liblustreapi.so: No such file or directory

gcc: error: ../../lnet/utils/lnetconfig/.libs/liblnetconfig.so



From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Patrick Farrell <p...@cray.com>
Date: Friday, March 23, 2018 at 3:03 PM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Static lfs?



Good afternoon,



I’ve got a developer question that perhaps someone has some insight on.  After 
some recent (a few months ago now) changes to make the Lustre libraries and 
utilities build dynamically linked rather than statically linked, I’ve got a 
problem.  If I build an lfs binary just by doing “make”, the resultant binary 
looks for various libraries in the build directories and cannot be run on any 
system other than the one it was built on (well, I guess without replicating 
the build directory structure).  When doing make rpms and installing the RPMs, 
it works fine.  The problem is “make rpms” takes ~5 minutes, as opposed to ~1 
second for “make” in /utils.  (I assume “make install” works too, but I 
explicitly need to test on nodes other than the one where I’m doing the build, 
so that’s not an option.)



Does anyone have any insight on a way around this for a developer?  Either some 
tweak I can make locally to get static builds again, or some fix to make that 
would let the dynamically linked binary from “make” have correct library paths? 
 (To be completely clear: The dynamically linked binary from “make” looks for 
libraries in the locations where they are built, regardless of whether or not 
they’re already installed in the normal system library locations.)



Regards,

Patrick Farrell


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Static lfs?

2018-03-23 Thread Patrick Farrell
Ah, interesting – I got a question off list about this, but I thought I’d reply 
here.

‘ldd’ on the lfs binary says “not a dynamic executable”.

So it seems I’m confused (never was much for compilers and linkers).  Here are 
the errors I get trying to run it on another node:
./lfs: line 202: cd: /home/build/paf/[……..]/lustre/utils: No such file or 
directory
gcc: error: lfs.o: No such file or directory
gcc: error: lfs_project.o: No such file or directory
gcc: error: ./.libs/liblustreapi.so: No such file or directory
gcc: error: ../../lnet/utils/lnetconfig/.libs/liblnetconfig.so

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Patrick Farrell <p...@cray.com>
Date: Friday, March 23, 2018 at 3:03 PM
To: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Static lfs?

Good afternoon,

I’ve got a developer question that perhaps someone has some insight on.  After 
some recent (a few months ago now) changes to make the Lustre libraries and 
utilities build dynamically linked rather than statically linked, I’ve got a 
problem.  If I build an lfs binary just by doing “make”, the resultant binary 
looks for various libraries in the build directories and cannot be run on any 
system other than the one it was built on (well, I guess without replicating 
the build directory structure).  When doing make rpms and installing the RPMs, 
it works fine.  The problem is “make rpms” takes ~5 minutes, as opposed to ~1 
second for “make” in /utils.  (I assume “make install” works too, but I 
explicitly need to test on nodes other than the one where I’m doing the build, 
so that’s not an option.)

Does anyone have any insight on a way around this for a developer?  Either some 
tweak I can make locally to get static builds again, or some fix to make that 
would let the dynamically linked binary from “make” have correct library paths? 
 (To be completely clear: The dynamically linked binary from “make” looks for 
libraries in the locations where they are built, regardless of whether or not 
they’re already installed in the normal system library locations.)

Regards,
Patrick Farrell

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Static lfs?

2018-03-23 Thread Patrick Farrell
Good afternoon,

I’ve got a developer question that perhaps someone has some insight on.  After 
some recent (a few months ago now) changes to make the Lustre libraries and 
utilities build dynamically linked rather than statically linked, I’ve got a 
problem.  If I build an lfs binary just by doing “make”, the resultant binary 
looks for various libraries in the build directories and cannot be run on any 
system other than the one it was built on (well, I guess without replicating 
the build directory structure).  When doing make rpms and installing the RPMs, 
it works fine.  The problem is “make rpms” takes ~5 minutes, as opposed to ~1 
second for “make” in /utils.  (I assume “make install” works too, but I 
explicitly need to test on nodes other than the one where I’m doing the build, 
so that’s not an option.)

Does anyone have any insight on a way around this for a developer?  Either some 
tweak I can make locally to get static builds again, or some fix to make that 
would let the dynamically linked binary from “make” have correct library paths? 
 (To be completely clear: The dynamically linked binary from “make” looks for 
libraries in the locations where they are built, regardless of whether or not 
they’re already installed in the normal system library locations.)

Regards,
Patrick Farrell

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] File locking errors.

2018-02-20 Thread Patrick Farrell
Entirely correct.

Technically, there is a very small cost difference between calling flock with 
localflock on vs with flock (even in absence of contention), but it really 
should be very small, and it would only be in the flock call itself, no general 
slowdown or similar.  I would not expect that difference to be measurable in 
the context of real use of the flock code.

On 2/20/18, 9:00 AM, "Prentice Bisbal" <pbis...@pppl.gov> wrote:


On 02/20/2018 08:58 AM, Patrick Farrell wrote:
> There is almost NO overhead to this locking unless you’re using it to 
keep threads away from each other on multiple nodes, in which case the time 
is spent doing the waiting your app is asking for.
>
> Lustre is doing implicit metadata and data locking constantly throughout 
normal operation, all active clients at all times, this is just a little more 
locking, of an explicit kind.  It should be almost impossible to measure a cost 
to flock vs localflock unless you’re really going wild with your file locking, 
in which case it would be worth your while to modify your job to use a better 
kind of concurrency control, because even localflock is slow compared to MPI 
communication.
So, just to be clear, you are saying that file locking, whether local or 
global, creates little overhead/performance penalty, except for the case 
when an application is actively using global filelocking, and then it's 
only because that app is waiting for the lock to be freed?

Put another way enabling local or global file-locking will not affect 
the overall performance of Lustre. It will only affect the apps are 
actually calling flock, and only when there is contention for a lock, 
which is unavoidable, and the whole point of file-locking in the first 
place. Is that correct?

Prentice
>
>
> 
> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf 
of Michael Di Domenico <mdidomeni...@gmail.com>
> Sent: Tuesday, February 20, 2018 6:47:16 AM
> To: Prentice Bisbal
> Cc: lustre-discuss
> Subject: Re: [lustre-discuss] File locking errors.
>
> On Fri, Feb 16, 2018 at 11:52 AM, Prentice Bisbal <pbis...@pppl.gov> 
wrote:
>> On 02/15/2018 06:30 PM, Patrick Farrell wrote:
>>> Localflock will only provide flock between threads on the same node.  I
>>> would describe it as “likely to result in data corruption unless used 
with
>>> extreme care”.
>> I can't agree with this enough. Someone runs a single node job and thinks
>> "file locking works just fine!" and the runs a large multinode job, and 
then
>> wonders why the output files are all messed up. I think enabling 
filelocking
>> must be an all or nothing thing.
> there are a few instances where this proves false.  think mpi job that
> spawns openmp threads, but needs a local scratch file to run
> out-of-core work...  also keep in mind locking across a large number
> of clients imparts some performance penalty on the file system.
>
> on a side note, in a previous email you stated your lustre version,
> just a word from the wise, don't fall behind unless it's a managed
> vendor solution.  It's painful to update the servers, but falling
> behind and then trying to update the servers is much worse.  I did it
> once and have the scars to prove it... :)
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] File locking errors.

2018-02-20 Thread Patrick Farrell
There is almost NO overhead to this locking unless you’re using it to keep 
threads away from each other on multiple nodes, in which case the time is 
spent doing the waiting your app is asking for.

Lustre is doing implicit metadata and data locking constantly throughout normal 
operation, all active clients at all times, this is just a little more locking, 
of an explicit kind.  It should be almost impossible to measure a cost to flock 
vs localflock unless you’re really going wild with your file locking, in which 
case it would be worth your while to modify your job to use a better kind of 
concurrency control, because even localflock is slow compared to MPI 
communication.



From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Michael Di Domenico <mdidomeni...@gmail.com>
Sent: Tuesday, February 20, 2018 6:47:16 AM
To: Prentice Bisbal
Cc: lustre-discuss
Subject: Re: [lustre-discuss] File locking errors.

On Fri, Feb 16, 2018 at 11:52 AM, Prentice Bisbal <pbis...@pppl.gov> wrote:
> On 02/15/2018 06:30 PM, Patrick Farrell wrote:
>> Localflock will only provide flock between threads on the same node.  I
>> would describe it as “likely to result in data corruption unless used with
>> extreme care”.
>
> I can't agree with this enough. Someone runs a single node job and thinks
> "file locking works just fine!" and the runs a large multinode job, and then
> wonders why the output files are all messed up. I think enabling filelocking
> must be an all or nothing thing.

there are a few instances where this proves false.  think mpi job that
spawns openmp threads, but needs a local scratch file to run
out-of-core work...  also keep in mind locking across a large number
of clients imparts some performance penalty on the file system.

on a side note, in a previous email you stated your lustre version,
just a word from the wise, don't fall behind unless it's a managed
vendor solution.  It's painful to update the servers, but falling
behind and then trying to update the servers is much worse.  I did it
once and have the scars to prove it... :)
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] File locking errors.

2018-02-15 Thread Patrick Farrell
Ah, it sounds like HDF is using file locking to keep *other* possible users 
out, rather than for its own consistency requirements.  It's only trying to use 
flock on Lustre because configure detects it's on a platform (Linux) that 
supports flocks.  But Lustre doesn't necessarily.  That makes sense, and it 
would mean localflock would be safe, unless you had some other application 
which looked for flocks before accessing a file.


From: Arman Khalatyan <arm2...@gmail.com>
Sent: Thursday, February 15, 2018 5:38:39 PM
To: Patrick Farrell
Cc: E.S. Rosenberg; Alexander I Kulyavtsev; Lustre discussion
Subject: Re: [lustre-discuss] File locking errors.

OK, you are right, localflock might be a problem with parallel access, but at 
least our code started to work after that. Just for information, the relevant 
thread on the HDF forum is the following:
https://lists.hdfgroup.org/pipermail/hdf-forum_lists.hdfgroup.org/2016-May/009483.html

On Feb 16, 2018 at 12:30 AM, "Patrick Farrell" <p...@cray.com> wrote:


Localflock will only provide flock between threads on the same node.  I would 
describe it as “likely to result in data corruption unless used with extreme 
care”.

Are you sure HDF only ever uses flocks between threads on the same node?  That 
seems extremely unlikely or maybe impossible for HDF.  You should definitely 
use flock, which gets flocks working across nodes, and is supported with all 
vaguely recent versions of Lustre.
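Concretely, that means remounting the clients with the flock option; a minimal sketch, with the NID and filesystem name as placeholders:

  umount /mnt/lustre
  mount -t lustre -o flock mgsnode@o2ib:/fsname /mnt/lustre
  # or persistently in /etc/fstab:
  # mgsnode@o2ib:/fsname  /mnt/lustre  lustre  flock,_netdev  0 0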


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Arman Khalatyan <arm2...@gmail.com>
Sent: Thursday, February 15, 2018 5:19:14 PM
To: E.S. Rosenberg
Cc: Alexander I Kulyavtsev; Lustre discussion
Subject: Re: [lustre-discuss] File locking errors.

We had similar troubles with HDF5 1.10 vs HDF5 1.8.x on Lustre.
The new HDF5 requires flock support from the underlying filesystem (due to 
security reasons or whatever; more info on HDF5 can be dug up in the HDF forums).
To fix the mounts you should unmount and mount again with the option localflock; 
this works for us, independent of the Lustre version.
That is what we did:

https://arm2armcos.blogspot.de/2018/02/hdf5-v110-or-above-on-lustre-fs.html?m=1




On Feb 15, 2018 at 11:18 PM, "E.S. Rosenberg" <esr+lus...@mail.hebrew.edu> wrote:


On Fri, Feb 16, 2018 at 12:00 AM, Colin Faber <cfa...@gmail.com> wrote:
If the mount on the user's clients had the various options enabled, and those 
aren't present in fstab, you'd end up with such behavior. Also, 2.8? Can you 
upgrade to 2.10 LTS?
Depending on when they installed their system, that may not be such a 'small' 
change. Our 2.8 is running on CentOS 6.8, so an upgrade to 2.10 requires us to 
also upgrade the OS from 6.x to 7.x, and though I very much want to do that, it 
is a more intensive process that so far I have not had the time for, and I can 
imagine others have the same issue.
Regards,
Eli



On Feb 15, 2018 1:06 PM, "Prentice Bisbal" <pbis...@pppl.gov> wrote:

No. Several others have asked me the same thing, so that seems like it might be 
the issue. The only problem with that solution is that the user claimed his 
program worked just fine up until a couple of weeks ago, so if that is the 
issue, I'll still be scratching my head trying to figure out how/what changed


Prentice

On 02/15/2018 12:31 PM, Alexander I Kulyavtsev wrote:
Do you have flock option in fstab for lustre mount or in command you use to 
mount lustre on client?

Search for flock on lustre wiki
http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes
or lustre manual
http://doc.lustre.org/lustre_manual.pdf

Here are links where to start learning about lustre:
* http://lustre.org/getting-started-with-lustre/
* http://wiki.lustre.org
* https://wiki.hpdd.intel.com
* jira.hpdd.intel.com
* http://opensfs.org/lustre/

Alex.

On Feb 15, 2018, at 11:02 AM, Prentice Bisbal <pbis...@pppl.gov> wrote:

Hi.

I'm an experienced HPC system admin, but I know almost nothing about Lustre 
administration. The system admin who administered our small Lustre filesystem 
recently retired, and no one has filled that gap yet. A user recently reported 
they are now getting file-locking errors from a program they've run repeatedly 
on Lustre in the past. When they run the same program on an NFS filesystem, the 
error goes away.

Re: [lustre-discuss] File locking errors.

2018-02-15 Thread Patrick Farrell


Localflock will only provide flock between threads on the same node.  I would 
describe it as “likely to result in data corruption unless used with extreme 
care”.

Are you sure HDF only ever uses flocks between threads on the same node?  That 
seems extremely unlikely or maybe impossible for HDF.  You should definitely 
use flock, which gets flocks working across nodes, and is supported with all 
vaguely recent versions of Lustre.


From: lustre-discuss  on behalf of 
Arman Khalatyan 
Sent: Thursday, February 15, 2018 5:19:14 PM
To: E.S. Rosenberg
Cc: Alexander I Kulyavtsev; Lustre discussion
Subject: Re: [lustre-discuss] File locking errors.

We had similar troubles with HDF5 1.10 vs HDF5 1.8.x on Lustre.
The new HDF5 requires flock support from the underlying filesystem (due to 
security reasons or whatever; more info on HDF5 can be dug up in the HDF forums).
To fix the mounts you should unmount and mount again with the option localflock; 
this works for us, independent of the Lustre version.
That is what we did:

https://arm2armcos.blogspot.de/2018/02/hdf5-v110-or-above-on-lustre-fs.html?m=1




On Feb 15, 2018 at 11:18 PM, "E.S. Rosenberg" wrote:


On Fri, Feb 16, 2018 at 12:00 AM, Colin Faber 
> wrote:
If the mount on the user's clients had the various options enabled, and those 
aren't present in fstab, you'd end up with such behavior. Also, 2.8? Can you 
upgrade to 2.10 LTS?
Depending on when they installed their system, that may not be such a 'small' 
change. Our 2.8 is running on CentOS 6.8, so an upgrade to 2.10 requires us to 
also upgrade the OS from 6.x to 7.x, and though I very much want to do that, it 
is a more intensive process that so far I have not had the time for, and I can 
imagine others have the same issue.
Regards,
Eli



On Feb 15, 2018 1:06 PM, "Prentice Bisbal" 
> wrote:

No. Several others have asked me the same thing, so that seems like it might be 
the issue. The only problem with that solution is that the user claimed his 
program worked just fine up until a couple of weeks ago, so if that is the 
issue, I'll still be scratching my head trying to figure out how/what changed


Prentice

On 02/15/2018 12:31 PM, Alexander I Kulyavtsev wrote:
Do you have flock option in fstab for lustre mount or in command you use to 
mount lustre on client?

Search for flock on lustre wiki
http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes
or lustre manual
http://doc.lustre.org/lustre_manual.pdf

Here are links where to start learning about lustre:
* http://lustre.org/getting-started-with-lustre/
* http://wiki.lustre.org
* https://wiki.hpdd.intel.com
* jira.hpdd.intel.com
* http://opensfs.org/lustre/

Alex.

On Feb 15, 2018, at 11:02 AM, Prentice Bisbal 
> wrote:

Hi.

I'm an experienced HPC system admin, but I know almost nothing about Lustre 
administration. The system admin who administered our small Lustre filesystem 
recently retired, and no one has filled that gap yet. A user recently reported 
they are now getting file-locking errors from a program they've run repeatedly 
on Lustre in the past. When they run the same program on an NFS filesystem, the 
error goes away. I've cut-and-pasted the error messages below.

Since I have no real experience as a Lustre admin, I turned to google, and it 
looks like it might be that the file-locking daemon died (if Lustre has a 
separate file-lock daemon), or somehow file-locking was recently disabled. If 
that is possible, how do I check this, and restart or re-enable if necessary?  
I skimmed the user manual, and could not find anything on either of these 
issues.

Any and all help will be greatly appreciated.

Some of the error messages:

HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) MPI-process 9:
  #000: H5F.c line 579 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
  #001: H5Fint.c line 1168 in H5F_open(): unable to lock the file or initialize 
file structure
major: File accessibilty
minor: Unable to open file
  #002: H5FD.c line 1821 in H5FD_lock(): driver lock request failed
major: Virtual File Layer
minor: Can't update object
  #003: H5FDsec2.c line 939 in H5FD_sec2_lock(): unable to flock file, errno = 
38, error message = 'Function not implemented'
major: File accessibilty
minor: Bad file ID accessed
Error: couldn't open file HDF5-DIAG: Error detected in HDF5 (1.10.0-patch1) 
MPI-process 13:
  #000: H5F.c line 579 in H5Fopen(): unable to open file
major: File accessibilty
minor: Unable to open file
  #001: H5Fint.c line 1168 in H5F_open(): unable to lock the file or initialize 
file structure
major: File accessibilty
minor: Unable to open 

Re: [lustre-discuss] Are there any performance hits with the https://access.redhat.com/security/vulnerabilities/speculativeexecution?

2018-01-08 Thread Patrick Farrell
Note though that since the servers live in kernel space they are also going to 
be affected only minimally.  The Lustre server code itself will see zero 
effect, since it’s entirely kernel code.  Other things running on those servers 
may see impact, and if there’s enough user space stuff, increased usage there 
could reduce resources available for Lustre.

Note also it’s important to distinguish here: the issue is not context switches 
(which is scheduling a different process), it’s syscalls, which do not require 
a context switch.  Context switches already had this sort of overhead.  A 
syscall is not a context switch.  (But the KPTI changes make the effective 
difference smaller.)
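For anyone benchmarking this, it is worth recording whether the page-table-isolation mitigation is actually active on the nodes being compared; a sketch, noting that the exact files and messages depend on the kernel build:

  cat /sys/devices/system/cpu/vulnerabilities/meltdown   # e.g. "Mitigation: PTI", where the file exists
  dmesg | grep -i 'page tables isolation'                 # upstream kernels log whether KPTI is enabled
  grep -o 'nopti\|pti=off' /proc/cmdline                  # was the mitigation disabled at boot?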



From: lustre-discuss  on behalf of 
E.S. Rosenberg 
Sent: Monday, January 8, 2018 7:05:48 AM
To: Arman Khalatyan
Cc: Lustre discussion
Subject: Re: [lustre-discuss] Are there any performance hits with the 
https://access.redhat.com/security/vulnerabilities/speculativeexecution?

The hit is mainly for things that do context switches (of which IO is the biggest 
source).

On Mon, Jan 8, 2018 at 1:23 PM, Arman Khalatyan 
> wrote:
OK, we did some tests with the new Lustre clients (no patch on the servers).
I can confirm, like Marek: the maximum slowdown is about 40% for rsync with
small files, and lfs find on large folders shows a 45% performance penalty :(
We found terrible performance on the test system with zfs+compression+lustre.
Good news: the impact on compute node flops is about 1% or even none. So only
IO-intensive applications are impacted.

Cheers,
Arman.

On Mon, Jan 8, 2018 at 11:45 AM, Marek Magryś 
> wrote:
> Hi all,
>
>> I wonder if any performance impacts on lustre with the new security
>> patches for the Intel?
>
> According to our initial tests on 3.10.0-693.11.6.el7.x86_64 kernel
> (Centos 7.4) with Lustre 2.10.2, there is a penalty of ca. 10% in nice
> workloads (1MB IO) up to 40% in 4k IOs. Tested with IOR.
>
> It looks bad, however probably we don't need to patch the servers, as
> Lustre lives in kernelspace anyway. Some kind of advisory from Intel
> HPDD would be nice here.
>
> Cheers,
> Marek
>
> --
> Marek Magrys
> ACC Cyfronet AGH-UST
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client in a container

2018-01-03 Thread Patrick Farrell
FWIW, as long as you don’t intend to use any interesting features (quotas, 
etc), 1.8 clients were used with 2.5 servers at ORNL for some time with no ill 
effects on the IO side of things.

I’m not sure how much further that limited compatibility goes, though.

From: Dilger, Andreas <andreas.dil...@intel.com>
Sent: Wednesday, January 3, 2018 4:20:56 AM
To: David Cohen
Cc: Patrick Farrell; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre Client in a container

On Dec 31, 2017, at 01:50, David Cohen <cda...@physics.technion.ac.il> wrote:
>
> Patrick,
> Thanks for your response.
> I am looking for a way to migrate from a 1.8.9 system to 2.10.2, stable enough to 
> run for the several weeks or more that it might take.

Note that there is no longer direct support for upgrading from 1.8 to 2.10.

That said, are you upgrading the filesystem in place, or are you copying the 
data from the 1.8.9 filesystem to the 2.10.2 filesystem?  In the latter case, 
the upgrade compatibility doesn't really matter.  What you need is a client 
that can mount both server versions at the same time.

Unfortunately, no 2.x clients can mount the 1.8.x server filesystem directly, 
so that does limit your options.  There was a time of interoperability with 1.8 
clients being able to mount 2.1-ish servers, but that doesn't really help you.  
You could upgrade the 1.8 servers to 2.1 or later, and then mount both 
filesystems with a 2.5-ish client, or upgrade the servers to 2.5.

Cheers, Andreas

> On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell <p...@cray.com> wrote:
> David,
>
> I have no direct experience trying this, but I would imagine not - Lustre is 
> a kernel module (actually a set of kernel modules), so unless the container 
> tech you're using allows loading multiple different versions of *kernel 
> modules*, this is likely impossible.  My limited understanding of container 
> tech on Linux suggests that this would be impossible, containers allow 
> userspace separation but there is only one kernel/set of modules/drivers.
>
> I don't know of any way to run multiple client versions on the same node.
>
> The other question is *why* do you want to run multiple client versions on 
> one node...?  Clients are usually interoperable across a pretty generous set 
> of server versions.
>
> - Patrick
>
>
> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
> David Cohen <cda...@physics.technion.ac.il>
> Sent: Saturday, December 30, 2017 11:45:15 AM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Lustre Client in a container
>
> Hi,
> Is it possible to run Lustre client in a container?
> The goal is to run two different client version on the same node, can it be 
> done?
>
> David
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client in a container

2017-12-31 Thread Patrick Farrell


Keeping in mind that the choice of NFS means that you don’t have the POSIX 
guarantees provided by Lustre, so simultaneous access to the same files is 
dicey unless it’s only reading.

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Daniel Kobras <kob...@linux.de>
Sent: Sunday, December 31, 2017 12:26:49 PM
To: David Cohen
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre Client in a container

Hi David!

Do you require both systems to be available as native Lustre filesystems on all 
clients? Otherwise, reexporting one of the systems via NFS during the migration 
phase will keep all data available but decouple the version interdependence 
between servers and clients. In this situation, it’s probably the least 
experimental option.
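A sketch of that interim setup: one node keeps mounting the old filesystem as a Lustre client and re-exports it over NFS (hostnames, paths and export options here are purely illustrative):

  # on the re-export node (a client compatible with the old servers):
  mount -t lustre oldmgs@tcp:/oldfs /mnt/oldfs
  echo '/mnt/oldfs 10.0.0.0/24(ro,no_subtree_check,fsid=101)' >> /etc/exports
  exportfs -ra
  # on every other client:
  mount -t nfs reexport-node:/mnt/oldfs /mnt/oldfs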

Kind regards,

Daniel

> On Dec 31, 2017 at 09:50, David Cohen <cda...@physics.technion.ac.il> wrote:
>
> Patrick,
> Thanks for your response.
> I'm looking for a way to migrate from a 1.8.9 system to 2.10.2, stable enough to 
> run for the several weeks or more that it might take.
>
>
> David
>
> On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell <p...@cray.com> wrote:
> David,
>
> I have no direct experience trying this, but I would imagine not - Lustre is 
> a kernel module (actually a set of kernel modules), so unless the container 
> tech you're using allows loading multiple different versions of *kernel 
> modules*, this is likely impossible.  My limited understanding of container 
> tech on Linux suggests that this would be impossible, containers allow 
> userspace separation but there is only one kernel/set of modules/drivers.
>
> I don't know of any way to run multiple client versions on the same node.
>
> The other question is *why* do you want to run multiple client versions on 
> one node...?  Clients are usually interoperable across a pretty generous set 
> of server versions.
>
> - Patrick
>
>
> From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
> David Cohen <cda...@physics.technion.ac.il>
> Sent: Saturday, December 30, 2017 11:45:15 AM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] Lustre Client in a container
>
> Hi,
> Is it possible to run Lustre client in a container?
> The goal is to run two different client versions on the same node; can it be 
> done?
>
> David
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre Client in a container

2017-12-31 Thread Patrick Farrell
Ah, yes, that will likely be a tricky one.  You may simply have to bite the 
bullet, copy what you can early, and accept a downtime to finalize.

Note also that there may be no kernel version for which you could compile both 
of those?  Possibly some version of CentOS 6.

From: David Cohen <cda...@physics.technion.ac.il>
Sent: Sunday, December 31, 2017 2:50:05 AM
To: Patrick Farrell
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre Client in a container

Patrick,
Thanks for your response.
I'm looking for a way to migrate from a 1.8.9 system to 2.10.2, stable enough to 
run for the several weeks or more that it might take.


David

On Sun, Dec 31, 2017 at 12:12 AM, Patrick Farrell 
<p...@cray.com<mailto:p...@cray.com>> wrote:

David,


I have no direct experience trying this, but I would imagine not - Lustre is a 
kernel module (actually a set of kernel modules), so unless the container tech 
you're using allows loading multiple different versions of *kernel modules*, 
this is likely impossible.  My limited understanding of container tech on Linux 
suggests that this would be impossible, containers allow userspace separation 
but there is only one kernel/set of modules/drivers.


I don't know of any way to run multiple client versions on the same node.


The other question is *why* do you want to run multiple client versions on one 
node...?  Clients are usually interoperable across a pretty generous set of 
server versions.


- Patrick




From: lustre-discuss 
<lustre-discuss-boun...@lists.lustre.org<mailto:lustre-discuss-boun...@lists.lustre.org>>
 on behalf of David Cohen 
<cda...@physics.technion.ac.il<mailto:cda...@physics.technion.ac.il>>
Sent: Saturday, December 30, 2017 11:45:15 AM
To: lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Lustre Client in a container

Hi,
Is it possible to run Lustre client in a container?
The goal is to run two different client versions on the same node; can it be 
done?

David


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Designing a new Lustre system

2017-12-20 Thread Patrick Farrell
I won’t try to answer all your questions (I’m not really qualified to opine), 
but a quick one on ZFS:

ZFS today is still much slower for the MDT.  It’s competitive on OSTs, arguably 
better, depending on your needs and hardware.  So a strong choice for a config 
today would be ldiskfs MDTs and ZFS OSTs, I know several places do that.

As for MDS+OSS in one node, probably the main problem you’ll face is memory 
usage.  The MDS and OSSes can both benefit from lots of RAM, depending on your 
workload and configuration.  So it might be hard to provide happily for both.

But combined MDS+OSS is certainly something people have been discussing 
recently, for the reasons you gave.  I don’t know if any real deployments exist 
(there are certainly test setups all over).

- Patrick

From: lustre-discuss 
>
 on behalf of "E.S. Rosenberg" 
>
Date: Wednesday, December 20, 2017 at 10:21 AM
To: "lustre-discuss@lists.lustre.org" 
>
Subject: [lustre-discuss] Designing a new Lustre system

Hi everyone,

We are currently looking into upgrading/replacing our Lustre system with a 
newer system.

I had several ideas I'd like to run by you and also some questions:
1. After my recent experience with failover I wondered is there any reason not 
to set all machines that are within reasonable cable range as potential 
failover nodes so that in the very unlikely event of both machines connected to 
a disk enclosure failing simple recabling + manual mount would still work?

2. I'm trying to decide how to do metadata, on the one hand I would very much 
like/prefer to have a failover pair, on the other hand when I look at the load 
on the MDS it seems like a big waste to have even one machine allocated to this 
exclusively, so I was thinking instead to maybe make all Lustre nodes MDS+OSS, 
this would as I understand potentially provide better metadata performance if 
needed and also allow me to put small files on the MDS and also provide for 
better resilience. Am I correct in these assumptions? Has anyone done something 
similar?

3. An LLNL lecture at Open-ZFS last year seems to strongly suggest using zfs 
over ldiskfs, is this indeed 'the way to go for new systems' or are both still 
fully valid options?

4. One of my colleagues likes Isilon very much, I have not been able to find 
any literature on if/how Lustre compares any pointers/knowledge on the subject 
is very welcome.

Our current system consists of 1 MDS + 3 OSS (15 OSTs) using FDR IB, approx. 
500TB in size, currently running Lustre 2.8, but I hope to upgrade it to 
2.10.x, the cluster it services consists of 72 nodes though we hope that will 
grow more.
A new system would hopefully (budget dependent) be at least 1PB and still be 
servicing the same/expanded cluster.

Thanks,
Eli
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] BAD CHECKSUM

2017-12-07 Thread Patrick Farrell
I would think it's possible if the application is doing direct I/O. This
should be impossible for buffered I/O, since the checksums are all
calculated after the copies into kernel memory (the page cache) are
complete, so it doesn't matter what userspace does to its memory (at
least, it doesn't matter for the checksums).

And I'm not 100% sure it's possible for direct.  I would think it is.
Someone else might be able to weigh in there - but it's definitely not
possible for buffered I/O.


It would be good, as Andreas said, to see the exact message.

One other thought: while the Lustre client might resend correctly, I would
think it extremely likely that unintentionally messing with memory being used
for I/O represents a serious application bug, likely to lead to incorrect
operation.

Regards,
- Patrick

On 12/7/17, 2:36 PM, "lustre-discuss on behalf of Dilger, Andreas"
 wrote:

>On Dec 7, 2017, at 10:37, Hans Henrik Happe  wrote:
>> 
>> Hi,
>> 
>> Can an application cause BAD CHECKSUM errors in Lustre logs by somehow
>> overwriting memory while being DMA'ed to network?
>> 
>> After upgrading to 2.10.1 on the server side we started seeing this from
>> a user's application (MPI I/O). Both 2.9.0 and 2.10.1 clients emit these
>> errors. We have not yet established weather the application is doing
>> things correctly.
>
>If applications are using mmap IO it is possible for the page to become
>inconsistent after the checksum has been computed.  However, mmap IO is
>normally detected by the client and no message should be printed.
>
>There isn't anything that the application needs to do, since the client
>will resend the data if there is a checksum error, but the resends do
>slow down the IO.  If the inconsistency is on the client, there is no
>cause for concern (though it would be good to figure out the root cause).
>
>It would be interesting to see what the exact error message is, since
>that will say whether the data became inconsistent on the client, or over
>the network.  If the inconsistency is over the network or on the server,
>then that may point to hardware issues.
>
>Cheers, Andreas
>--
>Andreas Dilger
>Lustre Principal Architect
>Intel Corporation
>
>
>
>
>
>
>
>___
>lustre-discuss mailing list
>lustre-discuss@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre and Elasticsearch

2017-11-26 Thread Patrick Farrell
They more or less don't.  They only come into play for applications that 
explicitly ask for them, and the implementation is fast and efficient (it's tied 
into the standard Lustre locking mechanisms).


- Patrick


From: lustre-discuss  on behalf of 
E.S. Rosenberg 
Sent: Sunday, November 26, 2017 1:03:39 PM
To: Torsten Harenberg
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre and Elasticsearch

Hi Torsten,
Thanks that worked!

Do you or anyone on the list know if/how flock affects Lustre performance?

Thanks again,
Eli

On Tue, Nov 21, 2017 at 9:18 AM, Torsten Harenberg 
> wrote:
Hi Eli,

Am 21.11.17 um 01:26 schrieb E.S. Rosenberg:
> So I was wondering would this issue be solved by Lustre bindings for
> Java or is this a way of locking that isn't supported by Lustre?

I know nothing about Elastic Search, but have you tried to mount Lustre
with "flock" in the mount options?

Cheers

 Torsten

--
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> Dr. Torsten Harenberg     torsten.harenb...@cern.ch          <>
<> Bergische Universitaet                                       <>
<> Fakultät 4 - Physik       Tel.: +49 (0)202 439-3521          <>
<> Gaussstr. 20              Fax : +49 (0)202 439-2811          <>
<> 42097 Wuppertal           @CERN: Bat. 1-1-049                <>
<><><><><><><>< Of course it runs NetBSD http://www.netbsd.org ><>

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-24 Thread Patrick Farrell
It can be pretty easily inferred from the nature of the feature.


If a decent policy is written and applied to all files (starting with few 
stripes and going to many as size increases), then it will resolve the problem 
of large files on single OSTs.  If the policy is not universally applied or is 
poorly constructed, you may have issues.
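
For what it's worth, a default PFL layout set on a top-level directory might 
look something like this (the extent boundaries and stripe counts are only 
illustrative):

lfs setstripe -E 256M -c 1 -E 4G -c 4 -E -1 -c -1 /mnt/lustre

Files then start on a single OST, widen to 4 stripes for the part beyond 256M, 
and stripe across all OSTs beyond 4G, so no single OST has to hold a whole 
huge file.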


Otherwise, as long as users are not restricting file creation to a single or 
small # of OSTs, then there's not really any way for them to fill up a single 
OST without filling up all of them.


- Patrick


From: lustre-discuss  on behalf of 
Mark Hahn 
Sent: Tuesday, October 24, 2017 3:21:47 PM
To: Lustre Discuss
Subject: Re: [lustre-discuss] ZFS-OST layout, number of OSTs

> It's also worth noting that if you have small OSTs it's much easier to bump
>into a full OST situation.   And specifically, if you singly stripe a file
>the file size is limited by the size of the OST.

is there enough real-life experience to know whether
progressive file layout will mitigate this issue?

thanks,
Mark Hahn | SHARCnet Sysadmin | h...@sharcnet.ca | http://www.sharcnet.ca
   | McMaster RHPCS| h...@mcmaster.ca | 905 525 9140 x24687
   | Compute/Calcul Canada| http://www.computecanada.ca
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] FW: Lustre 2.10.1 released

2017-10-24 Thread Patrick Farrell
Peter,


Not mine - Elis.  (I knew that one. :) )


- Patrick


From: Jones, Peter A <peter.a.jo...@intel.com>
Sent: Tuesday, October 24, 2017 11:43:30 AM
To: E.S. Rosenberg; Patrick Farrell
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] FW: Lustre 2.10.1 released

I missed Patrick’s side question – this is a routine agenda item on LWG calls 
and there are minutes summarizing the discussion for those who are unable to 
attend live.



As a side question:
Is there any place where I can follow the status of vanilla kernel Lustre?

- Patrick


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] ZFS-OST layout, number of OSTs

2017-10-22 Thread Patrick Farrell
Thomas,

This is likely a reflection of an older issue, since resolved.  For a long 
time, Lustre reserved max_rpcs_in_flight*max_pages_per_rpc for each OST (on the 
client).  This was a huge memory commitment in larger setups, but was resolved 
a few versions back, and now per OST memory usage on the client is pretty 
trivial when the client isn’t doing I/o to that OST.  The main arguments 
against large OST counts are probably the pain of managing larger numbers of 
them, and individual OSTs being slow (because they use fewer disks), requiring 
users to stripe files more widely to see the benefit.  This is both an 
administrative burden for users and uses more space on the metadata server to 
track the file layouts.

But if your MDT is large and your users amenable to thinking about that (or you 
set a good default striping policy - progressive file layouts from 2.10 are 
wonderful for this), then it’s probably fine.  The largest OST counts I am 
aware of are in the low thousands.

Ah, one more thing - clients must ping every OST periodically if they haven’t 
otherwise contacted it within the required interval.  This can contribute to 
network traffic and CPU noise/jitter on the clients.  I don’t have a good sense 
of how serious this is in practice, but I know some larger sites worry about it.

- Patrick



From: lustre-discuss  on behalf of 
Thomas Roth 
Sent: Sunday, October 22, 2017 9:04:35 AM
To: Lustre Discuss
Subject: [lustre-discuss] ZFS-OST layout, number of OSTs

Hi all,

I have done some "fio" benchmarking, amongst other things to test the 
proposition that to get more iops, the number of disks per raidz should be less.
I was happy I could reproduce that: one server with 30 disks in one raidz2 
(=one zpool = one OST) is indeed slower than one with 30 disks in three
raidz2 (one zpool, one OST).
I ran fio also on a third server where the 30 disks make up 3 raidz2 = 3 zpools 
= 3 OSTs, that one is faster still.

Now I seem to remember a warning not to have too many OSTs in one Lustre, 
because each OST eats some memory on the client. I haven't found that
reference, and I would like to ask what the critical numbers might be? How much 
RAM are we talking about? Is there any other "wise" limit on the OST
number?
Currently our clients are equipped with 128 or 256 GB RAM.  We have 550 OSTs in 
the system, but the next cluster could easily grow much larger here if
we stick to the small OSTs.

Regards,
Thomas
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Acceptable thresholds

2017-10-19 Thread Patrick Farrell
Several processes per CPU core, probably?  It’s a lot.

But there’s a lot of environmental and configuration dependence here too.

Why not look at how many you have running currently when Lustre is set up and 
set the limit to double that?  Watching process count isn’t a good way to 
measure load anyway - it’s probably only good for watching for a fork-bomb type 
thing, where process count goes runaway.  So why not configure to catch that 
and otherwise don’t worry about it?

- Patrick

From: lustre-discuss 
>
 on behalf of "E.S. Rosenberg" 
>
Date: Thursday, October 19, 2017 at 2:20 PM
To: "lustre-discuss@lists.lustre.org" 
>
Subject: [lustre-discuss] Acceptable thresholds

Hi,
This question is I guess not truly answerable because it is probably very 
specific for each environment etc. but I am still going to ask it to get a 
general idea.

We started testing monitoring using Zabbix; its default 'too many processes' 
threshold is not very high, so I already raised it to 1024 but the Lustre 
servers are still well over even that count.

So what is a 'normal' process count for Lustre servers?
Should I assume X processes per client? What is X?

Thanks,
Eli
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre clients and Lustre servers (MDS/OSS) operating system requirements?

2017-10-10 Thread Patrick Farrell
Amjad,

To answer your question more directly…  Operating system differences between 
client and server don’t matter – it’s very common to deploy clients and servers 
using different kernels and/or different distributions.

However, Lustre versions do matter.  There is probably no client version that 
would build for RHEL 7 that would also work with server 1.8.7, which I believe 
is the last server to support RHEL 5.x.  So, because of which kernels are 
supported by which Lustre versions, this is probably impossible in practice.

You really, really should just upgrade the servers.  RHEL 5 is incredibly out 
of date, and 1.8.7 is quite old as well.

- Patrick

From: lustre-discuss 
>
 on behalf of Brian Andrus >
Date: Sunday, October 8, 2017 at 5:18 PM
To: "lustre-discuss@lists.lustre.org" 
>
Subject: Re: [lustre-discuss] Lustre clients and Lustre servers (MDS/OSS) 
operating system requirements?


Hmm. Well, I used to take care of the cluster at the Naval Postgraduate School, 
and I would always upgrade both. When we went to 7, it was on the Lustre 
servers first (the nodes were on 6).

The only thing I could guess is a FUD factor (Fear-Uncertainty-Doubt). We did 
several kernel upgrades without an issue.
I expect the concern is about data loss, but the path has been traveled by many 
very large instances, so with a proper plan, there is no worry.

Brian Andrus


On 10/8/2017 4:42 AM, Jones, Peter A wrote:
Hmm. Perhaps one of the end user sites will speak up with something that they 
have tried successfully but I cannot think of anyone that I have heard running 
RHEL 5.x on the Lustre servers but RHEL 7.x on the Lustre clients.  Why is 
there a reluctance to upgrade the OS on the storage servers? Is it older 
storage that is not supported with newer versions of RHEL?

On 2017-10-07, 10:59 PM, "lustre-discuss on behalf of Amjad Syed" 

 on behalf of amjad...@gmail.com> wrote:

Hello,
We have an existing HPC running Lustre 1.8.7 on RHEL 5.4
The Lustre servers (MDS and OSS) are all running RHEL 5.4
The Lustre clients(HPC compute nodes)  are also running RHEL 5.4
Now the management has decided to upgrade the compute nodes to RHEL 7.
But they do not want to upgrade the OS of Lustre Servers which is still RHEL 5.4
So the question is will this configuration work where the Lustre clients are 
RHEL 7 and the Lustre servers are all running RHEL 5.4?

Thanks.



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.orghttp://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] FW: Lustre 2.10.1 released

2017-10-07 Thread Patrick Farrell
Michael,

In general, yes, and if the kernel versions match, then almost certainly 
specifically too.  I’ve certainly done it on a Fedora box in the past.

- Patrick



From: lustre-discuss  on behalf of 
Michael Watters 
Sent: Saturday, October 7, 2017 8:50:12 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] FW: Lustre 2.10.1 released



On 10/3/17 3:30 AM, Jones, Peter A wrote:


On 2017-10-02, 11:16 PM, "Jones, Peter A" 
> wrote:



  *   4.9 Kernel support for Lustre client 
(LU-9183)

Does this mean that lustre 2.10 will build on Fedora hosts?  I'd love to be 
able to natively mount Lustre file systems on my workstation without having to 
use NFS.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load

2017-09-27 Thread Patrick Farrell
A guess for you to consider:


A very common cause of ksoftirqd load is a hypervisor putting memory pressure 
on a VM.  At least VMWare, and I think KVM and others, use IRQs to implement 
some of their memory management and it can show up like this.


That would of course mean it's not really the ptlrpc module, I'm not sure how 
carefully you verified that it is causing this.  (Obviously your 'remove it, 
check, add it, check' method is sound, but if you just checked once or twice, 
you may have been wrong through bad luck or you could've been right at your 
limit of available memory.)


From: lustre-discuss  on behalf of 
Dilger, Andreas 
Sent: Wednesday, September 27, 2017 11:50:03 AM
To: Hans Henrik Happe
Cc: Shehata, Amir; lustre-discuss; Olaf Weber
Subject: Re: [lustre-discuss] 2.10.0 CentOS6.9 ksoftirqd CPU load

On Sep 26, 2017, at 01:10, Hans Henrik Happe  wrote:
>
> Hi,
>
> Did anyone else experience CPU load from ksoftirqd after 'modprobe
> lustre'? On an otherwise idle node I see:
>
>  PID USER  PR   NI VIRT  RES  SHR S %CPU  %MEM TIME+   COMMAND
>9 root  20   0 000 S 28.5  0.0  2:05.58 ksoftirqd/1
>
>
>   57 root  20   0 000 R 23.9  0.0  2:22.91 ksoftirqd/13
>
> The sum of those two is about 50% CPU.
>
> I have narrowed it down to the ptlrpc module. When I remove that, it stops.
>
> I also tested the 2.10.1-RC1, which is the same.

If you can run "echo l > /proc/sysrq-trigger" it will report the processes
that are currently running on the CPUs of your system to the console (and
also /var/log/messages, if it can write everything in time).

You might need to do this several times to get a representative sample of
the ksoftirqd process stacks to see what they are doing that is consuming
so much CPU.

Alternately, "echo t > /proc/sysrq-trigger" will report the stacks of all
processes to the console (and /v/l/m), but there will be a lot of them,
and no better chance that it catches what ksoftirqd is doing 25% of the time.
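
For instance (as root, with sysrq enabled), something like

for i in $(seq 10); do echo l > /proc/sysrq-trigger; sleep 1; done
dmesg | grep -A 20 ksoftirqd

is a crude way to grab a handful of samples; reading the full dump in
/var/log/messages is probably easier.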

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Running IBM Power boxes as OSSs?

2017-09-01 Thread Patrick Farrell
While I can't speak to Intels intentions, I can say this:

Lustre clients work well on Power architectures (see, for example, LLNL), so a 
lot of the work is done and there shouldn't be any endianness issues in the 
network part of the code.

I suspect the server code would also build for such an architecture, because 
why not?  If it didn't, it would likely be minimal effort to make it do so.

HOWEVER, the server code has other internal endianness dependencies because it 
works with on disk data formats.  I know a few years ago there were some 
patches from people who found endianness issues there (failure to swab to 
correct on disk endianness on either write or read or both), but my 
understanding is they were from inspection (rather than real use experience) 
and I have no idea if they found everything.

So my fear - which others might be able to speak better to - would be those 
subtler things that work fine if you only use one endianness in production.  
Not basic build & run functionality.

Best of luck!



From: lustre-discuss  on behalf of 
Andrew Holway 
Sent: Friday, September 1, 2017 6:48:58 AM
To: Daniel Kidger
Cc: lustre-discuss
Subject: Re: [lustre-discuss] Running IBM Power boxes as OSSs?

my 0.02¢

This question is quite interesting considering Big Blue offers the competing 
GPFS filesystem on Power.  I have it on fairly good authority that Intel bought 
Whamcloud in order to compete with IBM for future very large (exascale) 
supercomputer installations. Power architecture is seemingly quite formidable 
in the supercomputing space so having a combined filesystems and processor 
architecture is very important for intel if they want to compete in the HPC 
space.

I doubt that Intel, as the current guardians of Lustre would allow any serious 
work on supporting a Power power port. I guess this would be a bit of an own 
goal!



On 1 September 2017 at 12:24, Daniel Kidger 
> wrote:
Hi.

This is my first posting to the list.
I have worked off and on with Lustre since helping set up a demo at SC02 in 
Baltimore.
A long time has passed and I now find myself at IBM.
The question I have today is:

Are any sites running with IBM POWER hardware for their Lustre servers i.e. MDS 
and OSSs?
The only references I find are very old, certainly long before the availability 
of little-endian RedHat.

And if not, what are likely to be the pain points and hurdles in building and 
running Lustre on non-x86 platforms like POWER?

Daniel

Daniel Kidger
IBM Systems, UK
daniel.kid...@uk.ibm.com
+44 (0)7818 522266
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Bandwidth bottleneck at socket?

2017-08-30 Thread Patrick Farrell
Brian,


Hm.  At least from what you said, I see no reason to implicate *sockets* rather 
than *clients*.  (And there are, in general, no socket level issues you should 
expect.  The bandwidth in and out of a socket generally dwarfs available 
network bandwidth.  There are occasionally some NUMA issues, but they shouldn't 
come up with simple i/o like this.)


Best of luck - I bet the OSC tuning will help.


- Patrick


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
Brian Andrus <toomuc...@gmail.com>
Sent: Wednesday, August 30, 2017 11:39:48 AM
To: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Bandwidth bottleneck at socket?


Patrick,


By socket-level, I am referring to a physical socket. It seems that increasing 
the number of cores for an mpirun or ior doesn't increase total throughput 
unless it is adding another physical socket.

I'm pretty sure the network and OSTs can handle the traffic. I have tested the 
network to 40Gb/s with iperf and the OSTs are all NVMe

I have used 1, 2 and 3 clients by using an mpi-io copy program. It will read 
from one file on lustre and write it to another, with each worker reading in 
its portion of the file.


Hmm. I shall try doing multiple copies at the same time to see what happens. 
That, I hadn't tested.

We are using Lustre 2.10.51-1 under CentOS 7 kernel 3.10.0-514.26.2

Brian

On 8/30/2017 9:32 AM, Patrick Farrell wrote:

Brian,


I'm not sure what you mean by "socket level".


A starter question:
How fast are your OSTs?  Are you sure the limit isn't the OST?  (Easy way to 
test - Multiple files on that OST from multiple clients, see how that performs)

(lfs setstripe -i [index] to set the OST for a singly striped file)


In general, you can get ~1.3-1.8 GB/s from one process to one file with a 
recent-ish Xeon, if your OSTs and network can handle it.  There are a number of 
other factors that can get involved in limiting your bandwidth with multiple 
threads.


It sounds like you're always (in the numbers you report) using one client at a 
time.  Is that correct?


I suspect that you're limited in bandwidth to a specific OST, either by the OST 
or by the client settings.  What's your bandwidth limit from one client to 
multiple files on the same OST?  Is it that same 1.5 GB/s?


If so (or even if it's close), you may need to increase your client's RPC size 
(max_pages_per_rpc in /proc/fs/lustre/osc/[OST]/), or max_rpcs_in_flight (same 
place).  Note if you increase those you need to increase max_dirty_mb (again, 
same place).  The manual describes the relationship.


Also - What version of Lustre are you running?  Client & server.


- Patrick


From: lustre-discuss 
<lustre-discuss-boun...@lists.lustre.org><mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of Brian Andrus <toomuc...@gmail.com><mailto:toomuc...@gmail.com>
Sent: Wednesday, August 30, 2017 11:16:08 AM
To: lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] Bandwidth bottleneck at socket?

All,

I've been doing some various performance tests on a small lustre
filesystem and there seems to be a consistent bottleneck of ~700MB/s per
socket involved.

We have 6 servers with 2 Intel E5-2695 chips in each.

3 servers are clients, 1 is MGS and 2 are OSSes with 1 OST each.
Everything is connected with 40Gb Ethernet.

When I write to a single stripe, the best throughput I see is about
1.5GB/s. That doubles if I write to a file that has 2 stripes.

If I do a parallel copy (using mpiio) I can get 1.5GB/s from a single
machine, whether I use 28 cores or 2 cores. If I only use 1, it goes
down to ~700MB/s

Is there a bandwidth bottleneck that can occur at the socket level for a
system? This really seems like it.


Brian Andrus

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Bandwidth bottleneck at socket?

2017-08-30 Thread Patrick Farrell
Brian,


I'm not sure what you mean by "socket level".


A starter question:
How fast are your OSTs?  Are you sure the limit isn't the OST?  (Easy way to 
test - Multiple files on that OST from multiple clients, see how that performs)

(lfs setstripe -i [index] to set the OST for a singly striped file)


In general, you can get ~1.3-1.8 GB/s from one process to one file with a 
recent-ish Xeon, if your OSTs and network can handle it.  There are a number of 
other factors that can get involved in limiting your bandwidth with multiple 
threads.


It sounds like you're always (in the numbers you report) using one client at a 
time.  Is that correct?


I suspect that you're limited in bandwidth to a specific OST, either by the OST 
or by the client settings.  What's your bandwidth limit from one client to 
multiple files on the same OST?  Is it that same 1.5 GB/s?


If so (or even if it's close), you may need to increase your client's RPC size 
(max_pages_per_rpc in /proc/fs/lustre/osc/[OST]/), or max_rpcs_in_flight (same 
place).  Note if you increase those you need to increase max_dirty_mb (again, 
same place).  The manual describes the relationship.
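
As a rough sketch on the client side (the values are only examples, they are 
not persistent across remounts, and RPCs larger than 1MB also need server-side 
support):

lctl set_param osc.*.max_pages_per_rpc=1024
lctl set_param osc.*.max_rpcs_in_flight=16
lctl set_param osc.*.max_dirty_mb=512

lctl get_param osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight osc.*.max_dirty_mb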


Also - What version of Lustre are you running?  Client & server.


- Patrick


From: lustre-discuss  on behalf of 
Brian Andrus 
Sent: Wednesday, August 30, 2017 11:16:08 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Bandwidth bottleneck at socket?

All,

I've been doing some various performance tests on a small lustre
filesystem and there seems to be a consistent bottleneck of ~700MB/s per
socket involved.

We have 6 servers with 2 Intel E5-2695 chips in each.

3 servers are clients, 1 is MGS and 2 are OSSes with 1 OST each.
Everything is connected with 40Gb Ethernet.

When I write to a single stripe, the best throughput I see is about
1.5GB/s. That doubles if I write to a file that has 2 stripes.

If I do a parallel copy (using mpiio) I can get 1.5GB/s from a single
machine, whether I use 28 cores or 2 cores. If I only use 1, it goes
down to ~700MB/s

Is there a bandwidth bottleneck that can occur at the socket level for a
system? This really seems like it.


Brian Andrus

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Best way to run serverside 2.8 w. MOFED 4.1 on Centos 7.2

2017-08-18 Thread Patrick Farrell

I would strongly suggest make -j something for parallelism, unless you want to 
have time to go out for your coffee.


From: lustre-discuss  on behalf of 
Christopher Johnston 
Sent: Friday, August 18, 2017 3:45:39 PM
To: Jeff Johnson
Cc: lustre-discuss
Subject: Re: [lustre-discuss] Best way to run serverside 2.8 w. MOFED 4.1 on 
Centos 7.2

Get coffee somewhere in between 

On Aug 18, 2017 1:08 PM, "Jeff Johnson" 
> wrote:
John,

You can rebuild 2.8 against MOFED. 1) Install MOFED version of choice. 2) Pull 
down the 2.8 Lustre source and configure with 
'--with-o2ib=/usr/src/ofa_kernel/default'. 3) `make rpms` 4) Install. 5) Profit.
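
Roughly, in shell form (the repo URL and tag name are from memory, so 
double-check them):

git clone git://git.whamcloud.com/fs/lustre-release.git
cd lustre-release
git checkout v2_8_0          # tag name may differ
sh autogen.sh
./configure --with-o2ib=/usr/src/ofa_kernel/default
make rpms                    # then install the resulting RPMs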

--Jeff

On Fri, Aug 18, 2017 at 9:41 AM, john casu 
> wrote:
I have an existing 2.8 install that broke when we added MOFED into the mix.

Nothing I do wrt installing 2.8 rpms works to fix this, and I get a couple of 
missing symbols when I install lustre-modules:
depmod: WARNING: 
/lib/modules/3.10.0-327.3.1.el7_lustre.x86_64/extra/kernel/net/lustre/ko2iblnd.ko
 needs unknown symbol ib_query_device
depmod: WARNING: 
/lib/modules/3.10.0-327.3.1.el7_lustre.x86_64/extra/kernel/net/lustre/ko2iblnd.ko
 needs unknown symbol ib_alloc_pd

I'm assuming the issue is that lustre 2.8 is built using the standard Centos 
7.2 infiniband drivers.

I can't move to Centos 7.3, at this time.  Is there any way to get 2.8 up & 
running w. mofed without rebuilding lustre rpms?

If I have to rebuild, it'd probably be easier to go to 2.10 (and zfs 0.7.1). Is 
that a correct assumption?
Or will the 2.10 rpms work on CentOS 7.2?

thanks,
-john c
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



--
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 
858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite D - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] nodes crash during ior test

2017-08-04 Thread Patrick Farrell
Brian,


What is the actual crash?  Null pointer, failed assertion/LBUG...?  Probably 
just a few more lines back in the log would show that.


Also, Lustre 2.10 has been released, you might benefit from switching to that.  
There are almost certainly more bugs in this pre-2.10 development version 
you're running than in the release.


- Patrick


From: lustre-discuss  on behalf of 
Brian Andrus 
Sent: Friday, August 4, 2017 12:12:59 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] nodes crash during ior test

All,

I am trying to run some ior benchmarking on a small system.

It only has 2 OSSes.
I have been having some trouble where one of the clients will reboot and
do a crash dump somewhat arbitrarily. The runs will work most of the
time, but every 5 or so times, a client reboots and it is not always the
same client.

The call trace seems to point to lnet:


72095.973865] Call Trace:
[72095.973892]  [] ? cfs_percpt_unlock+0x36/0xc0 [libcfs]
[72095.973936]  []
lnet_return_tx_credits_locked+0x211/0x480 [lnet]
[72095.973973]  [] lnet_msg_decommit+0xd0/0x6c0 [lnet]
[72095.974006]  [] lnet_finalize+0x1e9/0x690 [lnet]
[72095.974037]  [] ksocknal_tx_done+0x85/0x1c0 [ksocklnd]
[72095.974068]  [] ksocknal_handle_zcack+0x137/0x1e0
[ksocklnd]
[72095.974101]  []
ksocknal_process_receive+0x3a1/0xd90 [ksocklnd]
[72095.974134]  [] ksocknal_scheduler+0xee/0x670
[ksocklnd]
[72095.974165]  [] ? wake_up_atomic_t+0x30/0x30
[72095.974193]  [] ? ksocknal_recv+0x2a0/0x2a0 [ksocklnd]
[72095.974222]  [] kthread+0xcf/0xe0
[72095.974244]  [] ? kthread_create_on_node+0x140/0x140
[72095.974272]  [] ret_from_fork+0x58/0x90
[72095.974296]  [] ? kthread_create_on_node+0x140/0x140

I am currently using lustre 2.9.59_15_g107b2cb built for kmod

Is there something I can do to track this down and hopefully remedy it?

Brian Andrus

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Problem with raising osc.*.max_rpcs_in_flight

2017-07-03 Thread Patrick Farrell
It definitely is limited to 32 buckets.  We've toyed with raising that limit 
(and Cray did so internally), but it does use some memory, etc.


So that's almost certainly the issue you're seeing, Reinoud.  Counts beyond 
the last bucket are simply lumped into the last bucket.
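
If you want to double-check what actually got applied on the client, something 
along these lines:

lctl get_param osc.*.max_rpcs_in_flight
cat /sys/module/ko2iblnd/parameters/peer_credits
cat /sys/module/ko2iblnd/parameters/peer_credits_hiw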


- Patrick


From: lustre-discuss  on behalf of 
Dilger, Andreas 
Sent: Sunday, July 2, 2017 3:45:03 AM
To: Reinoud Bokhorst
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Problem with raising osc.*.max_rpcs_in_flight

It may also be that this histogram is limited to 32 buckets?

Cheers, Andreas

> On Jun 30, 2017, at 03:03, Reinoud Bokhorst  wrote:
>
> Hi all,
>
> I have a problem with raising the osc.*.max_rpcs_in_flight client
> setting on our Lustre 2.7.0. I am trying the increase the setting from
> 32 to 64 but according to osc.*.rpc_stats it isn't being used. The
> statistics still stop at 31 rpcs with high write request numbers, e.g.
>
>                        read                      write
> rpcs in flight   rpcs   %   cum %   |   rpcs   %   cum %
> 0:                  0   0       0   |      0   0       0
> 1:               7293  38      38   |   2231  16      16
> 2:               3872  20      59   |   1196   8      25
> 3:               1851   9      69   |    935   6      31
> --SNIP--
> 28:                 0   0     100   |     89   0      87
> 29:                 0   0     100   |     90   0      87
> 30:                 0   0     100   |     94   0      88
> 31:                 0   0     100   |   1573  11     100
>
> I have modified some ko2iblnd driver parameters in an attempt to get it
> working:
>
> options ko2iblnd peer_credits=128 peer_credits_hiw=128 credits=2048
> concurrent_sends=256 ntx=2048 map_on_demand=32 fmr_pool_size=2048
> fmr_flush_trigger=512 fmr_cache=1
>
> Specifically I raised peer_credits_hiw to 128 as I've understood that it
> must be twice the value of max_rpcs_in_flight. Checking the module
> parameters that were actually loaded, I noticed that it was set to 127.
> So apparently it must be smaller than peer_credits. After noticing this
> I tried setting max_rpcs_in_flight to 60 but that didn't help either.
> Are there any other parameters affecting the max rpcs? Do all settings
> have to be powers of 2?
>
> Related question; documentation on the driver parameters and how it all
> hangs together is rather scarce on the internet. Does anyone have some
> good pointers?
>
> Thanks,
> Reinoud Bokhorst
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Per-client I/O Operation Counters

2017-06-01 Thread Patrick Farrell
rpc_stats on the clients may be helpful here, as a first step.  They are found 
in /proc/fs/lustre/osc/[ost ]/rpc_stats on the client.  Contents should be 
mostly self-explanatory.  Look for lots of small RPCs.
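
For example, on a suspect client (any write to rpc_stats resets the counters, 
which makes the histogram easier to read):

lctl set_param osc.*.rpc_stats=0
# ... let the workload run for a while ...
lctl get_param osc.*.rpc_stats | less

The "pages per rpc" histogram is the interesting part - a large fraction of 
1-4 page RPCs usually means lots of small or poorly aligned I/O.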


- Patrick


From: lustre-discuss  on behalf of 
Russell Dekema 
Sent: Thursday, June 1, 2017 6:34:59 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Per-client I/O Operation Counters

Greetings,

Is there a way, either on the Lustre clients or (preferably) OSSes, to
determine how many I/O operations each Lustre client is performing
against the filesystem?

I know several ways of finding the number of *bytes* read or written
by a client (or even on a per-job basis with job_stats), but we
suspect we have some clients overwhelming our filesystem with large
numbers of small I/O requests, and I don't know how to find per-client
(or per-job) I/O operation counters.

This is a Lustre 2.5 system.

Sincerely,
Rusty Dekema
University of Michigan
Advanced Research Computing - Technical Services
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] client complains about server version

2017-05-06 Thread Patrick Farrell
Yes, I believe so.  Andreas or someone else from Intel may be able to confirm 
the most recent client that got interoperability testing with the 2.4 server 
from Intel.  It may have been only 2.5; 2.4 was not a maintenance release.  
(2.5 was, and so got testing with at least 2.7 clients, maybe 2.8 as well?)


- Patrick


From: Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
Sent: Saturday, May 6, 2017 10:03:17 PM
To: Patrick Farrell; lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] client complains about server version

thanks for the hints.
Is Lustre 2.8.0 client vs server 2.4.1 untested as well ?

On 5/6/17 7:36 PM, Patrick Farrell wrote:

Riccardo,

You may be unable to free space on the OSTs when deleting files.  I can't 
remember if 2.4 has the required support for delete-from-MDS (not the real 
feature name, sorry).  I think it does, but I'm not sure.  It's easy to check - 
just delete a large file and see if the space shows up as free.

Other than that, your main concern is that no one has tested this or fixed any 
bugs that came up. I'm not aware of anything that should be broken, again 
except possibly for that thing with delete I mentioned, and obviously, lots of 
client features will not work. But they should be disabled automatically 
because the server doesn't support them.

So you might be OK, but there's a very real chance those clients may crash the 
servers or the other way around, as the combination is not supported and has 
not been tested.

I strongly suggest you upgrade your servers!  There are lots of handy new 
features, and you would avoid this problem entirely.

- Patrick

From: lustre-discuss 
<lustre-discuss-boun...@lists.lustre.org><mailto:lustre-discuss-boun...@lists.lustre.org>
 on behalf of Riccardo Veraldi 
<riccardo.vera...@cnaf.infn.it><mailto:riccardo.vera...@cnaf.infn.it>
Sent: Saturday, May 6, 2017 5:25:40 PM
To: lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] client complains about server version

Hello,

I moved many of my lustre clients to 2.9.0. Anyway the server version is
pretty old (2.4)

Do I have to worry ?

Things seems working though

May  4 14:37:48 psana1620 kernel: [   43.145108] Lustre: Server MGS
version (2.4.1.0) is much older than client. Consider upgrading server
(2.9.0)


thank you

Ricl


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org<mailto:lustre-discuss@lists.lustre.org>
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] client complains about server version

2017-05-06 Thread Patrick Farrell

Riccardo,

You may be unable to free space on the OSTs when deleting files.  I can't 
remember if 2.4 has the required support for delete-from-MDS (not the real 
feature name, sorry).  I think it does, but I'm not sure.  It's easy to check - 
just delete a large file and see if the space shows up as free.
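
Something along these lines from one of the 2.9 clients (paths and sizes are 
just examples):

lfs df /mnt/lustre                                  # note OST usage
dd if=/dev/zero of=/mnt/lustre/bigfile bs=1M count=10240
rm /mnt/lustre/bigfile
sleep 60                                            # give the OSTs a moment
lfs df /mnt/lustre                                  # usage should drop again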

Other than that, your main concern is that no one has tested this or fixed any 
bugs that came up. I'm not aware of anything that should be broken, again 
except possibly for that thing with delete I mentioned, and obviously, lots of 
client features will not work. But they should be disabled automatically 
because the server doesn't support them.

So you might be OK, but there's a very real chance those clients may crash the 
servers or the other way around, as the combination is not supported and has 
not been tested.

I strongly suggest you upgrade your servers!  There are lots of handy new 
features, and you would avoid this problem entirely.

- Patrick

From: lustre-discuss  on behalf of 
Riccardo Veraldi 
Sent: Saturday, May 6, 2017 5:25:40 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] client complains about server version

Hello,

I moved many of my lustre clients to 2.9.0. Anyway the server version is
pretty old (2.4)

Do I have to worry ?

Things seems working though

May  4 14:37:48 psana1620 kernel: [   43.145108] Lustre: Server MGS
version (2.4.1.0) is much older than client. Consider upgrading server
(2.9.0)


thank you

Ricl


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.8.0 - MDT/MGT failing to mount

2017-05-04 Thread Patrick Farrell
Hm, I'm not sure everyone here is talking about the same ordering...


As I understand it:

The writeconf process is to unmount everything, then writeconf all your targets 
(order doesn't matter, pretty sure - Someone will correct me if not...), then 
mount in the order Colin gave - MGS/MDT, then OSTs, then clients.
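
Roughly, for ldiskfs targets (device names are placeholders; the writeconf 
section of the manual is the authoritative reference):

umount ...                                  # all clients, OSTs, MDT/MGT
tunefs.lustre --writeconf /dev/mdtdev       # on the MDS/MGS node
tunefs.lustre --writeconf /dev/ostXdev      # on each OSS, for every OST
mount -t lustre ...                         # MGS/MDT first, then OSTs, then clients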


- Patrick


From: lustre-discuss  on behalf of 
Colin Faber 
Sent: Thursday, May 4, 2017 10:51:06 AM
To: Mohr Jr, Richard Frank (Rick Mohr)
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre 2.8.0 - MDT/MGT failing to mount

Hi,

Yes MGS/MDT as well as OSTs, Remount MGS/MDT, then OSTs, then clients.

-cf


On Thu, May 4, 2017 at 9:24 AM, Mohr Jr, Richard Frank (Rick Mohr) 
> wrote:

> On May 4, 2017, at 11:03 AM, Steve Barnet 
> > wrote:
>
> On 5/4/17 10:01 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
>> Did you try doing a writeconf to regenerate the config logs for the file 
>> system?
>
>
> Not yet, but quick enough to try. Do this for the MDT/MGT first,
> then the OSTs?
>

I believe that is correct, but you should check the Lustre manual to be certain 
of the procedure.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.nics.tennessee.edu=DwICAg=IGDlg0lD0b-nebmJJ0Kp8A=x9pM59OqndbWw-lPPdr8w1Vud29EZigcxcNkz0uw5oQ=zNWOKVoBbMeg1KtWlyO1oNprX_1JpEc6vKU6dgcqmQM=qi1QBsAhh_VrYyASzVltuIBDt9VlL4wsIIVnIA9vdGE=

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.lustre.org_listinfo.cgi_lustre-2Ddiscuss-2Dlustre.org=DwICAg=IGDlg0lD0b-nebmJJ0Kp8A=x9pM59OqndbWw-lPPdr8w1Vud29EZigcxcNkz0uw5oQ=zNWOKVoBbMeg1KtWlyO1oNprX_1JpEc6vKU6dgcqmQM=N5BBW4WTyDBCfEfkgHB9_iQh_kEA5QzKPGTMZbbub5o=

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] operation ldlm_queue failed with -11

2017-05-03 Thread Patrick Farrell
Rick, Lydia,


That reasoning is sound, but this is a special case.  -11 (-EAGAIN) on 
ldlm_enqueue is generally OK...


LU-8658 explains the situation (it's POSIX flocks), so I'm going to reference 
that rather than repeat it here.


https://jira.hpdd.intel.com/browse/LU-8658


- Patrick


From: lustre-discuss  on behalf of 
Mohr Jr, Richard Frank (Rick Mohr) 
Sent: Wednesday, May 3, 2017 11:07:53 AM
To: Lydia Heck
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] operation ldlm_queue failed with -11

I think that -11 is EAGAIN, but I don’t know how to interpret what that means 
in the context of Lustre locking.  I assume these messages are from the clients 
and the changing “x” portion is just the fact that each client has a 
different identifier.  So if you have multiple clients complaining about errors 
to the same MDS server, then my first guess would be that there is something wrong 
on the server side of things.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


> On May 2, 2017, at 4:52 AM, Lydia Heck  wrote:
>
>
> Dear all,
>
> we get many entries in our logs of the type
>
>
> kernel: LustreError: 11-0: scratch-MDT-mdc-xx: Communicating 
> with 172.17.xxx.yyy@o2ib, operation ldlm_enqueue failed with -11
>
> with the -xx changing
>
> but to the same MDS system?
>
> I have looked on the internet, but fail to find this error. There is very 
> little info on ldlm_enqueue messages.
>
> Best wishes,
> Lydia
>
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Compile a C++ app. using the Lustre API

2017-03-15 Thread Patrick Farrell
It looks like your compiler is being fussier than the C compiler.


Specifically, the problem appears to be with the enum type.  The C compiler is 
happy to let you pass a short (cr_flags) where an enum is called for 
(argument to changelog_rec_offset).  In C, I think an enum is an int (so 
passing in a short like this is always fine).  I guess in C++ either enum is 
not an int, or it's just fussier.
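
A stand-alone reproducer (nothing Lustre-specific, and using gcc/g++ here 
rather than icc/icpc) shows the difference:

cat > enum_demo.c <<'EOF'
enum flags { FLAG_A = 1, FLAG_B = 2 };
static int takes_flags(enum flags f) { return (int)f; }
int main(void) { unsigned short raw = 1; return takes_flags(raw); }
EOF
gcc -c enum_demo.c -o /dev/null           # C: implicit integer-to-enum conversion, compiles
g++ -x c++ -c enum_demo.c -o /dev/null    # C++: rejected, an explicit cast would be required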


There might be a compiler flag to make it not error on this?  I am not familiar 
with icpc, so I can't help with specifics.

You might also try a different C++ compiler, to see if it has a different 
attitude towards that error.


One further thought, though:
This is a C header.  Presumably, it is not intended to be included directly in 
a C++ project?


- Patrick


From: lustre-discuss  on behalf of 
François Tessier 
Sent: Wednesday, March 15, 2017 2:00:31 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Compile a C++ app. using the Lustre API

Hi All,

I'm working on a piece of code using the Lustre API. To do so, I include
lustreapi.h. When I compile my code with a C compiler (icc), everything
is fine. However, when I compile it with a C++ compiler (icpc), I get
these errors:

-

In file included from /usr/include/lustre/lustreapi.h(47),
 from topo.c(5):
/usr/include/lustre/lustre_user.h(914): error: argument of type 
"__u16={unsigned short}" is incompatible with parameter of type 
"changelog_rec_flags"
 return changelog_rec_offset(rec->cr_flags);
 ^

In file included from /usr/include/lustre/lustreapi.h(47),
 from topo.c(5):
/usr/include/lustre/lustre_user.h(925): error: a value of type "int" cannot be 
used to initialize an entity of type "changelog_rec_flags"
 enum changelog_rec_flags crf = rec->cr_flags & CLF_VERSION;
^

In file included from /usr/include/lustre/lustreapi.h(47),
 from topo.c(5):
/usr/include/lustre/lustre_user.h(935): error: a value of type "int" cannot be 
used to initialize an entity of type "changelog_rec_flags"
 enum changelog_rec_flags crf = rec->cr_flags &
^

In file included from /usr/include/lustre/lustreapi.h(47),
 from topo.c(5):
/usr/include/lustre/lustre_user.h(945): error: argument of type "int" is 
incompatible with parameter of type "changelog_rec_flags"
 return (char *)rec + changelog_rec_offset(rec->cr_flags &
   ^

In file included from /usr/include/lustre/lustreapi.h(47),
 from topo.c(5):
/usr/include/lustre/lustre_user.h(986): error: this operation on an enumerated 
type requires an applicable user-defined operator function
 crf_wanted &= CLF_SUPPORTED;
^

In file included from /usr/include/lustre/lustreapi.h(47),
 from topo.c(5):
/usr/include/lustre/lustre_user.h(997): error: argument of type "int" is 
incompatible with parameter of type "changelog_rec_flags"
   changelog_rec_offset(crf_wanted & ~CLF_JOBID);
^

In file included from /usr/include/lustre/lustreapi.h(47),
 from topo.c(5):
/usr/include/lustre/lustre_user.h(999): error: argument of type "int" is 
incompatible with parameter of type "changelog_rec_flags"
   changelog_rec_offset(crf_wanted & ~(CLF_JOBID | CLF_RENAME));
^

Makefile:10: recipe for target 'topo' failed
make: *** [topo] Error 2

-

It's probably more a compiler issue than a Lustre one but a solution
could help other users or Lustre developers.

Any idea?

Thanks,

François


--
--
François TESSIER, Ph.D.
Postdoctoral Appointee
Argonne National Laboratory
LCF Division - Bldg 240, 4E 19
Tel : +1 (630)-252-5068
http://www.francoistessier.info

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LNET Self-test

2017-02-05 Thread Patrick Farrell
Doug,


It seems to me that's not true any more, with larger RPC sizes available.  Is 
there some reason that's not true?


- Patrick


From: lustre-discuss  on behalf of 
Oucharek, Doug S 
Sent: Sunday, February 5, 2017 3:18:10 PM
To: Jeff Johnson
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] LNET Self-test

Yes, you can bump your concurrency.  Size caps out at 1M because that is how 
LNet is set up to work.  Going over 1M size would result in an unrealistic 
Lustre test.
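
For what it's worth, a minimal lst run with a higher concurrency looks roughly 
like this (NIDs are placeholders; the lnet_selftest module has to be loaded on 
the nodes involved):

export LST_SESSION=$$
lst new_session bwtest
lst add_group clients 192.168.1.10@o2ib
lst add_group servers 192.168.1.20@o2ib
lst add_batch bulk
lst add_test --batch bulk --concurrency 8 --from clients --to servers brw write size=1M
lst run bulk
lst stat clients servers      # let it run for a while, then interrupt
lst end_session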

Doug

> On Feb 5, 2017, at 11:55 AM, Jeff Johnson  
> wrote:
>
> Without seeing your entire command it is hard to say for sure but I would 
> make sure your concurrency option is set to 8 for starters.
>
> --Jeff
>
> Sent from my iPhone
>
>> On Feb 5, 2017, at 11:30, Jon Tegner  wrote:
>>
>> Hi,
>>
>> I'm trying to use lnet selftest to evaluate network performance on a test 
>> setup (only two machines). Using e.g., iperf or Netpipe I've managed to 
>> demonstrate the bandwidth of the underlying 10 Gbits/s network (and 
>> typically you reach the expected bandwidth as the packet size increases).
>>
>> How can I do the same using lnet selftest (i.e., verifying the bandwidth of 
>> the underlying hardware)? My initial thought was to increase the I/O size, 
>> but it seems the maximum size one can use is "--size=1M".
>>
>> Thanks,
>>
>> /jon
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Status of LU-8703 for Knights Landing

2017-02-01 Thread Patrick Farrell
Andrew,


Are they really just not working?  I didn't see that with KNL (the default CPT 
generated without the fixes from LU-8703 is very weird, but didn't affect 
performance much - the real NUMA-ness of KNL processors seems to be minimal, 
despite the various NUMA related configuration options...), but Cray systems 
are unusual and I don't think I ever saw an empty NUMA node (possibly something 
we fix in the BIOS).  Anyway, you should be able to work around this without 
patching your client, just set some module parameters before starting 
Lustre/loading the modules.

I can think of two things which should work, both are module parameters for the 
libcfs module, I believe.  I haven't tried this, so it's possible your error is 
coming earlier in the loading process...  But I think not, based on the message.

1. Limit yourself to 1 partition, by setting cpu_npartitions to 1.
static int cpu_npartitions;
module_param(cpu_npartitions, int, 0444);
MODULE_PARM_DESC(cpu_npartitions, "# of CPU partitions");


2. Or, you could draw up a CPU partition table yourself.  Parameter name is 
cpu_pattern.


Here's the code describing that:
"

/**
 * modparam for setting CPU partitions patterns:
 *
 * i.e: "0[0,1,2,3] 1[4,5,6,7]", number before bracket is CPU partition ID,
 *  number in bracket is processor ID (core or HT)
 *
 * i.e: "N 0[0,1] 1[2,3]" the first character 'N' means numbers in bracket
 *   are NUMA node ID, number before bracket is CPU partition ID.
 *
 * i.e: "N", shortcut expression to create CPT from NUMA & CPU topology
 *
 * NB: If user specified cpu_pattern, cpu_npartitions will be ignored
 */
static char *cpu_pattern = "N";
module_param(cpu_pattern, charp, 0444);
MODULE_PARM_DESC(cpu_pattern, "CPU partitions pattern");"

Notice the default pattern is N, but you can override it.

(Code references from libcfs/libcfs/linux/linux-cpu.c in Lustre.)

Either of those should let you get past the error, no need to carry patches.  I 
can't speak to the production-readiness of the patches, but I'd definitely go 
the module parameter route if it were my system.
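
For example, in /etc/modprobe.d/lustre.conf (or wherever the libcfs options 
live on your system) - values purely illustrative, and note the comment quoted 
above: cpu_npartitions is ignored unless cpu_pattern is cleared:

options libcfs cpu_pattern="" cpu_npartitions=1
# or spell the partitions out explicitly:
# options libcfs cpu_pattern="0[0,1,2,3] 1[4,5,6,7]"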

- Patrick


From: lustre-discuss  on behalf of 
Prout, Andrew - LLSC - MITLL 
Sent: Wednesday, February 1, 2017 3:11:07 PM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] Status of LU-8703 for Knights Landing

Anyone know the production-readiness of the patches attached to LU-8703 to fix 
issues with Lustre on Xeon Phi Knights Landing hardware? We're considering 
merging them against our 2.9.0 client to get it working on our KL nodes.

Andrew Prout
Lincoln Laboratory Supercomputing Center
MIT Lincoln Laboratory
244 Wood Street, Lexington, MA 02420
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


  1   2   >