Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
At first glance, this sounds like your Infiniband subnet manager may be down or malfunctioning. In this case, nodes which were already up when the subnet manager was working will continue to be able to communicate over IB, but nodes which reboot after the SM goes down will not. You can test this t
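The suggested check can be sketched as a small parser over `sminfo` output. This is a minimal sketch, not part of the original thread; the sample line is taken from the follow-up message below, and the `SMINFO_MASTER`-style state names are assumed from the usual OpenSM/infiniband-diags output format.

```python
import re

def sm_state(sminfo_output: str) -> str:
    """Classify subnet-manager state from a line of `sminfo` output.

    Assumes a line of the form (as seen later in this thread):
    'sminfo: sm lid 1 sm guid 0x..., activity count N priority 0 state 3 SMINFO_MASTER'
    """
    m = re.search(r"state\s+\d+\s+SMINFO_(\w+)", sminfo_output)
    if not m:
        return "UNKNOWN"
    return m.group(1)  # e.g. MASTER, STANDBY, NOTACTIVE, DISCOVERING

sample = ("sminfo: sm lid 1 sm guid 0xf452140300f62320, "
          "activity count 80878098 priority 0 state 3 SMINFO_MASTER")
print(sm_state(sample))  # MASTER
```

A state of MASTER means an SM is up and answering; if `sminfo` times out or reports no master, the rebooted node cannot join the fabric.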

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
> link_layer: InfiniBand
>
> [root@pg-gpu01 ~]# sminfo
> sminfo: sm lid 1 sm guid 0xf452140300f62320, activity count 80878098 priority 0 state 3 SMINFO_MASTER
>
> Looks like the rebooted node is able to connect/contact the IB subnet manager

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
> ... FW version, for the same device available on this fabric is 2.36.5150
> -W- pg-node014/U1 - Node has FW version 2.32.5100 while the latest FW version, for the same device available on this fabric is 2.36.5150
> -W- pg-node015/U1 - Node has FW version 2.32.5100 while the latest FW version,

Re: [lustre-discuss] client fails to mount

2017-04-24 Thread Russell Dekema
I'm not sure this is likely to help either, but if you run the command 'ibhosts' on one of the non-working Lustre client nodes, do you see all of your Lustre servers in the printed list?

-Rusty

On Mon, Apr 24, 2017 at 10:39 AM, Russell Dekema wrote:
> I can't rule it out
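The `ibhosts` check can be automated with a sketch like the following. The hostnames, GUID, and expected-server set are invented for illustration; the `Ca : 0x... ports N "host HCA-1"` line shape is the typical `ibhosts` output format.

```python
def missing_servers(ibhosts_output: str, expected: set) -> set:
    """Return expected hostnames absent from `ibhosts` output.

    ibhosts prints one line per host channel adapter, e.g.:
      Ca : 0xf452140300f62320 ports 2 "mds01 HCA-1"
    The node description between quotes usually starts with the hostname.
    """
    seen = set()
    for line in ibhosts_output.splitlines():
        if '"' in line:
            desc = line.split('"')[1]   # text between the quotes
            seen.add(desc.split()[0])   # first token is the hostname
    return expected - seen

sample = '''Ca : 0xf452140300f62320 ports 2 "mds01 HCA-1"
Ca : 0xf452140300f62321 ports 2 "oss01 HCA-1"'''
print(missing_servers(sample, {"mds01", "oss01", "oss02"}))  # {'oss02'}
```

Any server in the expected set that never appears in the output is unreachable at the IB layer from that client.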

[lustre-discuss] Meaning of 'slow creates' messages on MDS

2017-05-28 Thread Russell Dekema
Greetings, We have been having various kinds of trouble with our Lustre filesystem lately; right now the main problem we are having is intermittent severe slowness (such as 30 seconds for an 'ls' of a directory containing 100 files to return) when 'cd' and 'ls'ing around our Lustre filesystem. As

Re: [lustre-discuss] Meaning of 'slow creates' messages on MDS

2017-05-30 Thread Russell Dekema
On Tue, May 30, 2017 at 12:20 PM, Oleg Drokin wrote:
> This means exactly what it says.
> This OST is slow creating new objects (for the object preallocates).
> If all of your OST creates are slow - then when you create a lot of files, eventually you run out of OST objects (or when striping
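For context on the preallocation mechanism mentioned above: the MDS keeps a window of precreated object IDs per OST, and when OST creates are slow that window drains until file creation stalls. Below is a hedged sketch of computing the remaining window from `lctl get_param osp.*.prealloc_*` style output. The parameter names and sample values are illustrative; check the actual param names on your Lustre version.

```python
def precreate_window(lctl_output: str) -> dict:
    """Remaining precreated objects per OST, from lctl-style key=value lines.

    Expects lines like (names illustrative, e.g. from something like
    'lctl get_param osp.*.prealloc_last_id osp.*.prealloc_next_id'):
      osp.fs-OST0000-osc-MDT0000.prealloc_last_id=105000
      osp.fs-OST0000-osc-MDT0000.prealloc_next_id=104990
    """
    last, nxt = {}, {}
    for line in lctl_output.splitlines():
        name, _, val = line.partition("=")
        if name.endswith(".prealloc_last_id"):
            last[name.rsplit(".", 1)[0]] = int(val)
        elif name.endswith(".prealloc_next_id"):
            nxt[name.rsplit(".", 1)[0]] = int(val)
    # window remaining = last preallocated id minus next id to be handed out
    return {k: last[k] - nxt[k] for k in last if k in nxt}

sample = """osp.fs-OST0000-osc-MDT0000.prealloc_last_id=105000
osp.fs-OST0000-osc-MDT0000.prealloc_next_id=104990"""
print(precreate_window(sample))  # {'osp.fs-OST0000-osc-MDT0000': 10}
```

A window near zero on every OST is consistent with the 'slow creates' symptom described in this thread.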

[lustre-discuss] Per-client I/O Operation Counters

2017-06-01 Thread Russell Dekema
Greetings, Is there a way, either on the Lustre clients or (preferably) OSSes, to determine how many I/O operations each Lustre client is performing against the filesystem? I know several ways of finding the number of *bytes* read or written by a client (or even on a per-job basis with job_stats)
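One commonly used source for per-client counters is the per-export stats on the OSS (e.g. under `obdfilter.*.exports.<client_nid>.stats`), where the `samples` count on each counter line is the number of operations rather than bytes. A minimal parser sketch, assuming the usual stats-file layout; the sample values are made up.

```python
def ops_per_export(stats_text: str) -> dict:
    """Extract operation counts from a Lustre per-export stats file.

    Counter lines look like: '<name> <count> samples [<unit>] min max sum',
    so the second field is the operation count for that counter.
    """
    ops = {}
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[2] == "samples":
            ops[parts[0]] = int(parts[1])
    return ops

sample = """snapshot_time 1612345678.123456 secs.usecs
read_bytes 1240 samples [bytes] 4096 1048576 5242880000
write_bytes 310 samples [bytes] 4096 1048576 1310720000"""
print(ops_per_export(sample))  # {'read_bytes': 1240, 'write_bytes': 310}
```

Sampling this twice and differencing gives a per-client ops rate, which is often what you want when hunting for a noisy client.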

Re: [lustre-discuss] Lustre user mount permission issue

2017-08-06 Thread Russell Dekema
Good evening, In my experience, you definitely need to sync your user/group information with the MDS(es). I don't think you need to sync it to the OSSes, though.

-Rusty

On Sun, Aug 6, 2017 at 9:32 PM, Yasir Israr wrote:
> I've sync lustre user with all mounted client. Do I've to sync user with
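A quick way to audit this is to compare `getent passwd` output from a client against the MDS. The sketch below assumes name-to-UID maps have already been collected from each node; the usernames and UIDs are invented for illustration.

```python
def uid_mismatches(client_passwd: dict, mds_passwd: dict) -> list:
    """Users whose UID differs on, or who are missing from, the MDS.

    Inputs are name->uid maps, e.g. built by parsing `getent passwd`
    on the client and on the MDS respectively.
    """
    bad = []
    for user, uid in client_passwd.items():
        if mds_passwd.get(user) != uid:
            bad.append(user)
    return sorted(bad)

client = {"alice": 5001, "bob": 5002}
mds = {"alice": 5001}          # bob is missing on the MDS
print(uid_mismatches(client, mds))  # ['bob']
```

Any user listed here would see permission or ownership oddities on Lustre even though everything looks fine from the client's own passwd database.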

[lustre-discuss] Lustre chown and create operations failing for certain UIDs

2019-01-10 Thread Russell Dekema
We've got a Lustre system running lustre-2.5.42.28.ddn8 and are having a problem with it that none of us here have ever seen before. We are wondering if anyone here has seen this or has any idea what might be causing it. (I have redacted the example affected username and its corresponding UID in t

Re: [lustre-discuss] Lustre chown and create operations failing for certain UIDs

2019-01-11 Thread Russell Dekema
> ... little different from what I see when the MDS node's passwd file is incomplete, but did you verify the affected_user has a proper /etc/passwd entry on the MDS node(s)?
>
> On 1/10/19 12:14 PM, Russell Dekema wrote:
> > We've got a Lustre system running lustre-2.5.42

Re: [lustre-discuss] Robinhood scan time

2020-12-04 Thread Russell Dekema
Greetings,

What kind of hardware are you running on your metadata array?

Cheers,
Rusty Dekema

On Fri, Dec 4, 2020 at 5:12 PM Kumar, Amit wrote:
> HI All,
>
> During LAD’20 Andreas mentioned if I could share the Robinhood scan time for the 369 million files we have. So here it is. It to