[lustre-discuss] questions about group locks / LDLM_FL_NO_TIMEOUT flag

2023-08-30 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello, We have a few files created by a particular application where reads to those files consistently hang. The debug log on a client attempting a read() has messages like: "ldlm_completion_ast(): waiting indefinitely because of NO_TIMEOUT ..." This is printed when the flag
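To see which lock the client is waiting on, one approach is to raise DLM tracing on the client and dump the kernel debug buffer. A sketch (run as root on the hanging client; the log path is illustrative):

```shell
# Add DLM lock tracing to the client's debug mask
lctl set_param debug=+dlmtrace
# Reproduce the hung read(), then dump the debug buffer to a file
lctl dk /tmp/ldlm-debug.log
# Look for the indefinite wait and the lock flags around it
grep -i no_timeout /tmp/ldlm-debug.log
```

The lines near the NO_TIMEOUT message normally identify the resource and the lock mode (a group lock shows up as mode GROUP), which points at the holder.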

Re: [lustre-discuss] questions about group locks / LDLM_FL_NO_TIMEOUT flag

2023-09-01 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
... is killed. Cheers, Andreas > On Aug 30, 2023, at 07:42, Bertschinger, Thomas Andrew Hjorth via > lustre-discuss wrote: > > Hello, > > We have a few files created by a particular application where reads to those > files consistently hang. The debug log on a client

[lustre-discuss] very slow mounts with OSS node down and peer discovery enabled

2023-10-26 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello, Recently we had an OSS node down for an extended period with hardware problems. While the node was down, mounting Lustre on a client took an extremely long time to complete (20-30 minutes). Once the fs is mounted, all operations are normal and there isn't any noticeable impact from the
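If the delay comes from LNet peer discovery retrying the dead OSS, one workaround sketch is to turn discovery off on the client before mounting (tunable names are the standard lnetctl globals; the timeout value is illustrative):

```shell
# Disable dynamic peer discovery on the client
lnetctl set discovery 0
# Optionally shorten the LNet transaction timeout so dead peers fail faster
lnetctl set transaction_timeout 10
# Confirm the current global settings
lnetctl global show
```

This trades automatic discovery of new peers for faster failure against unreachable ones, so it is best treated as a temporary measure while the OSS is down.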

[lustre-discuss] DNE v3 and directory inode changing

2023-03-23 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello, We've been experimenting with DNEv3 recently and have run into this issue: https://jira.whamcloud.com/browse/LU-7607 where the directory inode number changes after auto-split. In addition to the problem noted with backups that track the inode number, we have found that file access

Re: [lustre-discuss] DNE v3 and directory inode changing

2023-03-24 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Thanks, this is helpful. We certainly don't need the auto-split feature and were just experimenting with it, so this should be fine for us. And we have been satisfied with the round robin directory creation so far. Just out of curiosity, is the auto-split feature still being actively worked on

Re: [lustre-discuss] Lustre 2.15.1 server with ZFS nothing provides ksym

2023-07-11 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello Jon, I've run into this issue in the past when trying to install Lustre with ZFS RPMs I built myself. I was able to resolve the problem by including the flag "--with-spec=redhat" when configuring ZFS: ./configure --with-spec=redhat see:
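A sketch of the full build sequence with that flag (the source directory and version are illustrative; targets are the standard ZFS RPM build targets):

```shell
# Build ZFS kmod-style RPMs so Lustre's ksym dependencies can resolve
cd zfs-2.1.x
./configure --with-spec=redhat
make -j"$(nproc)" rpm-utils rpm-kmod
```

Without --with-spec=redhat, configure produces DKMS-style packaging, whose RPMs do not export the ksym provides that the Lustre server RPMs require.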

Re: [lustre-discuss] [EXTERNAL] Re: storing Lustre jobid in file xattrs: seeking feedback

2023-05-14 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
..."slurm", ...) to reduce the chance of namespace clashes in user or system? However, I'm reluctant to restrict names in user since this shouldn't have any fatal side effects (e.g. data corruption like in trusted or system), and the admin is supposed to know what they are doing... On May 4,

[lustre-discuss] storing Lustre jobid in file xattrs: seeking feedback

2023-05-04 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello Lustre Users, There has been interest in a proposed feature https://jira.whamcloud.com/browse/LU-13031 to store the jobid with each Lustre file at create time, in an extended attribute. An open question is which xattr namespace to use: "user", the Lustre-specific namespace
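The practical difference between the candidate namespaces can be illustrated with setfattr/getfattr (the attribute name "job_id" and the mount path are hypothetical):

```shell
touch /mnt/lustre/testfile
# "user" namespace: writable by the file owner, readable without privilege
setfattr -n user.job_id -v "slurm:12345" /mnt/lustre/testfile
getfattr -n user.job_id /mnt/lustre/testfile
# "trusted" namespace: requires CAP_SYS_ADMIN to read or write
sudo setfattr -n trusted.job_id -v "slurm:12345" /mnt/lustre/testfile
```

This is the crux of the question above: a "user" xattr can be overwritten (or deleted) by the file owner, while "trusted" and "system" are privileged and carry the risk of clashing with attributes the kernel itself interprets.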

[lustre-discuss] question on behavior of supplementary group permissions

2024-01-24 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello, We have a curious issue with supplemental group permissions. There is a set of files where a user has group read permission to the files via a supplemental group. If the user tries to open() one of these files, they get EACCES. Then, if the user stat()s the file (or seemingly does any
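When debugging a case like this, it helps to compare the groups the client kernel assigns to the process against what the MDS resolves via its identity upcall. A sketch (the file path is illustrative; the parameter name is the standard MDT identity-upcall tunable):

```shell
# On the client: UID, primary GID, and supplementary groups of the caller
id -u; id -g; id -G
# The file's owner, group, and mode
stat -c '%U:%G %a' /mnt/lustre/somefile
# On the MDS: check how supplementary groups are resolved server-side
lctl get_param mdt.*.identity_upcall
```

If the upcall is disabled (set to NONE), the MDS trusts only the groups sent in the RPC, which can diverge from the client's view and produce exactly this kind of transient EACCES.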

Re: [lustre-discuss] Coordinating cluster start and shutdown?

2023-12-06 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello Jan, You can use the Pacemaker / Corosync high-availability software stack for this: specifically, ordering constraints [1] can be used. Unfortunately, Pacemaker is probably overkill if you don't need HA -- its configuration is complex and difficult to get right, and it significantly
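As a sketch of the ordering-constraint approach, with pcs and resource names that are purely illustrative (an MGS/MDT resource and two OST resources):

```shell
# Start the MGS/MDT before each OST; Pacemaker stops them in reverse order
pcs constraint order start mgs-mdt0 then ost0 kind=Mandatory
pcs constraint order start mgs-mdt0 then ost1 kind=Mandatory
# Review the configured constraints
pcs constraint show
```

With kind=Mandatory, an OST will not be started until the MGS/MDT resource is active, which gives the coordinated startup/shutdown sequence the question asks about.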

Re: [lustre-discuss] hanging threads

2023-12-18 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello Ger, Can you share the full stack trace from the log output for the hung thread? That will be helpful for diagnosing the issue. Some other clues: do you get any stack traces or error output on clients where you observe the hang? Does every client hang, or only some? Does it hang on any
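To collect the stack traces requested above, the usual Linux-side tools apply (the PID is illustrative; sysrq must be enabled for the last command):

```shell
# Kernel log normally flags threads blocked too long
dmesg | grep -i 'blocked for more than'
# Dump the kernel stack of one specific hung thread
cat /proc/12345/stack
# Or trigger a stack dump of all tasks via sysrq
echo t > /proc/sysrq-trigger
```

The resulting traces in dmesg show where each Lustre service thread is waiting, which is usually enough to tell a lock wait from an I/O wait.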

Re: [lustre-discuss] open() against files on lustre hangs

2024-02-22 Thread Bertschinger, Thomas Andrew Hjorth via lustre-discuss
Hello Jan, More often than not, when I see stat() syscalls hanging, it's due to a communication issue with an OSS rather than an MDS. I think the message about "Lustre: comind-MDT: haven't heard from client ..." may be a downstream effect of the client hanging (maybe due to an OSS issue),
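A quick way to test the OSS theory from the hanging client (the file path is illustrative):

```shell
# Which OSTs back the file in question
lfs getstripe /mnt/lustre/hanging-file
# The client's view of each OSC connection; anything other than FULL
# suggests a degraded or disconnected OSS
lctl get_param osc.*.import | grep -E 'target|state'
# Simple reachability check of all OSTs
lfs check osts
```

If the OSC import for one of the file's OSTs is stuck in CONNECTING or DISCONN, that points at the OSS rather than the MDS, consistent with the diagnosis above.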