[lustre-discuss] Lnet errors

2023-10-05 Thread Alastair Basden via lustre-discuss

Hi,

Lustre 2.12.2.

We are seeing lots of errors on the servers such as:
Oct  5 11:16:48 oss04 kernel: LNetError: 
6414:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending PUT to 
12345-172.19.171.15@o2ib1: -125
Oct  5 11:16:48 oss04 kernel: LustreError: 
6414:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc 
8fe066bb9400

and
Oct  4 14:59:48 oss04 kernel: LustreError: 
6383:0:(events.c:305:request_in_callback()) event type 2, status -103, service 
ost_io

and
Oct  5 11:18:06 oss04 kernel: LustreError: 
6388:0:(events.c:305:request_in_callback()) event type 2, status -5, service 
ost_io
Oct  5 11:18:06 oss04 kernel: LNet: 
6412:0:(o2iblnd_cb.c:413:kiblnd_handle_rx()) PUT_NACK from 172.19.171.15@o2ib1

and on the clients:
m7: Oct  5 14:46:59 m7132 kernel: LustreError: 
2466:0:(events.c:200:client_bulk_callback()) event type 2, status -103, desc 
9a251fc14400

and
m7: Oct  5 11:18:34 m7086 kernel: LustreError: 
2495:0:(events.c:200:client_bulk_callback()) event type 2, status -5, desc 
9a39ad668000
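
For reference, those status codes are standard errno values: -125 is
ECANCELED, -103 is ECONNABORTED and -5 is EIO, which usually points at
messages being aborted or resent at the LNet/o2ib level rather than at the
OST itself.  A minimal diagnostic sketch (assuming lnetctl is available on
the 2.12 servers; the NID is the peer from the log above):

lnetctl net show -v                              # per-NI health and credit counters
lnetctl stats show                               # global drop/retry/resend counters
lnetctl peer show -v --nid 172.19.171.15@o2ib1   # state of the peer seen in the errors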

Does anyone have any ideas about what could be causing this?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Changing OST servicenode

2022-11-16 Thread Alastair Basden via lustre-discuss

Hi,

We want to change the service node of an OST.  We think this involves:
1. umount the OST
2. tunefs.lustre --erase-param failover.node 
--servicenode=172.18.100.1@o2ib,172.17.100.1@tcp pool1/ost1

Is that all?  It is unclear from the documentation whether a writeconf is 
required (if it is, we'd need to unmount the whole file system, take it all 
down, writeconf every OST/MDT/MGT, and then mount them in order).
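
If a writeconf does turn out to be necessary, a rough sketch of the full
procedure (mount point is illustrative; target name and NIDs are taken from
the command above) would be:

umount /mnt/ost1            # on the current OSS, after unmounting all clients
tunefs.lustre --erase-param failover.node \
    --servicenode=172.18.100.1@o2ib,172.17.100.1@tcp \
    --writeconf pool1/ost1
# then run tunefs.lustre --writeconf on the MGT/MDT and every other target,
# and remount in order: MGT, MDT(s), OSTs, then clients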


Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre recycle bin

2022-10-17 Thread Alastair Basden via lustre-discuss

Hi Francois,

We had something similar a few months back - I suspect a bug somewhere.

Basically, files weren't getting removed from the OST.  Eventually, we 
mounted the OST as ldiskfs (ext4) and removed them manually, I think.


A restart of the file system meant that rm operations then proceeded 
correctly after that.


Cheers,
Alastair.

On Mon, 17 Oct 2022, Cloete, F. (Francois) via lustre-discuss wrote:


Hi Andreas,
Our OSTs still display high file-system usage after removing folders.

Are there any commands that could be run to confirm whether the allocated 
space that was used by those files has been released successfully?

Thanks
Francois

From: Andreas Dilger 
Sent: Saturday, 15 October 2022 00:20
To: Cloete, F. (Francois) 
Cc: lustre-discuss@lists.lustre.org
Subject: Re: [lustre-discuss] Lustre recycle bin

There isn't a recycle bin, but filenames are deleted from the filesystem 
quickly and the data objects are deleted in the background asynchronously (with 
transactions to prevent the space being leaked).  If there are a lot of files 
this may take some time, rebooting will not speed it up.
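
A rough way to confirm that the background destroys are progressing rather
than stuck (parameter names as in 2.12; the exact set may vary by version)
is to watch the OSP queues on the MDS and re-run lfs df periodically:

lctl get_param osp.*.destroys_in_flight    # object destroys queued per OST
lctl get_param osp.*.sync_changes          # llog records not yet committed to the OSTs
lfs df                                     # used space should fall as destroys complete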


On Oct 14, 2022, at 10:00, Cloete, F. (Francois) via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hi Community,
Is anyone aware of a recycle bin parameter for Lustre?

Just deleted a whole lot of files but for some reason the space is not getting 
cleared.

Server rebooted, file-system un-mounted etc.

Thanks



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud






___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] (no subject)

2022-05-17 Thread Alastair Basden via lustre-discuss

Hi all,

We had a problem with one of our MDS (ldiskfs) on Lustre 2.12.6, which we 
think is a bug - but haven't been able to identify it.  Can anyone shed 
any light?  We unmounted and remounted the mdt at around 23:00.


Client logs:
May 16 22:15:41 m8011 kernel: LustreError: 11-0: 
lustrefs8-MDT-mdc-956fb73c3800: operation ldlm_enqueue to node 
172.18.185.1@o2ib failed: rc = -107
May 16 22:15:41 m8011 kernel: Lustre: lustrefs8-MDT-mdc-956fb73c3800: 
Connection to lustrefs8-MDT (at 172.18.185.1@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
May 16 22:15:41 m8011 kernel: LustreError: Skipped 5 previous similar messages
May 16 22:15:48 m8011 kernel: Lustre: 
101710:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1652735641/real 1652735641]  req@949d8cb1de80 
x1724290358528896/t0(0) 
o101->lustrefs8-MDT-mdc-956fb73c3800@172.18.185.1@o2ib:12/10 lens 
480/568 e 4 to 1 dl 1652735748 ref 2 fl Rpc:X/0/ rc 0/-1
May 16 22:15:48 m8011 kernel: Lustre: 
101710:0:(client.c:2146:ptlrpc_expire_one_request()) Skipped 6 previous similar 
messages
May 16 23:00:15 m8011 kernel: Lustre: 
4784:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out 
for slow reply: [sent 1652738408/real 1652738408]  req@94ea07314380 
x1724290358763776/t0(0) o400->MGC172.18.185.1@o2ib@172.18.185.1@o2ib:26/25 lens 
224/224 e 0 to 1 dl 1652738415 ref 1 fl Rpc:XN/0/ rc 0/-1
May 16 23:00:15 m8011 kernel: LustreError: 166-1: MGC172.18.185.1@o2ib: 
Connection to MGS (at 172.18.185.1@o2ib) was lost; in progress operations using 
this service will fail
May 16 23:00:15 m8011 kernel: Lustre: Evicted from MGS (at 
MGC172.18.185.1@o2ib_0) after server handle changed from 0xdb7c7c778c8908d6 to 
0xdb7c7cbad3be9e79
May 16 23:00:15 m8011 kernel: Lustre: MGC172.18.185.1@o2ib: Connection restored 
to MGC172.18.185.1@o2ib_0 (at 172.18.185.1@o2ib)
May 16 23:01:49 m8011 kernel: LustreError: 167-0: 
lustrefs8-MDT-mdc-956fb73c3800: This client was evicted by 
lustrefs8-MDT; in progress operations using this service will fail.
May 16 23:01:49 m8011 kernel: LustreError: 
101719:0:(vvp_io.c:1562:vvp_io_init()) lustrefs8: refresh file layout 
[0x28107:0x9b08:0x0] error -108.
May 16 23:01:49 m8011 kernel: LustreError: 
101719:0:(vvp_io.c:1562:vvp_io_init()) Skipped 3 previous similar messages
May 16 23:01:49 m8011 kernel: Lustre: lustrefs8-MDT-mdc-956fb73c3800: 
Connection restored to 172.18.185.1@o2ib (at 172.18.185.1@o2ib)



MDS server logs:
May 16 22:15:40 c8mds1 kernel: LustreError: 
10686:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired 
after 99s: evicting client at 172.18.181.11@o2ib  ns: 
mdt-lustrefs8-MDT_UUID lock: 97b3730d98c0/0xdb7c7cbad3be1c7b lrc: 3/0,0 
mode: PW/PW res: [0x29119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 201 type: IBT 
flags: 0x6020040020 nid: 172.18.181.11@o2ib remote: 0xe62e31610edfb808 
expref: 90 pid: 10707 timeout: 8482830 lvb_type: 0
May 16 22:15:40 c8mds1 kernel: LustreError: 
10712:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) ### lock on destroyed export 
9769eaf46c00 ns: mdt-lustrefs8-MDT_UUID lock: 
97d828635e80/0xdb7c7cbad3be1c90 lrc: 3/0,0 mode: PW/PW res: 
[0x29119:0x327f:0x0].0x0 bits 0x40/0x0 rrc: 199 type: IBT flags: 
0x5020040020 nid: 172.18.181.11@o2ib remote: 0xe62e31610edfb80f expref: 77 
pid: 10712 timeout: 0 lvb_type: 0
May 16 22:15:40 c8mds1 kernel: LustreError: 
10712:0:(ldlm_lockd.c:1351:ldlm_handle_enqueue0()) Skipped 27 previous similar 
messages
May 16 22:17:22 c8mds1 kernel: LNet: Service thread pid 10783 was inactive for 
200.73s. The thread might be hung, or it might only be slow and will resume 
later. Dumping the stack trace for debugging purposes:
May 16 22:17:22 c8mds1 kernel: LNet: Skipped 3 previous similar messages
May 16 22:17:22 c8mds1 kernel: Pid: 10783, comm: mdt01_040 
3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
May 16 22:17:22 c8mds1 kernel: Call Trace:
May 16 22:17:22 c8mds1 kernel: [] 
ldlm_completion_ast+0x430/0x860 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [] 
ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [] 
mdt_object_local_lock+0x50b/0xb20 [mdt]
May 16 22:17:22 c8mds1 kernel: [] 
mdt_object_lock_internal+0x70/0x360 [mdt]
May 16 22:17:22 c8mds1 kernel: [] mdt_object_lock+0x20/0x30 
[mdt]
May 16 22:17:22 c8mds1 kernel: [] mdt_brw_enqueue+0x44b/0x760 
[mdt]
May 16 22:17:22 c8mds1 kernel: [] mdt_intent_brw+0x1f/0x30 
[mdt]
May 16 22:17:22 c8mds1 kernel: [] 
mdt_intent_policy+0x435/0xd80 [mdt]
May 16 22:17:22 c8mds1 kernel: [] 
ldlm_lock_enqueue+0x376/0x9b0 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [] 
ldlm_handle_enqueue0+0xa86/0x1620 [ptlrpc]
May 16 22:17:22 c8mds1 kernel: [] tgt_enqueue+0x62/0x210 
[ptlrpc]
May 16 22:17:22 c8mds1 kernel: [] 
tgt_request_handle+0xada/0x1570 [ptlrpc]
May 16 

[lustre-discuss] ZFS wobble

2022-04-28 Thread Alastair Basden via lustre-discuss

Hi,

We have OSDs on ZFS (0.7.9) / Lustre 2.12.6.

Recently, one of our JBODs had a wobble, and the disks (as presented to 
the OS) disappeared for a few seconds (and then returned).


This upset a few zpools, which became SUSPENDED.

A zpool clear on these then started the resilvering process, and zpool 
status gave e.g.:

errors: Permanent errors have been detected in the following files:

:<0x0>
:<0xb01>
:<0x15>
:<0x383>
cos6-ost7/ost7:/O/40400/d11/10617643
cos6-ost7/ost7:/O/40400/d21/583029


However, once the resilvering had completed, these permanent errors had 
gone.


The question is then, are these errors really permanent, or was zfs able 
to correct them?


Lustre remains fine (though it obviously froze while the pools 
were suspended).


Should we be worried that there might be some under-the-hood corruption 
that will present itself when we need to remount (e.g. after a reboot) the 
OST?  In particular the :<0x0> file worries me a bit!
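
One way to gain confidence (a sketch, using the pool name from the status
output above) is to re-run a scrub once the resilver has finished and check
that the error list stays empty; entries shown only as hex object IDs
typically refer to metadata or already-deleted objects and are cleared once
ZFS has verified or repaired the data from redundancy:

zpool scrub cos6-ost7
zpool status -v cos6-ost7    # after the scrub completes, re-check the error list
zpool clear cos6-ost7        # reset the error counters if the scrub comes back clean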


Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.12.6 freeze

2021-12-01 Thread Alastair Basden

Hi,

It turns out there is a problem with the zpool, which we think got corrupted 
by a stonith event when a disk in another pool started reporting a predicted 
failure.


A zpool scrub has been done, and there are 5 files with permanent errors 
(zpool status -v):

errors: Permanent errors have been detected in the following files:

cos8-ost6/ost6:<0xe>
cos8-ost6/ost6:<0x1a>
cos8-ost6/ost6:<0x1c>
cos8-ost6/ost6:/
cos8-ost6/ost6:<0x193>

The fact that / is corrupted seems to worry me!
If we set the canmount=on property and mount the zpool, then an ls of the 
mount point gives an Input/output error.


Does anyone have experience with how to repair this?

There is no hardware problem, all 12 disks within this z2 pool are fine - 
we think the stonith must have caused it - though I thought zfs was 
supposed to be immune to that!


Thanks...


On Tue, 30 Nov 2021, Tommi Tervo wrote:




Upon attempting to mount a zfs OST, we are getting:
Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
 kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini())
ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
 kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG


Hi,

Looks like LU-12675 - time to upgrade to 2.12.7?

https://jira.whamcloud.com/browse/LU-12675

HTH,
-Tommi


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden
Additional info - exporting the pool, importing on another (HA) server and 
attempting to mount there also has the same problem, i.e. a kernel panic, 
and the trace shown below.


A writeconf does not help.

On Mon, 29 Nov 2021, Alastair Basden wrote:



Some more information.  This is repeatable... (previously the file system
has been fine - it's an established file system).

To get this, we boot the node, and then do:
zpool import -o cachefile=none  pool1
zpool status shows all is well.

mount -t lustre pool1/pool1 /mnt/lustre/pool1

And the kernel panic.


Some additional logs in /var/log/messages:
Nov 29 18:37:54 c8oss01 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 128, 
npartitions: 2

Nov 29 18:37:54 c8oss01 kernel: alg: No test for adler32 (adler32-zlib)
Nov 29 18:37:55 c8oss01 kernel: Lustre: Lustre: Build Version: 2.12.6
Nov 29 18:37:55 c8oss01 kernel: LNet: 
40260:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface em2: 
it's down

Nov 29 18:37:55 c8oss01 kernel: LNet: Using FastReg for registration
Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.18.185.5@o2ib 
[32/512/0/100]
Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.17.185.5@tcp 
[8/256/0/180]

Nov 29 18:37:55 c8oss01 kernel: LNet: Accept secure, port 988
Nov 29 18:37:55 c8oss01 zed: eid=85 class=data pool_guid=0x07C7BF473C816BCB
Nov 29 18:37:55 c8oss01 kernel: LustreError: 
40228:0:(lu_object.c:1267:lu_device_fini()) ASSERTION( 
atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1
Nov 29 18:37:55 c8oss01 kernel: LustreError: 
40228:0:(lu_object.c:1267:lu_device_fini()) LBUG
Nov 29 18:37:55 c8oss01 kernel: Pid: 40228, comm: mount.lustre 
3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
Nov 29 18:37:55 c8oss01 zed: eid=86 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk

Nov 29 18:37:55 c8oss01 kernel: Call Trace:
Nov 29 18:37:56 c8oss01 kernel: [] 
libcfs_call_trace+0x8c/0xc0 [libcfs]
Nov 29 18:37:56 c8oss01 zed: eid=87 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 kernel: [] lbug_with_loc+0x4c/0xa0 
[libcfs]
Nov 29 18:37:56 c8oss01 kernel: [] lu_device_fini+0xbb/0xc0 
[obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=88 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 kernel: [] dt_device_fini+0xe/0x10 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
osd_device_alloc+0x278/0x3b0 [osd_zfs]
Nov 29 18:37:56 c8oss01 zed: eid=89 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 kernel: [] obd_setup+0x119/0x280 
[obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=90 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathaf
Nov 29 18:37:56 c8oss01 kernel: [] class_setup+0x2a8/0x840 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
class_process_config+0x1726/0x2830 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=91 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathep
Nov 29 18:37:56 c8oss01 kernel: [] do_lcfg+0x258/0x500 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
lustre_start_simple+0x88/0x210 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=92 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathk
Nov 29 18:37:56 c8oss01 kernel: [] 
server_fill_super+0xf55/0x1890 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=93 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 kernel: [] 
lustre_fill_super+0x468/0x960 [obdclass]

Nov 29 18:37:56 c8oss01 kernel: [] mount_nodev+0x4f/0xb0
Nov 29 18:37:56 c8oss01 zed: eid=94 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 kernel: [] lustre_mount+0x38/0x60 
[obdclass]

Nov 29 18:37:56 c8oss01 kernel: [] mount_fs+0x3e/0x1b0
Nov 29 18:37:56 c8oss01 zed: eid=95 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 kernel: [] 
vfs_kern_mount+0x67/0x110

Nov 29 18:37:56 c8oss01 kernel: [] do_mount+0x1ef/0xd00
Nov 29 18:37:56 c8oss01 zed: eid=96 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathaf

Nov 29 18:37:56 c8oss01 kernel: [] SyS_mount+0x83/0xd0
Nov 29 18:37:56 c8oss01 kernel: [] 
system_call_fastpath+0x25/0x2a
Nov 29 18:37:56 c8oss01 zed: eid=97 class=checksum 
pool_guid=0x07C7BF473C816BCB vdev_path=/dev/mapper/mpathep

Nov 29 18:37:56 c8oss01 kernel: [] 0x

We suspect corruption on the OST caused by a stonith event, but could be
wrong.  Any tips in how to manually solve would be great...

Thanks,
Alastair.

On Mon, 29 Nov 2021, Alastair Basden wrote:



Hi all,

Upon attempting to mount a zfs OST, we are getting:
Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini())
ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Message from syslogd@c8oss01 at Nov 29 18

Re: [lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden
Some more information.  This is repeatable... (previously the file system 
has been fine - it's an established file system).


To get this, we boot the node, and then do:
zpool import -o cachefile=none  pool1
zpool status shows all is well.

mount -t lustre pool1/pool1 /mnt/lustre/pool1

And the kernel panic.


Some additional logs in /var/log/messages:
Nov 29 18:37:54 c8oss01 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 128, 
npartitions: 2
Nov 29 18:37:54 c8oss01 kernel: alg: No test for adler32 (adler32-zlib)
Nov 29 18:37:55 c8oss01 kernel: Lustre: Lustre: Build Version: 2.12.6
Nov 29 18:37:55 c8oss01 kernel: LNet: 
40260:0:(config.c:1641:lnet_inet_enumerate()) lnet: Ignoring interface em2: 
it's down
Nov 29 18:37:55 c8oss01 kernel: LNet: Using FastReg for registration
Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.18.185.5@o2ib [32/512/0/100]
Nov 29 18:37:55 c8oss01 kernel: LNet: Added LNI 172.17.185.5@tcp [8/256/0/180]
Nov 29 18:37:55 c8oss01 kernel: LNet: Accept secure, port 988
Nov 29 18:37:55 c8oss01 zed: eid=85 class=data pool_guid=0x07C7BF473C816BCB
Nov 29 18:37:55 c8oss01 kernel: LustreError: 
40228:0:(lu_object.c:1267:lu_device_fini()) ASSERTION( atomic_read(&d->ld_ref) 
== 0 ) failed: Refcount is 1
Nov 29 18:37:55 c8oss01 kernel: LustreError: 
40228:0:(lu_object.c:1267:lu_device_fini()) LBUG
Nov 29 18:37:55 c8oss01 kernel: Pid: 40228, comm: mount.lustre 
3.10.0-1160.2.1.el7_lustre.x86_64 #1 SMP Wed Dec 9 20:53:35 UTC 2020
Nov 29 18:37:55 c8oss01 zed: eid=86 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathk
Nov 29 18:37:55 c8oss01 kernel: Call Trace:
Nov 29 18:37:56 c8oss01 kernel: [] 
libcfs_call_trace+0x8c/0xc0 [libcfs]
Nov 29 18:37:56 c8oss01 zed: eid=87 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 kernel: [] lbug_with_loc+0x4c/0xa0 
[libcfs]
Nov 29 18:37:56 c8oss01 kernel: [] lu_device_fini+0xbb/0xc0 
[obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=88 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 kernel: [] dt_device_fini+0xe/0x10 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
osd_device_alloc+0x278/0x3b0 [osd_zfs]
Nov 29 18:37:56 c8oss01 zed: eid=89 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 kernel: [] obd_setup+0x119/0x280 
[obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=90 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathaf
Nov 29 18:37:56 c8oss01 kernel: [] class_setup+0x2a8/0x840 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
class_process_config+0x1726/0x2830 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=91 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathep
Nov 29 18:37:56 c8oss01 kernel: [] do_lcfg+0x258/0x500 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] 
lustre_start_simple+0x88/0x210 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=92 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathk
Nov 29 18:37:56 c8oss01 kernel: [] 
server_fill_super+0xf55/0x1890 [obdclass]
Nov 29 18:37:56 c8oss01 zed: eid=93 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpatheg
Nov 29 18:37:56 c8oss01 kernel: [] 
lustre_fill_super+0x468/0x960 [obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] mount_nodev+0x4f/0xb0
Nov 29 18:37:56 c8oss01 zed: eid=94 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathbj
Nov 29 18:37:56 c8oss01 kernel: [] lustre_mount+0x38/0x60 
[obdclass]
Nov 29 18:37:56 c8oss01 kernel: [] mount_fs+0x3e/0x1b0
Nov 29 18:37:56 c8oss01 zed: eid=95 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathag
Nov 29 18:37:56 c8oss01 kernel: [] vfs_kern_mount+0x67/0x110
Nov 29 18:37:56 c8oss01 kernel: [] do_mount+0x1ef/0xd00
Nov 29 18:37:56 c8oss01 zed: eid=96 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathaf
Nov 29 18:37:56 c8oss01 kernel: [] SyS_mount+0x83/0xd0
Nov 29 18:37:56 c8oss01 kernel: [] 
system_call_fastpath+0x25/0x2a
Nov 29 18:37:56 c8oss01 zed: eid=97 class=checksum pool_guid=0x07C7BF473C816BCB 
vdev_path=/dev/mapper/mpathep
Nov 29 18:37:56 c8oss01 kernel: [] 0x

We suspect corruption on the OST caused by a stonith event, but could be 
wrong.  Any tips in how to manually solve would be great...


Thanks,
Alastair.

On Mon, 29 Nov 2021, Alastair Basden wrote:



Hi all,

Upon attempting to mount a zfs OST, we are getting:
Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini())
ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1

Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG


Followed by a system freeze.

Has anyone else seen this?  Any ideas?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lust

[lustre-discuss] 2.12.6 freeze

2021-11-29 Thread Alastair Basden

Hi all,

Upon attempting to mount a zfs OST, we are getting:
Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
 kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) 
ASSERTION( atomic_read(&d->ld_ref) == 0 ) failed: Refcount is 1


Message from syslogd@c8oss01 at Nov 29 18:11:47 ...
 kernel:LustreError: 58223:0:(lu_object.c:1267:lu_device_fini()) LBUG


Followed by a system freeze.

Has anyone else seen this?  Any ideas?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full OST

2021-09-22 Thread Alastair Basden

Hi all,

Some further developments, which we don't understand.

As files on this OST get written and deleted, it seems that they are 
removed from the MDS, but not actually deleted from the OST.  The OST then 
gradually fills up.


If we do a (on the mds):
lctl set_param osc.snap8-OST004e-*.active=0
lctl set_param osc.snap8-OST004e-*.active=1

it then immediately empties itself of all the removed files.

It then proceeds to fill up again as stuff is written and removed.

This is repeatable - we've been through this cycle twice now.

So the question is, why aren't the objects on the OST being deleted as 
expected?
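
Before resorting to the deactivate/reactivate workaround, it may be worth
checking whether the MDT's unlink log for that OST is backing up, which
would explain objects only being removed as orphan cleanup at reconnect
time.  A sketch (parameter names as in 2.12; the -osc-MDT0000 suffix is an
assumption):

lctl get_param osp.snap8-OST004e-osc-MDT0000.sync_changes        # pending unlink/setattr records
lctl get_param osp.snap8-OST004e-osc-MDT0000.destroys_in_flight  # destroys queued to the OST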


Log messages from the MDS:
Sep 22 09:24:37 c8snapmds1 kernel: Lustre: setting import snap8-OST004e_UUID 
INACTIVE by administrator request
Sep 22 09:24:37 c8snapmds1 kernel: Lustre: Skipped 3 previous similar messages
Sep 22 09:24:39 c8snapmds1 kernel: Lustre: snap8-OST004e-osc-MDT: 
Connection to snap8-OST004e (at 172.18.185.50@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
Sep 22 09:24:39 c8snapmds1 kernel: Lustre: Skipped 3 previous similar messages
Sep 22 09:24:40 c8snapmds1 kernel: LustreError: 
4726:0:(import.c:1297:ptlrpc_connect_interpret()) snap8-OST004e_UUID went back 
in time (transno 98789932294 was previously committed, server now claims 
12890210835)!  See https://bugzilla.lustre.org/show_bug.cgi?id=9646
Sep 22 09:24:40 c8snapmds1 kernel: LustreError: 167-0: 
snap8-OST004e-osc-MDT: This client was evicted by snap8-OST004e; in 
progress operations using this service will fail.
Sep 22 09:24:40 c8snapmds1 kernel: LustreError: Skipped 3 previous similar 
messages
Sep 22 09:24:40 c8snapmds1 kernel: Lustre: snap8-OST004e-osc-MDT: 
Connection restored to 172.18.185.80@o2ib (at 172.18.185.50@o2ib)
Sep 22 09:24:40 c8snapmds1 kernel: Lustre: Skipped 3 previous similar messages


And on the OSS:
Sep 22 09:24:39 c8snaposs10 kernel: Lustre: snap8-OST004e: Client 
snap8-MDT-mdtlov_UUID (at 172.18.185.40@o2ib) reconnecting
Sep 22 09:24:39 c8snaposs10 kernel: Lustre: Skipped 3 previous similar messages
Sep 22 09:24:39 c8snaposs10 kernel: Lustre: snap8-OST004e: Connection restored 
to snap8-MDT-mdtlov_UUID (at 172.18.185.40@o2ib)
Sep 22 09:24:40 c8snaposs10 kernel: Lustre: Skipped 3 previous similar messages
Sep 22 09:24:40 c8snaposs10 kernel: Lustre: snap8-OST004e: deleting orphan 
objects from 0x0:41496 to 0x0:41537
Sep 22 09:24:40 c8snaposs10 kernel: Lustre: snap8-OST004e: deleting orphan 
objects from 0x23c402:642 to 0x23c402:737
Sep 22 09:24:40 c8snaposs10 kernel: Lustre: snap8-OST004e: deleting orphan 
objects from 0x23c401:642 to 0x23c401:737
Sep 22 09:24:40 c8snaposs10 kernel: Lustre: snap8-OST004e: deleting orphan 
objects from 0x23c400:1517 to 0x23c400:1537

The OSS also contains other OSTs which aren't seeing any problems.

Lustre 2.12.6.

Thanks,
Alastair.


On Thu, 9 Sep 2021, Andreas Dilger wrote:




On Sep 8, 2021, at 04:42, Alastair Basden 
<a.g.bas...@durham.ac.uk> wrote:


Next step would be to unmount OST004e, run a full e2fsck, and then check lost+found 
and/or a regular "find /mnt/ost -type f -size +1M" or similar to find where the 
files are.


Thanks.  e2fsck returns clean (on its own, with -p and with -f).

Now, the find command does return a large number of files belonging to usera - 
and of sufficient size to fill up the disk.

e.g. /mnt/ost/O/0/d3/29379 has a size 2.3G.

If you run 'll_decode_filter_fid /mnt/ost/O/0/d3/29379' or 'debugfs -c -R "stat O/0/d3/29379" 
/dev/' it will print the *parent* (MDT) FID suitable for "lfs fid2path" on a 
client.  This probably won't work, but worth a try anyway.

So it would seem that these files are getting deleted from the mds, but not 
from this OST.  Has this been seen before?  The other OSTs seem fine - stuff 
getting deleted as expected.

Based on the very low object number, I would guess that these are old files and relate to 
some kind of issue seen in the past (e.g. MDT corruption where e2fsck cleared some 
inodes, or similar).  The "debugfs stat" command above will also print the 
object creation time along with the normal timestamps.

Is it safe to simply remove all these files, and then remount etc?  How can we 
ensure that new files will be deleted from the OST in the future?

If they are not referenced by any in-use file (per fid2path) then yes.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud









___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full OST

2021-09-16 Thread Alastair Basden

Hi Cory,

Servers and clients are all 2.12.6, and were installed as such (i.e. 
haven't been updated from an older version).


Cheers,
Alastair.

On Thu, 16 Sep 2021, Spitz, Cory James wrote:



What versions do you have on your servers and clients?  Do you have some wide 
gap in versions?  Is your server very old?

There was a change to the object deletion protocol that you may need to 
contend with.  It was related to LU-5814.  If you don't have an older 
server then this is not your problem.


But if that's the case you'll need to keep manually deleting objects 
unless or until you get the older software replaced (or change your 
clients to operate the old way).


-Cory


On 9/16/21, 3:45 AM, "lustre-discuss on behalf of Alastair Basden" 
 wrote:

   Hi all,

   We mounted as ext4, removed the files, and then remounted as lustre (and
   did the lfsck scans).

   All seemed fine, and the OST went back into production.

   However, it again has the same problem - it is filling up.  Currently
   lfs df reports it as 89% full with 4.8TB used.

   However, an lfs find --ost=... can only account for 268GB.

   So I again suspect that there are unlinked/deleted files, which aren't
   actually being deleted.

   Does anyone have any idea how to get it deleting files correctly?  All the
   other OSTs are behaving perfectly fine (including those served by the same
   OSS).

   Cheers,
   Alastair.



   On Thu, 9 Sep 2021, Andreas Dilger wrote:

   >
   >
   > On Sep 8, 2021, at 04:42, Alastair Basden <a.g.bas...@durham.ac.uk> wrote:
   >
   >
   > Next step would be to unmount OST004e, run a full e2fsck, and then check lost+found 
and/or a regular "find /mnt/ost -type f -size +1M" or similar to find where the 
files are.
   >
   >
   > Thanks.  e2fsck returns clean (on its own, with -p and with -f).
   >
   > Now, the find command does return a large number of files belonging to 
usera - and of sufficient size to fill up the disk.
   >
   > e.g. /mnt/ost/O/0/d3/29379 has a size 2.3G.
   >
   > If you run 'll_decode_filter_fid /mnt/ost/O/0/d3/29379' or 'debugfs -c -R "stat 
O/0/d3/29379" /dev/' it will print the *parent* (MDT) FID suitable for "lfs 
fid2path" on a client.  This probably won't work, but worth a try anyway.
   >
   > So it would seem that these files are getting deleted from the mds, but 
not from this OST.  Has this been seen before?  The other OSTs seem fine - stuff 
getting deleted as expected.
   >
   > Based on the very low object number, I would guess that these are old files and 
relate to some kind of issue seen in the past (e.g. MDT corruption where e2fsck cleared some 
inodes, or similar).  The "debugfs stat" command above will also print the object 
creation time along with the normal timestamps.
   >
   > Is it safe to simply remove all these files, and then remount etc?  How 
can we ensure that new files will be deleted from the OST in the future?
   >
   > If they are not referenced by any in-use file (per fid2path) then yes.
   >
   > Cheers, Andreas
   > --
   > Andreas Dilger
   > Lustre Principal Architect
   > Whamcloud
   >
   >
   >
   >
   >
   >
   >
   >
   ___
   lustre-discuss mailing list
   lustre-discuss@lists.lustre.org
   http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full OST

2021-09-16 Thread Alastair Basden

Hi all,

We mounted as ext4, removed the files, and then remounted as lustre (and 
did the lfsck scans).


All seemed fine, and the OST went back into production.

However, it again has the same problem - it is filling up.  Currently
lfs df reports it as 89% full with 4.8TB used.

However, an lfs find --ost=... can only account for 268GB.

So I again suspect that there are unlinked/deleted files, which aren't 
actually being deleted.


Does anyone have any idea how to get it deleting files correctly?  All the 
other OSTs are behaving perfectly fine (including those served by the same 
OSS).
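
A way to cross-check where the space has actually gone, directly on the
OST (a sketch: read-only ldiskfs mount; device and mount point names are
illustrative), is:

umount /mnt/lustre/ost004e                 # take the OST out of service first
mount -t ldiskfs -o ro /dev/nvme6n1 /mnt/ostcheck
du -sh /mnt/ostcheck/O/0/d*                # space per object directory
find /mnt/ostcheck/O -type f -size +1M -printf '%s %p\n' | sort -n | tail
umount /mnt/ostcheck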


Cheers,
Alastair.



On Thu, 9 Sep 2021, Andreas Dilger wrote:




On Sep 8, 2021, at 04:42, Alastair Basden 
<a.g.bas...@durham.ac.uk> wrote:


Next step would be to unmount OST004e, run a full e2fsck, and then check lost+found 
and/or a regular "find /mnt/ost -type f -size +1M" or similar to find where the 
files are.


Thanks.  e2fsck returns clean (on its own, with -p and with -f).

Now, the find command does return a large number of files belonging to usera - 
and of sufficient size to fill up the disk.

e.g. /mnt/ost/O/0/d3/29379 has a size 2.3G.

If you run 'll_decode_filter_fid /mnt/ost/O/0/d3/29379' or 'debugfs -c -R "stat O/0/d3/29379" 
/dev/' it will print the *parent* (MDT) FID suitable for "lfs fid2path" on a 
client.  This probably won't work, but worth a try anyway.

So it would seem that these files are getting deleted from the mds, but not 
from this OST.  Has this been seen before?  The other OSTs seem fine - stuff 
getting deleted as expected.

Based on the very low object number, I would guess that these are old files and relate to 
some kind of issue seen in the past (e.g. MDT corruption where e2fsck cleared some 
inodes, or similar).  The "debugfs stat" command above will also print the 
object creation time along with the normal timestamps.

Is it safe to simply remove all these files, and then remount etc?  How can we 
ensure that new files will be deleted from the OST in the future?

If they are not referenced by any in-use file (per fid2path) then yes.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud









___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full OST

2021-09-08 Thread Alastair Basden




Next step would be to unmount OST004e, run a full e2fsck, and then check lost+found 
and/or a regular "find /mnt/ost -type f -size +1M" or similar to find where the 
files are.



Thanks.  e2fsck returns clean (on its own, with -p and with -f).

Now, the find command does return a large number of files belonging to 
usera - and of sufficient size to fill up the disk.


e.g. /mnt/ost/O/0/d3/29379 has a size 2.3G.

So it would seem that these files are getting deleted from the mds, but 
not from this OST.  Has this been seen before?  The other OSTs seem fine - 
stuff getting deleted as expected.


Is it safe to simply remove all these files, and then remount etc?  How 
can we ensure that new files will be deleted from the OST in the future?


Cheers,
Alastair.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full OST

2021-09-06 Thread Alastair Basden

Hi Aurélien,

Thanks.

Within O/1/d0 to O/1/d31, these are empty directories.
Within O/0/d0 to d31, these have some files in them.  However, of the ones 
I've tried, the

lfs fid2path /snap8 [0x1004e:0xe0:0x0]
returns e.g.
lfs fid2path: cannot find '[0x1004e:0xe0:0x0]': Invalid argument

where the fid comes from e.g.
debugfs:  stat O/0/d0/224
Inode: 1850   Type: regular    Mode:  07666   Flags: 0x8
Generation: 2411783677Version: 0x:
User: 0   Group: 0   Project: 0   Size: 0
File ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0Number: 0Size: 0
 ctime: 0x: -- Thu Jan  1 01:00:00 1970
 atime: 0x: -- Thu Jan  1 01:00:00 1970
 mtime: 0x: -- Thu Jan  1 01:00:00 1970
crtime: 0x6101806b:31543ab0 -- Wed Jul 28 17:06:03 2021
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x1004e:0xe0:0x0] compat=8 incompat=0
EXTENTS:


The O/10 directory also only contains empty directories.

Some of the others do contain regular files, but for all that I've tried, 
the fid2path returns
lfs fid2path: cannot find '[0x23c401:0x260:0x0]': No such file or 
directory

or the Invalid argument message.

The size of the objects, as returned by stat, is also always 0 in the 
cases that I've seen (perhaps it is supposed to be, I don't know!).
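
For hunting down the objects that are actually consuming space (rather than
the empty ones), a sketch, with the device name and object path taken from
elsewhere in this thread as illustrations, is to list sizes with debugfs and
then decode the parent FID of anything large:

debugfs -c -R "ls -l O/0/d3" /dev/nvme6n1       # shows object sizes in that directory
ll_decode_filter_fid /mnt/ost/O/0/d3/29379      # prints the parent MDT FID (needs a read-only ldiskfs mount)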


Cheers,
Alastair.


On Mon, 6 Sep 2021, Degremont, Aurelien wrote:



Hi


   Not quite sure what you meant by the O/*/d* as there are no directories
  within O/, and there is no d/ or d*/ either at top level or within O/


As you can confirm from the 'stat' output you provided, '23c400' is a 
directory, and in fact all of the other entries are too.
It is not obvious, but the 2nd column encodes the file type and permissions: a 
leading '4' means directory.

I think Andreas is referring especially to directories '0', '1' and '10' in your 
output.
Try looking into them, you should see multiple 'dXX' directories with objects 
in them.

Aurélien


On 06/09/2021 10:12, "Alastair Basden"  wrote:




   Hi Andreas,

   Thanks.

   With debugfs /dev/nvme6n1, I get:
   debugfs:  ls -l O
 393217   40755 (2)  0  04096 28-Jul-2021 17:06 .
  2   40755 (2)  0  04096 28-Jul-2021 17:02 ..
 393218   40755 (2)  0  04096 28-Jul-2021 17:02 20003
 524291   40755 (2)  0  04096 28-Jul-2021 17:02 1
 655364   40755 (2)  0  04096 28-Jul-2021 17:02 10
 786437   40755 (2)  0  04096 28-Jul-2021 17:06 0
 917510   40755 (2)  0  04096 28-Jul-2021 17:06 23c402
 1048583   40755 (2)  0  04096 28-Jul-2021 17:06 23c401
 1179656   40755 (2)  0  04096 28-Jul-2021 17:06 23c400

   Then e.g.:
   debugfs:  stat O/23c400
    Inode: 1179656   Type: directory    Mode:  0755   Flags: 0x8
   Generation: 2411782533Version: 0x:
   User: 0   Group: 0   Project: 0   Size: 4096
   File ACL: 0
   Links: 34   Blockcount: 8
   Fragment:  Address: 0Number: 0Size: 0
 ctime: 0x6101806b:306016bc -- Wed Jul 28 17:06:03 2021
 atime: 0x6101806b:2d83aad8 -- Wed Jul 28 17:06:03 2021
 mtime: 0x6101806b:306016bc -- Wed Jul 28 17:06:03 2021
   crtime: 0x6101806b:2d83aad8 -- Wed Jul 28 17:06:03 2021
   Size of extra inode fields: 32
   Extended attributes:
  lma: fid=[0x120008:0x8fc0e185:0x0] compat=c incompat=0
   EXTENTS:
   (0):33989


   But then on a client:
   lfs fid2path /snap8 [0x120008:0x8fc0e185:0x0]
   lfs fid2path: cannot find '[0x120008:0x8fc0e185:0x0]': No such file or
   directory

   (and likewise for the others).

   Not quite sure what you meant by the O/*/d* as there are no directories
   within O/, and there is no d/ or d*/ either at top level or within O/


   Running (on the OST):
   lctl lfsck_start -M snap8-OST004e
   seems to work (at least, doesn't return any error).

   However, lctl lfsck_query -M snap8-OST004e   gives:
   Fail to query LFSCK: Inappropriate ioctl for device


   Thanks,
   Alastair.


   On Sat, 4 Sep 2021, Andreas Dilger wrote:

   >
   > You could run debugfs on that OST and use "ls -l" to examine the O/*/d* directories for large 
objects, then "stat" any suspicious objects within debugfs to dump the parent FID, and "lfs 
fid2path" on a client to determine the path.
   >
   > Alternately, see "lctl-lfsck-start.8" man page for options to link orphan 
objects to the .lustre/lost+found directory if you think there are no files referencing 
those objects.
   >
   > Cheers, Andreas
   >
   >> On Sep 4, 2021, at 00:54, Alastair Basden  wrote:
   >>
   >> Ah, of course - has to be done on a client.
   >

Re: [lustre-discuss] Full OST

2021-09-06 Thread Alastair Basden

Hi Andreas,

Thanks.

With debugfs /dev/nvme6n1, I get:
debugfs:  ls -l O
 393217   40755 (2)  0  04096 28-Jul-2021 17:06 .
  2   40755 (2)  0  04096 28-Jul-2021 17:02 ..
 393218   40755 (2)  0  04096 28-Jul-2021 17:02 20003
 524291   40755 (2)  0  04096 28-Jul-2021 17:02 1
 655364   40755 (2)  0  04096 28-Jul-2021 17:02 10
 786437   40755 (2)  0  04096 28-Jul-2021 17:06 0
 917510   40755 (2)  0  04096 28-Jul-2021 17:06 23c402
 1048583   40755 (2)  0  04096 28-Jul-2021 17:06 23c401
 1179656   40755 (2)  0  04096 28-Jul-2021 17:06 23c400

Then e.g.:
debugfs:  stat O/23c400
Inode: 1179656   Type: directory    Mode:  0755   Flags: 0x8
Generation: 2411782533Version: 0x:
User: 0   Group: 0   Project: 0   Size: 4096
File ACL: 0
Links: 34   Blockcount: 8
Fragment:  Address: 0Number: 0Size: 0
 ctime: 0x6101806b:306016bc -- Wed Jul 28 17:06:03 2021
 atime: 0x6101806b:2d83aad8 -- Wed Jul 28 17:06:03 2021
 mtime: 0x6101806b:306016bc -- Wed Jul 28 17:06:03 2021
crtime: 0x6101806b:2d83aad8 -- Wed Jul 28 17:06:03 2021
Size of extra inode fields: 32
Extended attributes:
  lma: fid=[0x120008:0x8fc0e185:0x0] compat=c incompat=0
EXTENTS:
(0):33989


But then on a client:
lfs fid2path /snap8 [0x120008:0x8fc0e185:0x0]
lfs fid2path: cannot find '[0x120008:0x8fc0e185:0x0]': No such file or 
directory


(and likewise for the others).

Not quite sure what you meant by the O/*/d* as there are no directories 
within O/, and there is no d/ or d*/ either at top level or within O/



Running (on the OST):
lctl lfsck_start -M snap8-OST004e
seems to work (at least, doesn't return any error).

However, lctl lfsck_query -M snap8-OST004e   gives:
Fail to query LFSCK: Inappropriate ioctl for device
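
LFSCK is coordinated by the MDT, so the query probably needs to be pointed
at the MDT device rather than the OST.  A sketch (the MDT name
snap8-MDT0000 is an assumption) that also links orphan OST objects into
.lustre/lost+found, as the lctl-lfsck-start.8 man page describes, is:

lctl lfsck_start -M snap8-MDT0000 -t layout -o   # -o handles orphan OST objects
lctl lfsck_query -M snap8-MDT0000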


Thanks,
Alastair.


On Sat, 4 Sep 2021, Andreas Dilger wrote:



You could run debugfs on that OST and use "ls -l" to examine the O/*/d* directories for large 
objects, then "stat" any suspicious objects within debugfs to dump the parent FID, and "lfs 
fid2path" on a client to determine the path.

Alternately, see "lctl-lfsck-start.8" man page for options to link orphan 
objects to the .lustre/lost+found directory if you think there are no files referencing 
those objects.

Cheers, Andreas


On Sep 4, 2021, at 00:54, Alastair Basden  wrote:

Ah, of course - has to be done on a client.

None of these files are on the dodgy OST.

Any further suggestions?  Essentially we have what seems to be a full OST with 
nothing on it.

Thanks,
Alastair.


On Sat, 4 Sep 2021, Andreas Dilger wrote:

$ man lfs-fid2path.1
lfs-fid2path(1)   user utilities
 lfs-fid2path(1)

NAME
 lfs fid2path - print the pathname(s) for a file identifier

SYNOPSIS
 lfs fid2path [OPTION]...  ...

DESCRIPTION
 lfs  fid2path  maps  a  numeric  Lustre File IDentifier (FID) to one or 
more pathnames
 that have hard links to that file.  This allows resolving filenames for 
FIDs used in console
 error messages, and resolving all of the pathnames for a file that has 
multiple hard links.
 Pathnames are resolved relative to the MOUNT_POINT specified, or relative 
to the
 filesystem mount point if FSNAME is provided.

OPTIONS
 -f, --print-fid
Print the FID with the path.

 -c, --print-link
Print the current link number with each pathname or parent 
directory.

 -l, --link=LINK
If a file has multiple hard links, then print only the specified 
LINK, starting at link 0.
If multiple FIDs are given, but only one pathname is needed for 
each file, use --link=0.

EXAMPLES
 $ lfs fid2path /mnt/testfs [0x20403:0x11f:0x0]
/mnt/testfs/etc/hosts


On Sep 3, 2021, at 14:51, Alastair Basden 
<a.g.bas...@durham.ac.uk> wrote:

Hi,

lctl get_param mdt.*.exports.*.open_files  returns:
mdt.snap8-MDT.exports.172.18.180.21@o2ib.open_files=
[0x2b90e:0x10aa:0x0]
mdt.snap8-MDT.exports.172.18.180.22@o2ib.open_files=
[0x2b90e:0x21b3:0x0]
mdt.snap8-MDT.exports.172.18.181.19@o2ib.open_files=
[0x2b90e:0x21b3:0x0]
[0x2b90e:0x21b4:0x0]
[0x2b90c:0x1574:0x0]
[0x2b90c:0x1575:0x0]
[0x2b90c:0x1576:0x0]

Doesn't seem to be many open, so I don't think it's a problem of open files.

Not sure which bit of this I need to use with lfs fid2path either...

Cheers,
Alastair.


On Fri, 3 Sep 2021, Andreas Dilger wrote:

You can also check "mdt.*.exports.*.open_files" on the MDTs for a list of FIDs open on 
each client, and use "lfs fid2path" to resolve them to a pathname.

On Sep 3, 2021, at 02:09, Degremont, Aurelien via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Re: [lustre-discuss] Full OST

2021-09-04 Thread Alastair Basden

Ah, of course - has to be done on a client.

None of these files are on the dodgy OST.

Any further suggestions?  Essentially we have what seems to be a full OST 
with nothing on it.


Thanks,
Alastair.

On Sat, 4 Sep 2021, Andreas Dilger wrote:


$ man lfs-fid2path.1
lfs-fid2path(1)   user utilities
 lfs-fid2path(1)

NAME
  lfs fid2path - print the pathname(s) for a file identifier

SYNOPSIS
  lfs fid2path [OPTION]...  ...

DESCRIPTION
  lfs  fid2path  maps  a  numeric  Lustre File IDentifier (FID) to one or 
more pathnames
  that have hard links to that file.  This allows resolving filenames for 
FIDs used in console
  error messages, and resolving all of the pathnames for a file that has 
multiple hard links.
  Pathnames are resolved relative to the MOUNT_POINT specified, or relative 
to the
  filesystem mount point if FSNAME is provided.

OPTIONS
  -f, --print-fid
 Print the FID with the path.

  -c, --print-link
 Print the current link number with each pathname or parent 
directory.

  -l, --link=LINK
 If a file has multiple hard links, then print only the specified 
LINK, starting at link 0.
 If multiple FIDs are given, but only one pathname is needed for 
each file, use --link=0.

EXAMPLES
  $ lfs fid2path /mnt/testfs [0x20403:0x11f:0x0]
 /mnt/testfs/etc/hosts


On Sep 3, 2021, at 14:51, Alastair Basden 
<a.g.bas...@durham.ac.uk> wrote:

Hi,

lctl get_param mdt.*.exports.*.open_files  returns:
mdt.snap8-MDT.exports.172.18.180.21@o2ib.open_files=
[0x2b90e:0x10aa:0x0]
mdt.snap8-MDT.exports.172.18.180.22@o2ib.open_files=
[0x2b90e:0x21b3:0x0]
mdt.snap8-MDT.exports.172.18.181.19@o2ib.open_files=
[0x2b90e:0x21b3:0x0]
[0x2b90e:0x21b4:0x0]
[0x2b90c:0x1574:0x0]
[0x2b90c:0x1575:0x0]
[0x2b90c:0x1576:0x0]

Doesn't seem to be many open, so I don't think it's a problem of open files.

Not sure which bit of this I need to use with lfs fid2path either...

Cheers,
Alastair.


On Fri, 3 Sep 2021, Andreas Dilger wrote:

You can also check "mdt.*.exports.*.open_files" on the MDTs for a list of FIDs open on 
each client, and use "lfs fid2path" to resolve them to a pathname.

On Sep 3, 2021, at 02:09, Degremont, Aurelien via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hi

It could be a bug, but most of the time, this is due to an open-unlinked file, 
typically a log file which is still in use and some processes keep writing to 
it until it fills the OSTs it is using.

Look for such files on your clients (use lsof).

Aurélien


On 03/09/2021 09:50, "lustre-discuss on behalf of Alastair Basden" 
<lustre-discuss-boun...@lists.lustre.org on behalf of a.g.bas...@durham.ac.uk> wrote:




Hi,

We have a file system where each OST is a single SSD.

One of those is reporting as 100% full (lfs df -h /snap8):
snap8-OST004d_UUID  5.8T  2.0T  3.5T   37% /snap8[OST:77]
snap8-OST004e_UUID  5.8T  5.5T  7.5G  100% /snap8[OST:78]
snap8-OST004f_UUID  5.8T  2.0T  3.4T   38% /snap8[OST:79]

However, I can't find any files on it:
lfs find --ost snap8-OST004e /snap8/
returns nothing.

I guess that it has filled up, and that there is some bug or other that is
now preventing proper behaviour - but I could be wrong.

Does anyone have any suggestions?

Essentially, I'd like to find some of the files and delete or migrate
some, and thus return it to useful production.

Cheers,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud








Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full OST

2021-09-03 Thread Alastair Basden

Hi,

lctl get_param mdt.*.exports.*.open_files  returns:
mdt.snap8-MDT.exports.172.18.180.21@o2ib.open_files=
[0x2b90e:0x10aa:0x0]
mdt.snap8-MDT.exports.172.18.180.22@o2ib.open_files=
[0x2b90e:0x21b3:0x0]
mdt.snap8-MDT.exports.172.18.181.19@o2ib.open_files=
[0x2b90e:0x21b3:0x0]
[0x2b90e:0x21b4:0x0]
[0x2b90c:0x1574:0x0]
[0x2b90c:0x1575:0x0]
[0x2b90c:0x1576:0x0]

Doesn't seem to be many open, so I don't think it's a problem of open 
files.


Not sure which bit of this I need to use with lfs fid2path either...

Cheers,
Alastair.


On Fri, 3 Sep 2021, Andreas Dilger wrote:


You can also check "mdt.*.exports.*.open_files" on the MDTs for a list of FIDs open on 
each client, and use "lfs fid2path" to resolve them to a pathname.

On Sep 3, 2021, at 02:09, Degremont, Aurelien via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Hi

It could be a bug, but most of the time, this is due to an open-unlinked file, 
typically a log file which is still in use and some processes keep writing to 
it until it fills the OSTs it is using.

Look for such files on your clients (use lsof).

Aurélien


On 03/09/2021 09:50, "lustre-discuss on behalf of Alastair Basden" 
<lustre-discuss-boun...@lists.lustre.org on behalf of a.g.bas...@durham.ac.uk> wrote:




  Hi,

  We have a file system where each OST is a single SSD.

  One of those is reporting as 100% full (lfs df -h /snap8):
   snap8-OST004d_UUID  5.8T  2.0T  3.5T   37% /snap8[OST:77]
   snap8-OST004e_UUID  5.8T  5.5T  7.5G  100% /snap8[OST:78]
   snap8-OST004f_UUID  5.8T  2.0T  3.4T   38% /snap8[OST:79]

  However, I can't find any files on it:
  lfs find --ost snap8-OST004e /snap8/
  returns nothing.

  I guess that it has filled up, and that there is some bug or other that is
  now preventing proper behaviour - but I could be wrong.

  Does anyone have any suggestions?

  Essentially, I'd like to find some of the files and delete or migrate
  some, and thus return it to useful production.

  Cheers,
  Alastair.
  ___
  lustre-discuss mailing list
   lustre-discuss@lists.lustre.org
  http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Full OST

2021-09-03 Thread Alastair Basden

Hi,

Thanks.  It seems odd that the OST reports having no files on it at all 
(I'd expect several hundred, based on the number of files on the other 
OSTs).


Unless an open-unlinked file would have that effect on lfs find, but I 
don't think it should.


I'm not sure whether lsof would help - the clients could well have files 
open for writing to other OSTs.  But I have tested, and lsof doesn't return any 
open files on that OST on any of the nodes.


Cheers,
Alastair.



On Fri, 3 Sep 2021, Degremont, Aurelien wrote:



Hi

It could be a bug, but most of the time, this is due to an open-unlinked file, 
typically a log file which is still in use and some processes keep writing to 
it until it fills the OSTs it is using.

Look for such files on your clients (use lsof).

Aurélien


On 03/09/2021 09:50, "lustre-discuss on behalf of Alastair Basden" wrote:




   Hi,

   We have a file system where each OST is a single SSD.

   One of those is reporting as 100% full (lfs df -h /snap8):
    snap8-OST004d_UUID  5.8T  2.0T  3.5T   37% /snap8[OST:77]
    snap8-OST004e_UUID  5.8T  5.5T  7.5G  100% /snap8[OST:78]
    snap8-OST004f_UUID  5.8T  2.0T  3.4T   38% /snap8[OST:79]

   However, I can't find any files on it:
   lfs find --ost snap8-OST004e /snap8/
   returns nothing.

   I guess that it has filled up, and that there is some bug or other that is
   now preventing proper behaviour - but I could be wrong.

   Does anyone have any suggestions?

   Essentially, I'd like to find some of the files and delete or migrate
   some, and thus return it to useful production.

   Cheers,
   Alastair.
   ___
   lustre-discuss mailing list
   lustre-discuss@lists.lustre.org
   http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Full OST

2021-09-03 Thread Alastair Basden

Hi,

We have a file system where each OST is a single SSD.

One of those is reporting as 100% full (lfs df -h /snap8):
snap8-OST004d_UUID  5.8T  2.0T  3.5T   37% /snap8[OST:77]
snap8-OST004e_UUID  5.8T  5.5T  7.5G  100% /snap8[OST:78]
snap8-OST004f_UUID  5.8T  2.0T  3.4T   38% /snap8[OST:79]

However, I can't find any files on it:
lfs find --ost snap8-OST004e /snap8/
returns nothing.

I guess that it has filled up, and that there is some bug or other that is 
now preventing proper behaviour - but I could be wrong.


Does anyone have any suggestions?

Essentially, I'd like to find some of the files and delete or migrate 
some, and thus return it to useful production.


Cheers,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST not being used

2021-06-23 Thread Alastair Basden

Hi Megan,

Thanks - yes, lctl ping responds.

In the end, we did a writeconf, and this seems to have fixed the problem, 
so it was probably some earlier transient issue.  I would, however, have 
expected it to heal whilst online - taking the file system down and doing a 
writeconf seems a bit drastic!


Cheers,
Alastair.

On Wed, 23 Jun 2021, Ms. Megan Larko via lustre-discuss wrote:


Hi!

Does the NIC on the OSS that serves OST 4-7 respond to an lctl ping? 
You indicated that it does respond to regular ping, ssh, etc.  I would 
review my /etc/lnet.conf file for the behavior of a NIC that times out. 
Does the conf allow for asymmetrical routing?  (Is that what you wish?) 
Is there only one path to those OSTs, or is there a failover NIC 
address that did not work in this event for some reason?


The Lustre Operations Manual Section 9.1 on lnetctl command shows how you can 
get more info on the NIC ( lnetctl show...)

Good luck.
megan


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] OST not being used

2021-06-21 Thread Alastair Basden

Hi Megan, all,

Yes, sorry, I should have said.  Its 2.12.6.

A bit more detail.  I can set the stripe index to 0-3 and 8-191, and it 
works fine.  However, when I set the stripe index to 4-7, they all end up 
on OST 8.  It is a system with 192 OSTs and 24 OSSs.


These 4 OSTs are all served on the second NIC of the first OSS server, so 
suggests a NIC problem (HDR100).  However, the NIC appears to be fine, I 
can ping it, ssh into it, etc.


lctl dl returns the OSTs as expected.

I suspect that it has at some point been deemed to have failed, and marked 
as such.  However, I can't find such a mark, and can't work out how to return 
it to operation.


Thanks,
Alastair.

On Mon, 21 Jun 2021, Ms. Megan Larko via lustre-discuss wrote:


Greetings Alastair!

You did not indicate which version of Lustre you are using.  FYI, that can be 
useful for aiding you with your Lustre queries.

You show your command "lfs setstripe --stripe-index 7 myfile.dat".  The 
Lustre Operations Manual ( https://doc.lustre.org/lustre_manual.xhtml ) 
Section 40.1.1 "Synopsis" indicates that stripe-index starts counting at 
zero.  My reading of the Manual is that starting at zero and using a 
default stripe count of one might correctly put the file onto obd index 
8.  Depending upon whether obdidx starts at zero or one, eight might 
possibly be the correct result.  Did you try using a stripe-index of 6, 
to see if the resulting single-stripe file then lands on obdidx 7?


If the OST is not usable then the command "lctl dl" will indicate that 
(as does the command you used for active OST devices).  Your info does 
seem to indicate that OST 7 is okay.


Cheers,
megan


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] OST not being used

2021-06-21 Thread Alastair Basden

Hi,

I'm trying to specify a particular OST to be used with:
lfs setstripe --stripe-index 7 myfile.dat

However, a lfs getstripe reveals that it hasn't managed to use this OST:
myfile.dat
lmm_stripe_count:  1
lmm_stripe_size:   1048576
lmm_pattern:   raid0
lmm_layout_gen:0
lmm_stripe_offset: 8
obdidx   objid   objid   group
     8           7503         0x1d4f              0

So, I assume the OST has been marked as unusable for some reason - how do 
I find out why, and set it back to usable?


A:
lctl get_param osc.fsname-*.active
gives 1 for all OSTs.

Likewise:
lctl get_param osp.fsname-OST*.max_create_count
gives 2 for all OSTs.

What else should I check?
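
(In case it helps anyone searching later, other things that can be checked - 
sketched here with placeholder names; "fsname" is not our real filesystem 
name:)

# on the OSS serving the OST: is the device up, and is the target degraded?
lctl dl | grep OST0007
lctl get_param obdfilter.fsname-OST0007.degraded

# on the MDS: is object precreation on that OST still healthy?
lctl get_param osp.fsname-OST0007-*.prealloc_status
lctl get_param osp.fsname-OST0007-*.create_count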

Thanks,
Alastair.



___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Multiple IB Interfaces

2021-03-12 Thread Alastair Basden via lustre-discuss

Hi all,

Thanks for the replies.  The issue as I see it is with sending data from 
an OST to the client, avoiding the inter-CPU link.


So, if I have:
cpu1 - IB card 1 (10.0.0.1), nvme1 (OST1)
cpu2 - IB card 2 (10.0.0.2), nvme2 (OST2)

Both IB cards are on the same subnet.  Therefore, by default, packets will 
be routed out of the server over the preferred card, say IB card 1 (I could 
be wrong, but this is my current understanding, and it seems to be what the 
Lustre manual says).


Data coming in (being written to the OST) is not a problem.  The client 
will know the IP address of the card to which the OST is closest.   So, 
to write to OST2, it will use the 10.0.0.2 address (since this will be 
the IP address given in mkfs.lustre for that OST).


The slight complication here is pinning.  A service thread may run on cpu1, 
in which case the data has to traverse the inter-cpu link twice.  However, 
I am assuming that this won't happen - i.e. that the kernel or Lustre is 
clever enough to place the thread on cpu2.  As far as I am aware, this 
should just work, though please correct me if I'm wrong.  Perhaps I have to 
manually specify pinning - how does one do that with Lustre?


Reading is more problematic.  A request from a client (say 10.0.0.100) for 
data on OST2 will come in via card 2 (10.0.0.2).  A thread on CPU2 
(hopefully) will then read the data from OST2, and send it out to the 
client, 10.0.0.100.  However, here, Linux will route the packet through 
the first card on this subnet, so it will go over the inter-cpu link, and 
out of IB card 1.  And this will be the case even if the thread is pinned 
on CPU2.


The question then is whether there is a way to configure Lustre to use IB 
card 2 when sending out data from OST2.
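
(For later readers: the closest thing I have found to this is LNet's ability 
to bind a network interface to specific CPU partitions, plus Multi-Rail in 
2.10+.  A sketch of both, as I understand them - not verified on this 
hardware:)

# /etc/modprobe.d/lnet.conf - bind each NI to the CPT local to its IB card
options lnet networks="o2ib(ib0)[0],o2ib(ib1)[1]"

# or with dynamic / Multi-Rail configuration:
lnetctl net add --net o2ib --if ib0
lnetctl net add --net o2ib --if ib1
lnetctl net show -v

My understanding is that with Multi-Rail the NI used for outbound traffic is 
chosen by LNet (taking NUMA distance into account, see "lnetctl set 
numa_range") rather than by the kernel routing table, which is what we are 
after.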


Cheers,
Alastair.

On Wed, 10 Mar 2021, Ms. Megan Larko wrote:


[EXTERNAL EMAIL]
Greetings Alastair,

Bonding is supported on InfiniBand, but  I believe that it is only 
active/passive.
I think what you might be looking for WRT avoiding data travel through the inter-cpu link is cpu 
"affinity" AKA cpu "pinning".

Cheers,
megan

WRT = "with regards to"
AKA = "also known as"


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Multiple IB interfaces

2021-03-09 Thread Alastair Basden via lustre-discuss

Hi,

We are installing some new Lustre servers with 2 InfiniBand cards, 1 
attached to each CPU socket.  Storage is nvme, again, some drives attached 
to each socket.


We want to ensure that data to/from each drive uses the appropriate IB 
card, and doesn't need to travel through the inter-cpu link.  Data being 
written is fairly easy I think, we just set that OST to the appropriate IP 
address.  However, data being read may well go out the other NIC, if I 
understand correctly.


What setup do we need for this?

I think probably not bonding, as that will presumably not tie 
NIC interfaces to cpus.  But I also see a note in the Lustre manual:


"""If the server has multiple interfaces on the same subnet, the Linux 
kernel will send all traffic using the first configured interface. This is 
a limitation of Linux, not Lustre. In this case, network interface bonding 
should be used. For more information about network interface bonding, see 
Chapter 7, Setting Up Network Interface Bonding."""


(plus, no idea if bonding is supported on InfiniBand).
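
(As a starting point, the current interface-to-NID mapping can be inspected 
with something like - just a sketch:

lctl list_nids
lnetctl net show -v
)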

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Help mounting MDT

2020-10-05 Thread Alastair Basden

Hi all,

We are having a problem mounting a ldiskfs mdt.  The mount command is 
hanging, with /var/log/messages containing:

Oct  5 16:26:17 c6mds1 kernel: INFO: task mount.lustre:4285 blocked for more 
than 120 seconds.
Oct  5 16:26:17 c6mds1 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  5 16:26:17 c6mds1 kernel: mount.lustre  D 92cd279de2a0     0  4285 
4284 0x0082
Oct  5 16:26:17 c6mds1 kernel: Call Trace:
Oct  5 16:26:17 c6mds1 kernel: [] schedule+0x29/0x70
Oct  5 16:26:17 c6mds1 kernel: [] schedule_timeout+0x221/0x2d0
Oct  5 16:26:17 c6mds1 kernel: [] ? 
enqueue_task_fair+0x208/0x6c0
Oct  5 16:26:17 c6mds1 kernel: [] ? sched_clock_cpu+0x85/0xc0
Oct  5 16:26:17 c6mds1 kernel: [] ? 
check_preempt_curr+0x80/0xa0
Oct  5 16:26:17 c6mds1 kernel: [] ? ttwu_do_wakeup+0x19/0xe0
Oct  5 16:26:17 c6mds1 kernel: [] 
wait_for_completion+0xfd/0x140
Oct  5 16:26:17 c6mds1 kernel: [] ? wake_up_state+0x20/0x20
Oct  5 16:26:17 c6mds1 kernel: [] 
llog_process_or_fork+0x244/0x450 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] llog_process+0x14/0x20 
[obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] 
class_config_parse_llog+0x125/0x350 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] 
mgc_process_cfg_log+0x790/0xc40 [mgc]
Oct  5 16:26:17 c6mds1 kernel: [] mgc_process_log+0x3dc/0x8f0 
[mgc]
Oct  5 16:26:17 c6mds1 kernel: [] ? 
config_recover_log_add+0x13f/0x280 [mgc]
Oct  5 16:26:17 c6mds1 kernel: [] ? 
class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] 
mgc_process_config+0x88b/0x13f0 [mgc]
Oct  5 16:26:17 c6mds1 kernel: [] 
lustre_process_log+0x2d8/0xad0 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] ? 
libcfs_debug_msg+0x57/0x80 [libcfs]
Oct  5 16:26:17 c6mds1 kernel: [] ? 
lprocfs_counter_add+0xf9/0x160 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] 
server_start_targets+0x13a4/0x2a20 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] ? 
lustre_start_mgc+0x260/0x2510 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] ? 
class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] 
server_fill_super+0x10cc/0x1890 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] 
lustre_fill_super+0x328/0x950 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] ? 
lustre_common_put_super+0x270/0x270 [obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] mount_nodev+0x4f/0xb0
Oct  5 16:26:17 c6mds1 kernel: [] lustre_mount+0x38/0x60 
[obdclass]
Oct  5 16:26:17 c6mds1 kernel: [] mount_fs+0x3e/0x1b0
Oct  5 16:26:17 c6mds1 kernel: [] vfs_kern_mount+0x67/0x110
Oct  5 16:26:17 c6mds1 kernel: [] do_mount+0x1ef/0xce0
Oct  5 16:26:17 c6mds1 kernel: [] ? 
__check_object_size+0x1ca/0x250
Oct  5 16:26:17 c6mds1 kernel: [] ? 
kmem_cache_alloc_trace+0x3c/0x200
Oct  5 16:26:17 c6mds1 kernel: [] SyS_mount+0x83/0xd0
Oct  5 16:26:17 c6mds1 kernel: [] 
system_call_fastpath+0x25/0x2a


This is Lustre 2.12.2 on CentOS 7.6

Does anyone have any suggestions?
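
(For anyone who hits the same hang: the trace is stuck processing the 
configuration llog during mount, so things worth trying - suggestions only, 
not verified against this exact case - include:

# check the MGS is reachable over LNet from the MDS
lctl ping <mgs-nid>

# if a corrupt configuration log is suspected, it can be regenerated with a
# full stop of the filesystem and a writeconf of every target
tunefs.lustre --writeconf <target device>
)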

Cheers,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Centos 7.7 upgrade

2020-06-02 Thread Alastair Basden

Hi all,

Thanks - the problem turned out to be the kernel-devel module.
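
(For the archive, my guess at what that fix looks like in practice - assumed 
commands, not a transcript of what we actually ran:

yum install kernel-devel-$(uname -r)
dkms autoinstall
)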

Cheers,
Alastair.

On Tue, 2 Jun 2020, Pascal Suter wrote:


Hi

where you using the rpms from the whamcloud repo?

if so, check if you have installed the kmod-lustre-osd-zfs and 
lustre-osd-zfs-mount packages.


IIRC i had the same errors when the kmod-lustre-osd-zfs package was 
missing on my system.


cheers

Pascal

On 6/2/20 1:20 AM, Alastair Basden wrote:

Hi,

We have just upgraded Lustre servers from 2.12.2 on centos 7.6 to 
2.12.3 on centos 7.7.


The OSSs are on top of zfs (0.7.13 as recommended), and we are using 
3.10.0-1062.1.1.el7_lustre.x86_64


After the update, Lustre will no longer mount - and messages such as:
Jun  2 00:02:44 hostname kernel: LustreError: 158-c: Can't load module 
'osd-zfs'
Jun  2 00:02:44 hostname kernel: LustreError: Skipped 875 previous 
similar messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226253:0:(genops.c:397:class_newdev()) OBD: unknown type: osd-zfs
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_config.c:403:class_attach()) Cannot create device 
lustfs-OST0006-osd of type osd-zfs : -19
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_config.c:403:class_attach()) Skipped 881 previous 
similar messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:197:lustre_start_simple()) lustfs-OST0006-osd 
attach error -19
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:197:lustre_start_simple()) Skipped 881 previous 
similar messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount_server.c:1947:server_fill_super()) Unable to start 
osd on lustfs-ost6/ost6: -19
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount_server.c:1947:server_fill_super()) Skipped 881 
previous similar messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-19)
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:1608:lustre_fill_super()) Skipped 881 previous 
similar messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226253:0:(genops.c:397:class_newdev()) Skipped 887 previous similar 
messages


Does anyone have any ideas?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Centos 7.7 upgrade

2020-06-02 Thread Alastair Basden

Hi Jeff,

Yes, the zfs modules are there - at least, I can see the zpools okay.

But, dkms status reports:
lustre-zfs, 2.12.3, 3.10.0-957.10.1.el7_lustre.x86_64, x86_64: installed 
(WARNING! Diff between built and installed module!) (WARNING! Diff between 
built and installed module!) (WARNING! Diff between built and installed 
module!) (WARNING! Diff between built and installed module!) (WARNING! 
Diff between built and installed module!)

spl, 0.7.13, 3.10.0-1062.9.1.el7.x86_64, x86_64: installed
spl, 0.7.13, 3.10.0-957.10.1.el7_lustre.x86_64, x86_64: installed
zfs, 0.7.13, 3.10.0-1062.9.1.el7.x86_64, x86_64: installed
zfs, 0.7.13, 3.10.0-957.10.1.el7_lustre.x86_64, x86_64: installed

So it would seem there might be problems somewhere (though I've just 
checked on an older 7.6 system which is working, and that also gives the 
same warnings).


Seems to be a problem with the osd-zfs module not loading.
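
What I will try next is roughly this (a sketch; the dkms module name and 
version are taken from the status output above):

# can the module be loaded by hand, and what does it complain about?
modprobe osd_zfs; dmesg | tail

# were the modules actually built/installed for the running kernel?
find /lib/modules/$(uname -r) -name 'osd_zfs*'
dkms status | grep lustre

# if not, rebuild against the running kernel
dkms install lustre-zfs/2.12.3 -k $(uname -r)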

Cheers,
Alastair.

On Mon, 1 Jun 2020, Jeff Johnson wrote:


Alastair,

Are you sure you have functioning ZFS modules for that kernel, and that
they are loaded? Are you able to see your zpools? Did you use DKMS for
either ZFS, Lustre or both? If so, what does `dkms status` report?

--Jeff


On Mon, Jun 1, 2020 at 4:21 PM Alastair Basden 
wrote:


Hi,

We have just upgraded Lustre servers from 2.12.2 on centos 7.6 to 2.12.3
on centos 7.7.

The OSSs are on top of zfs (0.7.13 as recommended), and we are using
3.10.0-1062.1.1.el7_lustre.x86_64

After the update, Lustre will no longer mount - and messages such as:
Jun  2 00:02:44 hostname kernel: LustreError: 158-c: Can't load module
'osd-zfs'
Jun  2 00:02:44 hostname kernel: LustreError: Skipped 875 previous similar
messages
Jun  2 00:02:44 hostname kernel: LustreError:
226253:0:(genops.c:397:class_newdev()) OBD: unknown type: osd-zfs
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_config.c:403:class_attach()) Cannot create device
lustfs-OST0006-osd of type osd-zfs : -19
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_config.c:403:class_attach()) Skipped 881 previous similar
messages
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_mount.c:197:lustre_start_simple()) lustfs-OST0006-osd attach
error -19
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_mount.c:197:lustre_start_simple()) Skipped 881 previous
similar messages
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_mount_server.c:1947:server_fill_super()) Unable to start osd
on lustfs-ost6/ost6: -19
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_mount_server.c:1947:server_fill_super()) Skipped 881 previous
similar messages
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-19)
Jun  2 00:02:44 hostname kernel: LustreError:
226265:0:(obd_mount.c:1608:lustre_fill_super()) Skipped 881 previous
similar messages
Jun  2 00:02:44 hostname kernel: LustreError:
226253:0:(genops.c:397:class_newdev()) Skipped 887 previous similar messages

Does anyone have any ideas?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org




--
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Centos 7.7 upgrade

2020-06-01 Thread Alastair Basden

Hi,

We have just upgraded Lustre servers from 2.12.2 on centos 7.6 to 2.12.3 
on centos 7.7.


The OSSs are on top of zfs (0.7.13 as recommended), and we are using 
3.10.0-1062.1.1.el7_lustre.x86_64


After the update, Lustre will no longer mount - and messages such as:
Jun  2 00:02:44 hostname kernel: LustreError: 158-c: Can't load module 'osd-zfs'
Jun  2 00:02:44 hostname kernel: LustreError: Skipped 875 previous similar 
messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226253:0:(genops.c:397:class_newdev()) OBD: unknown type: osd-zfs
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_config.c:403:class_attach()) Cannot create device 
lustfs-OST0006-osd of type osd-zfs : -19
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_config.c:403:class_attach()) Skipped 881 previous similar messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:197:lustre_start_simple()) lustfs-OST0006-osd attach 
error -19
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:197:lustre_start_simple()) Skipped 881 previous similar 
messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount_server.c:1947:server_fill_super()) Unable to start osd on 
lustfs-ost6/ost6: -19
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount_server.c:1947:server_fill_super()) Skipped 881 previous 
similar messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount  (-19)
Jun  2 00:02:44 hostname kernel: LustreError: 
226265:0:(obd_mount.c:1608:lustre_fill_super()) Skipped 881 previous similar 
messages
Jun  2 00:02:44 hostname kernel: LustreError: 
226253:0:(genops.c:397:class_newdev()) Skipped 887 previous similar messages

Does anyone have any ideas?

Thanks,
Alastair.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org