Re: [lustre-discuss] New client mounts fail after deactivating OSTs
A little late to the party here, but I just ran into this myself and worked around it without having to regenerate everything with --writeconf. I realize that isn't helpful four months after the fact, but I figured I'd post it here to help anyone else who runs into this issue in the future.

In my case I had removed all the llog entries for the decommissioned OSTs except the conf_param entries setting osc.active=0, assuming those should be retained. That was incorrect: you'll want to remove those entries too, for each relevant OST. I've opened an issue in LUDOC with some suggestions about how the phrasing might be improved.

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
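In practice the workaround above amounts to finding the leftover conf_param records in the client configuration llog on the MGS and cancelling them. Here is a minimal sketch; the `llog_print`/`llog_cancel` invocations follow the style of manual section 14.9.3, but the record format in the sample dump is illustrative only, so verify both against your own `lctl --device MGS llog_print hydra-client` output before cancelling anything:

```shell
# Sample llog_print output (HYPOTHETICAL format -- check your real output).
cat > /tmp/llog_dump.txt <<'EOF'
- { index: 312, event: conf_param, device: hydra-OST0010-osc, parameter: osc.active=0 }
- { index: 315, event: attach, device: hydra-OST0011-osc, type: osc }
- { index: 318, event: conf_param, device: hydra-OST0011-osc, parameter: osc.active=0 }
EOF

# Extract the index of every conf_param record that sets osc.active=0 and
# print the matching llog_cancel command for review -- nothing is executed.
awk -F'[:,]' '/conf_param/ && /osc.active=0/ { gsub(/ /, "", $2);
  printf "lctl --device MGS llog_cancel hydra-client --log_idx=%s\n", $2 }' \
  /tmp/llog_dump.txt
```

Reviewing the generated commands before running them on the MGS keeps a parsing mistake from cancelling the wrong records.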
Re: [lustre-discuss] New client mounts fail after deactivating OSTs
Brian,
Please file a ticket in LUDOC with details of how the manual should be updated. Ideally, including a patch. :-)

Cheers, Andreas
Re: [lustre-discuss] New client mounts fail after deactivating OSTs
We recreated the issue in a test cluster and it was definitely the llog_cancel steps that caused the issue. Clients couldn't process the llog properly on new mounts and would fail. We had to completely clear the llog and --writeconf every target to regenerate it from scratch.

The cluster is up and running now, but I would certainly recommend at least revising that section of the manual.
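The recovery described here is the standard writeconf procedure: with every client and every target unmounted, run `tunefs.lustre --writeconf` on each target, then remount the MDT first so the MGS regenerates the configuration llogs from scratch. A hypothetical dry-run sketch with placeholder device paths (substitute your real MDT/OST devices, and remove the `echo` only once everything is unmounted):

```shell
# Placeholder target list -- NOT real device paths.
targets="/dev/mapper/mdt0 /dev/mapper/ost0010 /dev/mapper/ost0011"

# Print the commands that would be run, one target per line.
for tgt in $targets; do
  echo "tunefs.lustre --writeconf $tgt"
done
```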
[lustre-discuss] New client mounts fail after deactivating OSTs
We deactivated half of 32 OSTs after draining them. We followed the steps in section 14.9.3 of the Lustre manual:

https://doc.lustre.org/lustre_manual.xhtml#lustremaint.remove_ost

After running the steps in subhead "3. Deactivate the OST." on OST0010-OST001f, new client mounts fail with the log messages below. Existing client mounts seem to function correctly, but they are a bit of a ticking time bomb because they are configured with autofs.

The llog_cancel steps are new to me, and the issues seemed to appear after those commands were issued (I can't say that 100% definitively, however). Servers are running 2.12.5 and clients are on 2.14.x.

Jul 10 15:22:40 adm-sup1 kernel: LustreError: 26814:0:(obd_config.c:1514:class_process_config()) no device for: hydra-OST0010-osc-8be5340c2000
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 26814:0:(obd_config.c:2038:class_config_llog_handler()) MGC172.16.100.101@o2ib: cfg command failed: rc = -22
Jul 10 15:22:40 adm-sup1 kernel: Lustre: cmd=cf00f 0:hydra-OST0010-osc 1:osc.active=0
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 15b-f: MGC172.16.100.101@o2ib: Configuration from log hydra-client failed from MGS -22. Check client and MGS are on compatible version.
Jul 10 15:22:40 adm-sup1 kernel: Lustre: hydra: root_squash is set to 99:99
Jul 10 15:22:40 adm-sup1 systemd-udevd[26823]: Process '/usr/sbin/lctl set_param 'llite.hydra-8be5340c2000.nosquash_nids=192.168.80.84@tcp 192.168.80.122@tcp 192.168.80.21@tcp 172.16.90.11@o2ib 172.16.100.211@o2ib 172.16.100.212@o2ib 172.16.100.213@o2ib 172.16.100.214@o2ib 172.16.100.215@o2ib 172.16.90.51@o2ib'' failed with exit code 2.
Jul 10 15:22:40 adm-sup1 kernel: Lustre: Unmounted hydra-client
Jul 10 15:22:40 adm-sup1 kernel: LustreError: 26803:0:(obd_mount.c:1680:lustre_fill_super()) Unable to mount (-22)