Re: [lustre-discuss] How to eliminate zombie OSTs
Following
https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.remove_ost
today I did the following, with apparently no difference:

lctl set_param osp.lustre-OST0018-osc-MDT0000.max_create_count=0

but I also did

lctl --device 30 deactivate

and now the 10 zombie OSTs appear as IN, not UP:

# lctl dl|grep OST|grep IN
 18 IN osp lustre-OST000a-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 19 IN osp lustre-OST000b-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 20 IN osp lustre-OST000c-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 21 IN osp lustre-OST000d-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 22 IN osp lustre-OST000e-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 29 IN osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 30 IN osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 31 IN osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 32 IN osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4

I also deactivated the OSTs on the client with

# lctl set_param osc.lustre-OST000b-*.active=0
osc.lustre-OST000b-osc-979fbcc8b800.active=0

but I still get them as errors on the client:

# lfs check osts|grep error
lfs check: error: check 'lustre-OST000a-osc-979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000b-osc-979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000c-osc-979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000d-osc-979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000e-osc-979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST0014-osc-979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST0015-osc-979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0016-osc-979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0017-osc-979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0018-osc-979fbcc8b800': Resource temporarily unavailable (11)

I will keep reading the reference, but if you have any suggestions, I would appreciate them.

On Wed, Aug 9, 2023 at 11:08, Horn, Chris (chris.h...@hpe.com) wrote:
>
> The error message is stating that ‘-P’ is not a valid option to the conf_param
> command. You may be thinking of lctl set_param -P …
>
> Did you follow the documented procedure for removing an OST from the
> filesystem when you “adjust[ed] the configuration”?
>
> https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.remove_ost
>
> Chris Horn
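For what it's worth, the runtime deactivations above (lctl set_param osc.*.active=0 on the client and lctl --device N deactivate on the MDS) only last until the next remount; the stale osp entries presumably keep coming back because they are still recorded in the configuration logs held by the MGS. A rough sketch of the log-regeneration approach from the manual's section on writeconf, assuming a full outage is acceptable and using illustrative device paths:

(with the file system fully unmounted: clients first, then the servers)

# tunefs.lustre --writeconf /dev/disk/by-label/lustre-MDT0000    (on the combined MGS/MDT node)
# tunefs.lustre --writeconf /dev/disk/by-label/lustre-OST0000    (on each OSS, repeated for every surviving OST)

Then remount the MGS/MDT first and the OSTs afterwards; the configuration logs are rebuilt from whatever re-registers, so targets that no longer exist should stop appearing in lctl dl.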
Re: [lustre-discuss] How to eliminate zombie OSTs
The error message is stating that ‘-P’ is not a valid option to the conf_param command. You may be thinking of lctl set_param -P …

Did you follow the documented procedure for removing an OST from the filesystem when you “adjust[ed] the configuration”?

https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.remove_ost

Chris Horn
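The procedure in that manual section boils down to roughly the following; this is only a sketch using the indices from this thread (OST0015, a single MDT0000) and a hypothetical client mount point /lustre.

On the MDS, stop new object allocation on the OST being retired:

# lctl set_param osp.lustre-OST0015-osc-MDT0000.max_create_count=0

On a client, migrate any files that still have objects on that OST:

# lfs find --ost lustre-OST0015_UUID /lustre | lfs_migrate -y

Then on the MGS, deactivate it permanently in the configuration log:

# lctl conf_param lustre-OST0015.osc.active=0

After that the osp device is expected to remain visible in lctl dl, but marked IN rather than UP.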
Re: [lustre-discuss] How to eliminate zombie OSTs
Yes, it is.

On Wed, Aug 9, 2023 at 10:49, Jeff Johnson (jeff.john...@aeoncomputing.com) wrote:
>
> Alejandro,
>
> Is your MGS located on the same node as your primary MDT? (combined MGS/MDT node)
>
> --Jeff
Re: [lustre-discuss] How to eliminate zombie OSTs
Alejandro,

Is your MGS located on the same node as your primary MDT? (combined MGS/MDT node)

--Jeff

--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
[lustre-discuss] How to eliminate zombie OSTs
Hello,

In 2018 we implemented a Lustre 2.10.5 system with 20 OSTs on two OSS nodes
with 4 JBODs, each box holding 24 disks of 12 TB, for a total of nearly 1 PB.
In all that time we had power failures and failed RAID controller cards, all
of which made us adjust the configuration. After the last failure, the system
keeps sending error messages about OSTs that are no longer in the system. On
the MDS I do

# lctl dl

and I get the 20 currently active OSTs

oss01.lanot.unam.mx - OST00 /dev/disk/by-label/lustre-OST0000
oss01.lanot.unam.mx - OST01 /dev/disk/by-label/lustre-OST0001
oss01.lanot.unam.mx - OST02 /dev/disk/by-label/lustre-OST0002
oss01.lanot.unam.mx - OST03 /dev/disk/by-label/lustre-OST0003
oss01.lanot.unam.mx - OST04 /dev/disk/by-label/lustre-OST0004
oss01.lanot.unam.mx - OST05 /dev/disk/by-label/lustre-OST0005
oss01.lanot.unam.mx - OST06 /dev/disk/by-label/lustre-OST0006
oss01.lanot.unam.mx - OST07 /dev/disk/by-label/lustre-OST0007
oss01.lanot.unam.mx - OST08 /dev/disk/by-label/lustre-OST0008
oss01.lanot.unam.mx - OST09 /dev/disk/by-label/lustre-OST0009
oss02.lanot.unam.mx - OST15 /dev/disk/by-label/lustre-OST000f
oss02.lanot.unam.mx - OST16 /dev/disk/by-label/lustre-OST0010
oss02.lanot.unam.mx - OST17 /dev/disk/by-label/lustre-OST0011
oss02.lanot.unam.mx - OST18 /dev/disk/by-label/lustre-OST0012
oss02.lanot.unam.mx - OST19 /dev/disk/by-label/lustre-OST0013
oss02.lanot.unam.mx - OST25 /dev/disk/by-label/lustre-OST0019
oss02.lanot.unam.mx - OST26 /dev/disk/by-label/lustre-OST001a
oss02.lanot.unam.mx - OST27 /dev/disk/by-label/lustre-OST001b
oss02.lanot.unam.mx - OST28 /dev/disk/by-label/lustre-OST001c
oss02.lanot.unam.mx - OST29 /dev/disk/by-label/lustre-OST001d

but I also get 5 that are not currently active and in fact no longer exist:

28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4

When I try to eliminate them with

lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0

I get the error

conf_param: invalid option -- 'P'
set a permanent config parameter.
This command must be run on the MGS node
usage: conf_param [-d]
       -d  Remove the permanent setting.

If I do

lctl --device 28 deactivate

I don't get an error, but nothing changes.

What can I do?

Thank you in advance for any help.

--
Alejandro Aguilar Sierra
LANOT, ICAyCC, UNAM
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
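As a side note on the error above: conf_param does not accept -P because conf_param settings are permanent by themselves, so the documented form (run on the MGS, using the indices from this thread) would be roughly

# lctl conf_param lustre-OST0015.osc.active=0

while the -P flag belongs to set_param, along the lines of

# lctl set_param -P osp.lustre-OST0015-osc-MDT0000.active=0

on releases where set_param -P handles this parameter; otherwise conf_param is the safer choice.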