Re: [lustre-discuss] lnet fails to start on reboot
Are you sure the fabric is up when LNet starts at boot? Double-check the order in which your services start, and make sure LNet waits for the fabric/network before starting.

Thanks,
Keith

> -----Original Message-----
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On Behalf Of David Rackley
> Sent: Monday, August 13, 2018 2:14 PM
> To: lustre-discuss@lists.lustre.org
> Cc: sciops
> Subject: [lustre-discuss] lnet fails to start on reboot
>
> Hello,
> I have built and installed Lustre client 2.10.4-1 with CentOS 7.3
> (3.10.0-514.el7.x86_64) and on reboot LNet fails with:
> [...]
> The /etc/lnet.conf file exists and when I manually execute
> /usr/sbin/lnetctl import /etc/lnet.conf it succeeds, LNet works, and I can
> mount Lustre as expected.
>
> Any ideas?
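One way to express that ordering is a systemd drop-in; a rough sketch (the drop-in file name is made up here, and network-online.target assumes a TCP/Ethernet fabric; an InfiniBand site would typically also order after its fabric service, e.g. openibd.service or rdma.service):

    # /etc/systemd/system/lnet.service.d/wait-for-network.conf
    [Unit]
    Wants=network-online.target
    After=network-online.target

    # apply it, and make sure something actually populates network-online.target:
    systemctl daemon-reload
    systemctl enable NetworkManager-wait-online.service
    systemctl restart lnet.service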
[lustre-discuss] lnet fails to start on reboot
Hello,

I have built and installed Lustre client 2.10.4-1 with CentOS 7.3
(3.10.0-514.el7.x86_64), and on reboot LNet fails with:

[root@scissd1801:~] systemctl status lnet.service
● lnet.service - lnet management
   Loaded: loaded (/usr/lib/systemd/system/lnet.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-08-13 16:54:31 EDT; 16min ago
  Process: 2334 ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf (code=exited, status=254)
  Process: 2331 ExecStart=/usr/sbin/lnetctl lnet configure (code=exited, status=0/SUCCESS)
  Process: 2071 ExecStart=/usr/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
 Main PID: 2334 (code=exited, status=254)

Aug 13 16:54:31 scissd1801 lnetctl[2334]: - net:
Aug 13 16:54:31 scissd1801 lnetctl[2334]: errno: -100
Aug 13 16:54:31 scissd1801 lnetctl[2334]: descr: "cannot add network: Network is down"
Aug 13 16:54:31 scissd1801 lnetctl[2334]: - numa_range:
Aug 13 16:54:31 scissd1801 lnetctl[2334]: errno: 0
Aug 13 16:54:31 scissd1801 lnetctl[2334]: descr: "success"
Aug 13 16:54:31 scissd1801 systemd[1]: lnet.service: main process exited, code=exited, status=254/n/a
Aug 13 16:54:31 scissd1801 systemd[1]: Failed to start lnet management.
Aug 13 16:54:31 scissd1801 systemd[1]: Unit lnet.service entered failed state.
Aug 13 16:54:31 scissd1801 systemd[1]: lnet.service failed.

The /etc/lnet.conf file exists, and when I manually execute
/usr/sbin/lnetctl import /etc/lnet.conf it succeeds, LNet works, and I can
mount Lustre as expected.

Any ideas?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
David Rackley
CC Sci Comp Sys Admin
rack...@jlab.org
Phone: 757.269.7041
FAX: 757.269.6248
TJNAF - Thomas Jefferson National Accelerator Facility
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
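(errno -100 is -ENETDOWN, which matches the "Network is down" description: the interface named in /etc/lnet.conf is apparently not up yet when the unit runs.) For comparison, a minimal /etc/lnet.conf in the format produced by lnetctl export looks roughly like the sketch below; the NID and interface name are placeholders, not values taken from this report:

    net:
        - net type: tcp
          local NI(s):
            - nid: 192.168.1.10@tcp
              interfaces:
                  0: eth0

The import step can only succeed once the interface listed under "interfaces:" is actually up, which again points at service ordering.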
Re: [lustre-discuss] Lustre 2.10.4 failover
> On Aug 13, 2018, at 2:25 PM, David Cohen wrote:
>
> the fstab line I use for mounting the Lustre filesystem:
>
> oss03@tcp:oss01@tcp:/fsname  /storage  lustre  flock,user_xattr,defaults  0 0

OK. That looks correct.

> the mds is also configured for failover (unsuccessfully):
>
> tunefs.lustre --writeconf --erase-params --fsname=fsname --mgs
>     --mountfsoptions='user_xattr,errors=remount-ro,acl'
>     --param="mgsnode=oss03@tcp mgsnode=oss01@tcp servicenode=oss01@tcp servicenode=oss03@tcp"
>     /dev/lustre_pool/MDT

I don't think you need to specify the --mgs option (since the writeconf doesn't change what kind of target it is). Also, you should be able to just specify multiple --mgsnode=XX options like you did for the OSTs instead of using the --param="mgsnode=..." syntax. I don't know if that is affecting your failover config or not. (Maybe it doesn't matter.)

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
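Spelled out, that suggestion would make the MDT command look roughly like the sketch below (untested; the fsname, NIDs, mount options and device path are simply copied from the command quoted above, and whether this resolves the failover problem is not confirmed):

    tunefs.lustre --writeconf --erase-params \
        --fsname=fsname \
        --mgsnode=oss03@tcp --mgsnode=oss01@tcp \
        --servicenode=oss01@tcp --servicenode=oss03@tcp \
        --mountfsoptions='user_xattr,errors=remount-ro,acl' \
        /dev/lustre_pool/MDT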
Re: [lustre-discuss] Lustre 2.10.4 failover
The fstab line I use for mounting the Lustre filesystem:

oss03@tcp:oss01@tcp:/fsname  /storage  lustre  flock,user_xattr,defaults  0 0

The MDS is also configured for failover (unsuccessfully):

tunefs.lustre --writeconf --erase-params --fsname=fsname --mgs \
    --mountfsoptions='user_xattr,errors=remount-ro,acl' \
    --param="mgsnode=oss03@tcp mgsnode=oss01@tcp servicenode=oss01@tcp servicenode=oss03@tcp" \
    /dev/lustre_pool/MDT

On Mon, Aug 13, 2018 at 8:40 PM Mohr Jr, Richard Frank (Rick Mohr) <rm...@utk.edu> wrote:
>
> > On Aug 13, 2018, at 7:14 AM, David Cohen wrote:
> > [...]
> > I can mount the ost on the second server but the clients won't restore the connection.
> > Maybe I'm missing something obvious. Do you see any typo in the command?
>
> What mount command are you using on the client?
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
Re: [lustre-discuss] Lustre log messages and log files
Ah, yes, thank you. The goodies of systemd!

Regards,
Thomas

On 08/13/2018 02:10 PM, Julio Pedraza wrote:
> Hi,
>
> As you are on CentOS 7, maybe try going through:
>
>   # journalctl | grep -E "Lustre|LustreError|LNet|LDISKFS|ustre"
>
> and see if you get what you need.
>
> regards,
> J.
> [...]

--
Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453
Fax: +49-6159-71 2986
GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de
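For what it's worth, on a stock CentOS 7 system the journal is volatile (kept under /run and lost at reboot), so the kernel-ring Lustre messages disappear with it. A sketch of making the journal persistent, assuming the default Storage=auto in /etc/systemd/journald.conf:

    mkdir -p /var/log/journal
    systemd-tmpfiles --create --prefix /var/log/journal
    systemctl restart systemd-journald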
Re: [lustre-discuss] Lustre log messages and log files
Hi,

As you are on CentOS 7, maybe try going through:

  # journalctl | grep -E "Lustre|LustreError|LNet|LDISKFS|ustre"

and see if you get what you need.

regards,
J.

On 08/13/2018 02:00 PM, Thomas Roth wrote:
> Hi all,
>
> we have this rather rare phenomenon of too few Lustre log entries - it would seem.
> This is a cluster running Lustre 2.10.4 on CentOS 7.4.
> I do not think I have done anything to deviate from the defaults - neither with Lustre nor rsyslogd config.
> However,
>
> # dmesg | grep LustreError | wc -l
> 50
>
> # grep -r LustreError * | wc -l
> 0
>
> Before I saw the LustreErrors on the console, I just assumed that newer Lustre versions are not that chatty and this shiny new cluster just didn't show many errors.
>
> "Lustre:" messages make it to /var/log/messages, as in "kernel: Lustre: haven't heard from client...", so obviously I have not cut off Lustre completely.
>
> What did I do wrong?
>
> Many regards,
> Thomas
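A couple of ways to narrow that query, using only standard journalctl options (the grep pattern is an assumption about how the kernel messages are tagged):

    # kernel-ring messages from the current boot only
    journalctl -k -b | grep -E "LustreError|LNetError"

    # kernel messages of priority err and above, followed live
    journalctl -k -p err -f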
[lustre-discuss] Lustre 2.10.4 failover
Hi,

I installed a new 2.10.4 Lustre file system, running MDS and OSS on the same servers. Failover wasn't configured at format time, and I'm now trying to configure a failover node with tunefs, without success:

tunefs.lustre --writeconf --erase-params --param="ost.quota_type=ug" \
    --mgsnode=oss03@tcp --mgsnode=oss01@tcp \
    --servicenode=oss01@tcp --servicenode=oss03@tcp \
    /dev/mapper/OST0015

I can mount the OST on the second server, but the clients won't restore the connection. Maybe I'm missing something obvious. Do you see any typo in the command?

David
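For reference, the client side of such a failover setup looks roughly like the sketch below; the hostnames, fsname and mount point are taken from elsewhere in this thread, the mount options are illustrative, and the exact fields in the import output may vary by Lustre version:

    # mount the client against both MGS NIDs so it can fail over between them
    mount -t lustre oss03@tcp:oss01@tcp:/fsname /storage -o flock,user_xattr

    # check which failover NIDs the client has learned for each target
    lctl get_param osc.*.import | grep -E "current_connection|failover_nids"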