Re: [lustre-discuss] lnet fails to start on reboot

2018-08-13 Thread Mannthey, Keith
Are you sure the fabric is up when LNet starts at boot? Double-check the order
in which your services start, and make sure LNet waits for the fabric/network
before starting.
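
For example, a minimal systemd drop-in along the lines below makes lnet.service
wait for the network-online target (a sketch only; whether network-online.target
actually waits for the fabric depends on how the interface is managed on the host):

  # /etc/systemd/system/lnet.service.d/wait-for-network.conf
  [Unit]
  Wants=network-online.target
  After=network-online.target

  # pick up the drop-in and make sure the wait-online helper is enabled
  systemctl daemon-reload
  systemctl enable NetworkManager-wait-online.service   # if NetworkManager manages the interface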

 Thanks,
 Keith 



> -----Original Message-----
> From: lustre-discuss [mailto:lustre-discuss-boun...@lists.lustre.org] On 
> Behalf
> Of David Rackley
> Sent: Monday, August 13, 2018 2:14 PM
> To: lustre-discuss@lists.lustre.org
> Cc: sciops 
> Subject: [lustre-discuss] lnet fails to start on reboot
> 
> Hello,
> I have built and installed the Lustre 2.10.4-1 client on CentOS 7.3
> (3.10.0-514.el7.x86_64), and on reboot lnet fails with:
> [root@scissd1801:~] systemctl status lnet.service
> ● lnet.service - lnet management
>    Loaded: loaded (/usr/lib/systemd/system/lnet.service; enabled; vendor preset: disabled)
>    Active: failed (Result: exit-code) since Mon 2018-08-13 16:54:31 EDT; 16min ago
>   Process: 2334 ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf (code=exited, status=254)
>   Process: 2331 ExecStart=/usr/sbin/lnetctl lnet configure (code=exited, status=0/SUCCESS)
>   Process: 2071 ExecStart=/usr/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
>  Main PID: 2334 (code=exited, status=254)
> 
> Aug 13 16:54:31 scissd1801 lnetctl[2334]: - net:
> Aug 13 16:54:31 scissd1801 lnetctl[2334]: errno: -100
> Aug 13 16:54:31 scissd1801 lnetctl[2334]: descr: "cannot add network: Network is down"
> Aug 13 16:54:31 scissd1801 lnetctl[2334]: - numa_range:
> Aug 13 16:54:31 scissd1801 lnetctl[2334]: errno: 0
> Aug 13 16:54:31 scissd1801 lnetctl[2334]: descr: "success"
> Aug 13 16:54:31 scissd1801 systemd[1]: lnet.service: main process exited, code=exited, status=254/n/a
> Aug 13 16:54:31 scissd1801 systemd[1]: Failed to start lnet management.
> Aug 13 16:54:31 scissd1801 systemd[1]: Unit lnet.service entered failed state.
> Aug 13 16:54:31 scissd1801 systemd[1]: lnet.service failed.
> 
> The /etc/lnet.conf file exists, and when I manually execute /usr/sbin/lnetctl
> import /etc/lnet.conf it succeeds, LNet comes up, and I can mount Lustre as
> expected.
> 
> Any ideas?
> 
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> David Rackley | CC Sci Comp Sys Admin | rack...@jlab.org
> Phone: 757.269.7041 | FAX: 757.269.6248
> TJNAF - Thomas Jefferson National Accelerator Facility
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] lnet fails to start on reboot

2018-08-13 Thread David Rackley
Hello,
I have built and installed the Lustre 2.10.4-1 client on CentOS 7.3
(3.10.0-514.el7.x86_64), and on reboot lnet fails with:
 [root@scissd1801:~] systemctl status lnet.service
● lnet.service - lnet management
   Loaded: loaded (/usr/lib/systemd/system/lnet.service; enabled; vendor 
preset: disabled)
   Active: failed (Result: exit-code) since Mon 2018-08-13 16:54:31 EDT; 16min 
ago
  Process: 2334 ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf (code=exited, 
status=254)
  Process: 2331 ExecStart=/usr/sbin/lnetctl lnet configure (code=exited, 
status=0/SUCCESS)
  Process: 2071 ExecStart=/usr/sbin/modprobe lnet (code=exited, 
status=0/SUCCESS)
 Main PID: 2334 (code=exited, status=254)

Aug 13 16:54:31 scissd1801 lnetctl[2334]: - net:
Aug 13 16:54:31 scissd1801 lnetctl[2334]: errno: -100
Aug 13 16:54:31 scissd1801 lnetctl[2334]: descr: "cannot add network: Network 
is down"
Aug 13 16:54:31 scissd1801 lnetctl[2334]: - numa_range:
Aug 13 16:54:31 scissd1801 lnetctl[2334]: errno: 0
Aug 13 16:54:31 scissd1801 lnetctl[2334]: descr: "success"
Aug 13 16:54:31 scissd1801 systemd[1]: lnet.service: main process exited, 
code=exited, status=254/n/a
Aug 13 16:54:31 scissd1801 systemd[1]: Failed to start lnet management.
Aug 13 16:54:31 scissd1801 systemd[1]: Unit lnet.service entered failed state.
Aug 13 16:54:31 scissd1801 systemd[1]: lnet.service failed.

The /etc/lnet.conf file exists, and when I manually execute /usr/sbin/lnetctl
import /etc/lnet.conf it succeeds, LNet comes up, and I can mount Lustre as
expected.
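
For reference, a minimal single-interface /etc/lnet.conf (the YAML that lnetctl
import expects) looks roughly like this; eth0 below is only a placeholder for
whichever interface carries LNet traffic here:

  net:
      - net type: tcp
        local NI(s):
          - interfaces:
                0: eth0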

Any ideas?

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
David Rackley | CC Sci Comp Sys Admin | rack...@jlab.org
Phone: 757.269.7041 | FAX: 757.269.6248
TJNAF - Thomas Jefferson National Accelerator Facility
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.4 failover

2018-08-13 Thread Mohr Jr, Richard Frank (Rick Mohr)

> On Aug 13, 2018, at 2:25 PM, David Cohen wrote:
> 
> the fstab line I use for mounting the Lustre filesystem:
> 
> oss03@tcp:oss01@tcp:/fsname /storage lustre flock,user_xattr,defaults 0 0

OK.  That looks correct.

> the mds is also configured for failover (unsuccessfully) :
> tunefs.lustre --writeconf --erase-params --fsname=fsname --mgs 
> --mountfsoptions='user_xattr,errors=remount-ro,acl' 
> --param="mgsnode=oss03@tcp mgsnode=oss01@tcp servicenode=oss01@tcp 
> servicenode=oss03@tcp" /dev/lustre_pool/MDT

I don't think you need to specify the --mgs option (since the writeconf doesn't
change what kind of target it is). Also, you should be able to just specify
multiple --mgsnode=XX options like you did for the OSTs instead of using the
--param="mgsnode.." syntax. I don't know whether that is affecting your failover
config or not. (Maybe it doesn't matter.)
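
Put together, a writeconf along those lines for the MDT might look roughly like
this (just a sketch reusing the NIDs and mount options from your command, not
something verified against this filesystem):

  tunefs.lustre --writeconf --erase-params --fsname=fsname \
      --mountfsoptions='user_xattr,errors=remount-ro,acl' \
      --mgsnode=oss03@tcp --mgsnode=oss01@tcp \
      --servicenode=oss01@tcp --servicenode=oss03@tcp \
      /dev/lustre_pool/MDT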

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.10.4 failover

2018-08-13 Thread David Cohen
the fstab line I use for mounting the Lustre filesystem:

oss03@tcp:oss01@tcp:/fsname /storage lustre flock,user_xattr,defaults 0 0
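
(For comparison, the equivalent manual client mount for that entry would be
roughly the following, assuming /storage is the mount point as above:

  mount -t lustre -o flock,user_xattr oss03@tcp:oss01@tcp:/fsname /storage
)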

The MDS is also configured for failover (unsuccessfully):
tunefs.lustre --writeconf --erase-params --fsname=fsname --mgs
--mountfsoptions='user_xattr,errors=remount-ro,acl'
--param="mgsnode=oss03@tcp mgsnode=oss01@tcp servicenode=oss01@tcp
servicenode=oss03@tcp" /dev/lustre_pool/MDT




On Mon, Aug 13, 2018 at 8:40 PM Mohr Jr, Richard Frank (Rick Mohr) <
rm...@utk.edu> wrote:

>
> > On Aug 13, 2018, at 7:14 AM, David Cohen wrote:
> >
> > I installed a new 2.10.4 Lustre file system.
> > Running MDS and OSS on the same servers.
> > Failover wasn't configured at format time.
> > I'm trying to configure failover node with tunefs without success.
> > tunefs.lustre --writeconf --erase-params --param="ost.quota_type=ug"
> --mgsnode=oss03@tcp --mgsnode=oss01@tcp --servicenode=oss01@tcp
> --servicenode=oss03@tcp /dev/mapper/OST0015
> >
> > I can mount the ost on the second server but the clients won't restore
> the connection.
> > Maybe I'm missing something obvious. Do you see any typo in the command?
>
> What mount command are you using on the client?
>
> --
> Rick Mohr
> Senior HPC System Administrator
> National Institute for Computational Sciences
> http://www.nics.tennessee.edu
>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre log messages and log files

2018-08-13 Thread Thomas Roth
Ah, yes, thank you.
The goodies of systemd!
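
(If the LustreError lines should additionally land in a plain log file, a
one-line rsyslog rule roughly like this would do it; the path and match string
are illustrative only:

  # /etc/rsyslog.d/10-lustre.conf
  :msg, contains, "LustreError"    /var/log/lustre-errors.log

followed by a restart of rsyslog.)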

Regards,
Thomas

On 08/13/2018 02:10 PM, Julio Pedraza wrote:
> 
> Hi,
> 
> As you are on CentOS 7, maybe try going through:
> 
> # journalctl | grep -E "Lustre|LustreError|LNet|LDISKFS|ustre"
> 
> and see if you get what you need.
> 
> regards,
> J.
> 
> On 08/13/2018 02:00 PM, Thomas Roth wrote:
>> Hi all,
>>
> >> We seem to have the rather rare phenomenon of too few Lustre log entries.
> >> This is a cluster running Lustre 2.10.4 on CentOS 7.4.
> >> I do not think I have done anything to deviate from the defaults, neither
> >> with Lustre nor with the rsyslogd config.
>> However,
>>
>> # dmesg | grep LustreError | wc -l
>> 50
>>
>> # grep -r LustreError * | wc -l
>> 0
>>
>>
>> Before I saw the LustreErrors on the console, I just assumed that newer 
>> Lustre versions are not that
>> chatty and this shiny new cluster just didn't show many errors.
>>
>> "Lustre:" messages make it to /var/log/messages, as in "kernel: Lustre: 
>> haven't heard from client...",
>> obviously I have not cut off Lustre completely.
>>
>> What did I do wrong?
>>
>>
>> Many regards,
>> Thomas
>>
>>
>>
> 

-- 

Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453  Fax: +49-6159-71 2986


GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung:
Professor Dr. Paolo Giubellino, Ursula Weyrich, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats:
State Secretary / Staatssekretär Dr. Georg Schütte
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre log messages and log files

2018-08-13 Thread Julio Pedraza


Hi,

As you are on CentOS 7, maybe try going through:

# journalctl | grep -E "Lustre|LustreError|LNet|LDISKFS|ustre"

and see if you get what you need.
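
For the kernel-side messages that dmesg shows (which is where the LustreError
lines come from), something like this should also work; making the journal
persistent is optional and shown only as an illustration:

  # kernel ring buffer via the journal (roughly equivalent to dmesg)
  journalctl -k | grep LustreError

  # optionally keep journal entries across reboots
  mkdir -p /var/log/journal
  systemctl restart systemd-journald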

regards,
J.

On 08/13/2018 02:00 PM, Thomas Roth wrote:

Hi all,

We seem to have the rather rare phenomenon of too few Lustre log entries.
This is a cluster running Lustre 2.10.4 on CentOS 7.4.
I do not think I have done anything to deviate from the defaults, neither with
Lustre nor with the rsyslogd config.
However,

# dmesg | grep LustreError | wc -l
50

# grep -r LustreError * | wc -l
0


Before I saw the LustreErrors on the console, I just assumed that newer Lustre 
versions are not that
chatty and this shiny new cluster just didn't show many errors.

"Lustre:" messages make it to /var/log/messages, as in "kernel: Lustre: haven't 
heard from client...",
obviously I have not cut off Lustre completely.

What did I do wrong?


Many regards,
Thomas





___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Lustre 2.10.4 failover

2018-08-13 Thread David Cohen
Hi
I installed a new 2.10.4 Lustre file system.
Running MDS and OSS on the same servers.
Failover wasn't configured at format time.
I'm trying to configure a failover node with tunefs, without success.
tunefs.lustre --writeconf --erase-params --param="ost.quota_type=ug"
--mgsnode=oss03@tcp --mgsnode=oss01@tcp --servicenode=oss01@tcp
--servicenode=oss03@tcp /dev/mapper/OST0015

I can mount the OST on the second server, but the clients won't restore the
connection.
Maybe I'm missing something obvious. Do you see any typo in the command?
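
One way to sanity-check the result (not from this thread, just a suggestion):
tunefs.lustre can print the parameters it would write without changing anything,
so the failover NIDs can be inspected before remounting:

  tunefs.lustre --dryrun /dev/mapper/OST0015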


David
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org