Re: [lustre-discuss] rebooting nodes

2017-08-10 Thread Colin Faber
In my experience, OFED tends to be unloaded before LNet is torn down. This
cuts the legs out from under LNet, so the LNet module won't unload cleanly
and the reboot hangs. The trick is to ensure that Lustre is unmounted first,
then LNet is unloaded, and only then are the OFED modules unloaded.
Generally, when you shut down in this order, the reboot should be clean.
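
As a rough sketch of that ordering, something like the following could be
run by hand before a reboot. It is not a tested recipe: the run and
ordered_teardown helpers are hypothetical names, and "openibd" is only an
assumed name for the OFED service (other InfiniBand stacks name it
differently). lustre_rmmod is the module-unload helper that ships with
Lustre. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch of the teardown order: Lustre mounts first, then the Lustre/LNet
# modules, then the OFED (InfiniBand) stack.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

ordered_teardown() {
    # 1. Unmount every Lustre filesystem while LNet is still up.
    run umount -a -t lustre || return 1
    # 2. Unload the Lustre and LNet kernel modules.
    run lustre_rmmod || return 1
    # 3. Only now stop the OFED stack ("openibd" is an assumed service name).
    run service openibd stop || return 1
}

ordered_teardown
```

Setting DRY_RUN=0 executes the three steps for real, stopping at the first
one that fails so OFED is never pulled out from under a live mount.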

You can verify this idea by checking your console log during the shutdown.
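
If the console scrolls by too fast, one way to review the order after the
fact is to filter the previous boot's log for the relevant lines. This is
only a sketch: filter_teardown_lines is a hypothetical helper, the pattern
list is a guess at what your drivers log, and it assumes systemd-journald is
keeping a persistent journal so the previous boot is still readable.

```shell
#!/bin/sh
# Keep only lines that mention Lustre, LNet, or the InfiniBand stack.
# The pattern list is an assumption; extend it for your own drivers.
filter_teardown_lines() {
    grep -i -E 'lustre|lnet|o2iblnd|openibd|infiniband|mlx[0-9]|ib_core'
}

# Usage (previous boot, monotonic timestamps make the ordering easy to read):
#   journalctl -b -1 -o short-monotonic | filter_teardown_lines
```

If the InfiniBand lines show up before the Lustre unmount messages, the
modules are going away in the wrong order.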

-cf

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] rebooting nodes

2017-08-10 Thread Christopher Johnston
On my systems, which use standard Ethernet (I'm in the cloud), I see no
issues rebooting with 2.9.  I did have issues with the LNet driver not being
able to grab its port on boot-up, so I backported the lnet systemd unit file
from 2.10 to get around that.



Re: [lustre-discuss] rebooting nodes

2017-08-10 Thread Ben Evans
Are the InfiniBand drivers disappearing first?  I know that used to be an
issue.

-Ben




[lustre-discuss] rebooting nodes

2017-08-10 Thread Michael Di Domenico
does anyone else have issues when issuing 'reboot' while lustre is mounted?

we're running v2.9 clients on our workstations, but when a user goes
to reboot the machine (from the gui) the system stalls under systemd
while i presume it's attempting to unmount the filesystem.

what i see on the console is: systemd kicks in and starts unmounting
all the nfs shares we have, which works fine.  but then it gets to lustre
and starts throwing connection errors on the console.  it's almost as
if systemd raced itself stopping lustre, whereby lnet got yanked out
from under the mount before the unmount actually finished.

after five minutes or so, it looks like systemd threw in the towel and
gave up trying to unmount, but the system is stuck still trying to
execute more shutdown tasks.

when we mount lustre on the workstations, i have a script that figures
some stuff out, issues a service lnet start, and then issues a mount
command.  this all works fine, but i'm not sure if that's why systemd
can't figure out what to do correctly.

and since this is during a shutdown phase, debugging this is
difficult.  any thoughts?