Re: [lustre-discuss] lustre 2.10.5 or 2.11.0

2018-10-19 Thread Riccardo Veraldi

On 10/19/18 12:37 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:

On Oct 17, 2018, at 7:30 PM, Riccardo Veraldi  
wrote:

Anyway, especially on the OSSes you may eventually need to tune some ZFS module 
parameters, in particular raising the vdev read/write max values above their 
defaults. You may also disable the ZIL, change redundant_metadata to "most", and 
set atime off.

I could send you a list of parameters that in my case work well.

Riccardo,

Would you mind sharing your ZFS parameters with the mailing list?  I would be 
interested to see which options you have changed.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu


This is what worked for me on my high-performance cluster:

options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
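
For reference, one way to make these persistent and verify them at runtime (the 
file name below is just an example; the /sys paths assume ZFS on Linux):

# keep the "options zfs ..." lines above in e.g. /etc/modprobe.d/lustre-zfs.conf
# so they apply at module load time, then check what the running module uses:
cat /sys/module/zfs/parameters/zfs_prefetch_disable
cat /sys/module/zfs/parameters/zfs_txg_timeout
# most of these can also be changed on a live system without reloading the module:
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout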

##

These are the ZFS attributes that I changed on the OSSes:

zfs set mountpoint=none $ostpool

zfs set sync=disabled $ostpool

zfs set atime=off $ostpool

zfs set redundant_metadata=most $ostpool

zfs set xattr=sa $ostpool

zfs set recordsize=1M $ostpool
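
A quick way to confirm they took effect (here $ostpool is still the placeholder 
for the OST pool/dataset name):

# verify the changed properties in one go
zfs get sync,atime,redundant_metadata,xattr,recordsize $ostpool

Keep in mind that sync=disabled trades durability for speed: writes that have 
been acknowledged but not yet written out can be lost if the server crashes.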

#


These are the ko2iblnd parameters for FDR Mellanox IB interfaces:

options ko2iblnd timeout=100 peer_credits=63 credits=2560 concurrent_sends=63 ntx=2048 fmr_pool_size=1280 fmr_flush_trigger=1024 ntx=5120




These are the ksocklnd parameters:

options ksocklnd sock_timeout=100 credits=2560 peer_credits=63
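
(Note that ntx appears twice in the ko2iblnd line above; only one value is 
needed.) Both lines normally go into a file under /etc/modprobe.d/ so they are 
picked up when the LNet modules load; the values actually in use can be checked 
the same way as the ZFS ones, for example:

# check what the loaded ko2iblnd module is really using
cat /sys/module/ko2iblnd/parameters/peer_credits
cat /sys/module/ko2iblnd/parameters/credits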

##

These are other parameters that I tweaked:

echo 32 > /sys/module/ptlrpc/parameters/max_ptlrpcds
echo 3 > /sys/module/ptlrpc/parameters/ptlrpcd_bind_policy

lctl set_param timeout=600
lctl set_param ldlm_timeout=200
lctl set_param at_min=250
lctl set_param at_max=600
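
The lctl set_param values above are not persistent across a reboot, so they need 
to be re-applied from a boot script (or, on newer Lustre versions, set 
permanently with "lctl set_param -P" on the MGS). The current values can be 
checked with:

# verify the running values
lctl get_param timeout ldlm_timeout at_min at_max
cat /sys/module/ptlrpc/parameters/max_ptlrpcds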

###

Also, I run this script at boot time to spread the hard drives' IRQ assignments 
across all CPUs; it is not needed for kernels > 4.4.


#!/bin/sh
# numa_smp.sh - spread the IRQs of a given device round-robin across a CPU range
# usage: numa_smp.sh <device> <first_cpu> <last_cpu>
device=$1
cpu1=$2
cpu2=$3
cpu=$cpu1
# find every interrupt belonging to the device and pin it to the next CPU in the range
grep "$device" /proc/interrupts | awk '{print $1}' | sed 's/://' | while read int
do
  echo $cpu > /proc/irq/$int/smp_affinity_list
  echo "echo CPU $cpu > /proc/irq/$int/smp_affinity_list"
  if [ "$cpu" -eq "$cpu2" ]
  then
    cpu=$cpu1
  else
    cpu=$((cpu+1))
  fi
done
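
For example, to spread the interrupts of a device whose name matches "mpt3sas" 
(a placeholder; use whatever shows up for your HBA in /proc/interrupts) over 
CPUs 0-7, call it from rc.local or a systemd unit as:

./numa_smp.sh mpt3sas 0 7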



Re: [lustre-discuss] lustre 2.10.5 or 2.11.0

2018-10-19 Thread Mohr Jr, Richard Frank (Rick Mohr)


> On Oct 17, 2018, at 7:30 PM, Riccardo Veraldi  
> wrote:
> 
> Anyway, especially on the OSSes you may eventually need to tune some ZFS module 
> parameters, in particular raising the vdev read/write max values above their 
> defaults. You may also disable the ZIL, change redundant_metadata to "most", 
> and set atime off.
> 
> I could send you a list of parameters that in my case work well.

Riccardo,

Would you mind sharing your ZFS parameters with the mailing list?  I would be 
interested to see which options you have changed.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Patrick Farrell
There is a somewhat hidden danger with eviction: you can get silent data loss.  The 
simplest example is buffered writes (i.e., anything that isn't direct I/O) - Lustre 
reports completion (i.e., your write() syscall returns) once the data is in the page 
cache on the client (like any modern file system, including local ones - you can get 
silent data loss on ext4, XFS, ZFS, etc., if your disk becomes unavailable before 
the data is written out of the page cache).

So if the client with pending dirty data is evicted from the OST the data is 
destined for - which is essentially what aborting recovery does - that data is lost, 
and the application doesn't get an error (because the write() call has already 
completed).

A message is printed to the console on the client in this case, but you have to know 
to look for it.  The application will run to completion, but you may still 
experience data loss and not know it.  It's just something to keep in mind - 
applications that run to completion despite evictions may not have completed 
*successfully*.
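
One rough way to check for this after the fact (the wording of the message varies a 
bit between Lustre versions, so treat the pattern below as an approximation) is to 
scan the client kernel log for eviction notices once a job has finished:

# look for eviction messages on the client
dmesg | grep -i evict
# or on systemd-based systems
journalctl -k | grep -i evict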

- Patrick

On 10/19/18, 11:42 AM, "lustre-discuss on behalf of Mohr Jr, Richard Frank 
(Rick Mohr)"  wrote:


> On Oct 19, 2018, at 10:42 AM, Marion Hakanson  wrote:
> 
> Thanks for the feedback.  You're both confirming what we've learned so 
far, that we had to unmount all the clients (which required rebooting most of 
them), then reboot all the storage servers, to get things unstuck until the 
problem recurred.
> 
> I tried abort_recovery on the clients last night, before rebooting the 
MDS, but that did not help.  Could well be I'm not using it right:

Aborting recovery is a server-side action, not something that is done on 
the client.  As mentioned by Peter, you can abort recovery on a single target 
after it is mounted by using “lctl --device <devno> abort_recover”.  But you can 
also just skip over the recovery step when the target is mounted by adding the 
“-o abort_recov” option to the mount command.  For example, 

mount -t lustre -o abort_recov /dev/my/mdt /mnt/lustre/mdt0

And similarly for OSTs.  So you should be able to just unmount your MDT/OST 
on the running file system and then remount with the abort_recov option.  From 
a client perspective, the lustre client will get evicted but should 
automatically reconnect.   

Some applications can ride through a client eviction without causing 
issues, some cannot.  I think it depends largely on how the application does 
its IO and if there is any IO in flight when the eviction occurs.  I have had 
to do this a few times on a running cluster, and in my experience, we have had 
good luck with most of the applications continuing without issues.  Sometimes 
there are a few jobs that abort, but overall this is better than having to stop 
all jobs and remount lustre on all the compute nodes.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Mohr Jr, Richard Frank (Rick Mohr)


> On Oct 19, 2018, at 10:42 AM, Marion Hakanson  wrote:
> 
> Thanks for the feedback.  You're both confirming what we've learned so far, 
> that we had to unmount all the clients (which required rebooting most of 
> them), then reboot all the storage servers, to get things unstuck until the 
> problem recurred.
> 
> I tried abort_recovery on the clients last night, before rebooting the MDS, 
> but that did not help.  Could well be I'm not using it right:

Aborting recovery is a server-side action, not something that is done on the 
client.  As mentioned by Peter, you can abort recovery on a single target after 
it is mounted by using “lctl --device <devno> abort_recover”.  But you can also 
just skip over the recovery step when the target is mounted by adding the “-o 
abort_recov” option to the mount command.  For example, 

mount -t lustre -o abort_recov /dev/my/mdt /mnt/lustre/mdt0

And similarly for OSTs.  So you should be able to just unmount your MDT/OST on 
the running file system and then remount with the abort_recov option.  From a 
client perspective, the lustre client will get evicted but should automatically 
reconnect.   
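
For an OST the same sequence would look something like this (device path and mount 
point are just placeholders):

umount /mnt/lustre/ost0
mount -t lustre -o abort_recov /dev/my/ost /mnt/lustre/ost0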

Some applications can ride through a client eviction without causing issues, 
some cannot.  I think it depends largely on how the application does its IO and 
if there is any IO in flight when the eviction occurs.  I have had to do this a 
few times on a running cluster, and in my experience, we have had good luck 
with most of the applications continuing without issues.  Sometimes there are a 
few jobs that abort, but overall this is better than having to stop all jobs 
and remount lustre on all the compute nodes.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu



Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Andreas Dilger
On Oct 19, 2018, at 08:42, Marion Hakanson  wrote:
> 
> Thanks for the feedback.  You're both confirming what we've learned so far, 
> that we had to unmount all the clients (which required rebooting most of 
> them), then reboot all the storage servers, to get things unstuck until the 
> problem recurred.
> 
> I tried abort_recovery on the clients last night, before rebooting the MDS, 
> but that did not help.  Could well be I'm not using it right:
> 
> - look up the MDT in "lctl dl" list.
> - run "lctl abort_recovery $mdt" on all clients
> - reboot the MDS.
> 
> The MDS still reported recovering all 259 clients at boot time.

The point of abort_recovery is to reset the server recovery engine without 
doing client recovery (i.e. tell it "don't try to recover these clients after 
the server restarted").  It should be run on the MDS and not the clients.  
Also, if you reboot the MDS then it will start recovery again, so don't do 
that...  
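
Concretely, that means something like the following on the MDS itself (the device 
index is a placeholder; look it up with lctl dl first):

# on the MDS: find the MDT device index, then abort recovery on it
lctl dl | grep mdt
lctl --device <index> abort_recovery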

> BTW, we have a separate MGS from the MDS.  Could it be we need to reboot both 
> MDS & MGS to clear things?
> 
> Thanks and regards,
> 
> Marion
> 
> 
>> On Oct 19, 2018, at 07:28, Peter Bortas  wrote:
>> 
>> That should fix it, but I'd like to advocate for using abort_recovery.
>> Compared to unmounting thousands of clients abort_recovery is a quick
>> operation that takes a few minutes to do. Wouldn't say it gets used a
>> lot, but I've done it on NSCs live environment six times since 2016,
>> solving the deadlocks each time.
>> 
>> Regards,
>> -- 
>> Peter Bortas
>> Swedish National Supercomputer Centre
>> 
>>> On Fri, Oct 19, 2018 at 3:04 PM Patrick Farrell  wrote:
>>> 
>>> 
>>> Marion,
>>> 
>>> You note the deadlock reoccurs on server reboot, so you’re really stuck.  
>>> This is most likely due to recovery where operations from the clients are 
>>> replayed.
>>> 
>>> If you’re fine with letting any pending I/O fail in order to get the system 
>>> back up, I would suggest a client side action: unmount (-f, and be patient) 
>>> and /or shut down all of your clients.  That will discard things the 
>>> clients are trying to replay, (causing pending I/O to fail).  Then shut 
>>> down your servers and start them up again.  With no clients, there’s 
>>> (almost) nothing to replay, and you probably won’t hit the issue on 
>>> startup.  (There’s also the abort_recovery option covered in the manual, 
>>> but I personally think this is easier.)
>>> 
>>> There’s no guarantee this avoids your deadlock happening again, but it’s 
>>> highly likely it’ll at least get you running.
>>> 
>>> If you need to save your pending I/O, you’ll have to install patched 
>>> software with a fix for this (sounds like WC has identified the bug) and 
>>> then reboot.
>>> 
>>> Good luck!
>>> - Patrick
>>> 
>>> From: lustre-discuss  on behalf of 
>>> Marion Hakanson 
>>> Sent: Friday, October 19, 2018 1:32:10 AM
>>> To: lustre-discuss@lists.lustre.org
>>> Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
>>> 
>>> This issue is really kicking our behinds:
>>> https://jira.whamcloud.com/browse/LU-11465
>>> 
>>> While we're waiting for the issue to get some attention from Lustre 
>>> developers, are there suggestions on how we can recover our cluster from 
>>> this kind of deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  
>>> Rebooting the storage servers does not clear the hang-up, as upon reboot 
>>> the MDS quickly ends up with the same number of D-state threads (around the 
>>> same number as we have clients).  It seems to me like there is some state 
>>> stashed away in the filesystem which restores the deadlock as soon as the 
>>> MDS comes up.
>>> 
>>> Thanks and regards,
>>> 
>>> Marion
>>> 
>>> ___
>>> lustre-discuss mailing list
>>> lustre-discuss@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud









Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Marion Hakanson
Sigh.  Instructions that I've found for that have been a bit on the slim side 
(:-).  We'll give it a try.

Thanks and regards,

Marion


> On Oct 19, 2018, at 07:59, Peter Bortas  wrote:
> 
> So, that is at least not a syntax for abort_recovery I'm familiar
> with. To take an example from last time I did this, I first determined
> which device wasn't completing the recovery, then logged in on the
> server (an OST in this case) and ran:
> 
> # lctl dl|grep obdfilter|grep fouo5-OST
>  3 UP obdfilter fouo5-OST fouo5-OST_UUID 629
> # lctl --device 3 abort_recovery
> 
> Attached is a script that you can invoke with "lustre_watch_recovery
> " that will give you the status of recovery on the named
> server updated once per second. I find it useful for keeping track of
> how things are working out while doing restarts.
> 
> Regards,
> -- 
> Peter Bortas, NSC
> 
> 
> 
> 
> 
> 
> 
> 
>> On Fri, Oct 19, 2018 at 4:42 PM Marion Hakanson  wrote:
>> 
>> Thanks for the feedback.  You're both confirming what we've learned so far, 
>> that we had to unmount all the clients (which required rebooting most of 
>> them), then reboot all the storage servers, to get things unstuck until the 
>> problem recurred.
>> 
>> I tried abort_recovery on the clients last night, before rebooting the MDS, 
>> but that did not help.  Could well be I'm not using it right:
>> 
>> - look up the MDT in "lctl dl" list.
>> - run "lctl abort_recovery $mdt" on all clients
>> - reboot the MDS.
>> 
>> The MDS still reported recovering all 259 clients at boot time.
>> 
>> BTW, we have a separate MGS from the MDS.  Could it be we need to reboot 
>> both MDS & MGS to clear things?
>> 
>> Thanks and regards,
>> 
>> Marion
>> 
>> 
>>> On Oct 19, 2018, at 07:28, Peter Bortas  wrote:
>>> 
>>> That should fix it, but I'd like to advocate for using abort_recovery.
>>> Compared to unmounting thousands of clients abort_recovery is a quick
>>> operation that takes a few minutes to do. Wouldn't say it gets used a
>>> lot, but I've done it on NSCs live environment six times since 2016,
>>> solving the deadlocks each time.
>>> 
>>> Regards,
>>> --
>>> Peter Bortas
>>> Swedish National Supercomputer Centre
>>> 
 On Fri, Oct 19, 2018 at 3:04 PM Patrick Farrell  wrote:
 
 
 Marion,
 
 You note the deadlock reoccurs on server reboot, so you’re really stuck.  
 This is most likely due to recovery where operations from the clients are 
 replayed.
 
 If you’re fine with letting any pending I/O fail in order to get the 
 system back up, I would suggest a client side action: unmount (-f, and be 
 patient) and /or shut down all of your clients.  That will discard things 
 the clients are trying to replay, (causing pending I/O to fail).  Then 
 shut down your servers and start them up again.  With no clients, there’s 
 (almost) nothing to replay, and you probably won’t hit the issue on 
 startup.  (There’s also the abort_recovery option covered in the manual, 
 but I personally think this is easier.)
 
 There’s no guarantee this avoids your deadlock happening again, but it’s 
 highly likely it’ll at least get you running.
 
 If you need to save your pending I/O, you’ll have to install patched 
 software with a fix for this (sounds like WC has identified the bug) and 
 then reboot.
 
 Good luck!
 - Patrick
 
 From: lustre-discuss  on behalf 
 of Marion Hakanson 
 Sent: Friday, October 19, 2018 1:32:10 AM
 To: lustre-discuss@lists.lustre.org
 Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
 
 This issue is really kicking our behinds:
 https://jira.whamcloud.com/browse/LU-11465
 
 While we're waiting for the issue to get some attention from Lustre 
 developers, are there suggestions on how we can recover our cluster from 
 this kind of deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  
 Rebooting the storage servers does not clear the hang-up, as upon reboot 
 the MDS quickly ends up with the same number of D-state threads (around 
 the same number as we have clients).  It seems to me like there is some 
 state stashed away in the filesystem which restores the deadlock as soon 
 as the MDS comes up.
 
 Thanks and regards,
 
 Marion
 
 ___
 lustre-discuss mailing list
 lustre-discuss@lists.lustre.org
 http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 


Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Peter Bortas
So, that is at least not a syntax for abort_recovery I'm familiar
with. To take an example from last time I did this, I first determined
which device wasn't completing the recovery, then logged in on the
server (an OST in this case) and ran:

# lctl dl|grep obdfilter|grep fouo5-OST
  3 UP obdfilter fouo5-OST fouo5-OST_UUID 629
# lctl --device 3 abort_recovery

Attached is a script that you can invoke with "lustre_watch_recovery
" that will give you the status of recovery on the named
server updated once per second. I find it useful for keeping track of
how things are working out while doing restarts.
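
In case the attachment doesn't make it through the archive, a minimal local sketch 
of such a watcher (the attached script takes a server name and may differ; this one 
just polls the node it runs on) could look like:

#!/bin/sh
# rough sketch of a recovery watcher, not the attached script itself
# usage: watch_recovery [target-glob], default is all targets on this server
target=${1:-*}
while true
do
    clear
    date
    # recovery_status lives under obdfilter.* on an OSS and mdt.* on an MDS
    lctl get_param "obdfilter.${target}.recovery_status" \
                   "mdt.${target}.recovery_status" 2>/dev/null
    sleep 1
done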

Regards,
-- 
Peter Bortas, NSC








On Fri, Oct 19, 2018 at 4:42 PM Marion Hakanson  wrote:
>
> Thanks for the feedback.  You're both confirming what we've learned so far, 
> that we had to unmount all the clients (which required rebooting most of 
> them), then reboot all the storage servers, to get things unstuck until the 
> problem recurred.
>
> I tried abort_recovery on the clients last night, before rebooting the MDS, 
> but that did not help.  Could well be I'm not using it right:
>
> - look up the MDT in "lctl dl" list.
> - run "lctl abort_recovery $mdt" on all clients
> - reboot the MDS.
>
> The MDS still reported recovering all 259 clients at boot time.
>
> BTW, we have a separate MGS from the MDS.  Could it be we need to reboot both 
> MDS & MGS to clear things?
>
> Thanks and regards,
>
> Marion
>
>
> > On Oct 19, 2018, at 07:28, Peter Bortas  wrote:
> >
> > That should fix it, but I'd like to advocate for using abort_recovery.
> > Compared to unmounting thousands of clients abort_recovery is a quick
> > operation that takes a few minutes to do. Wouldn't say it gets used a
> > lot, but I've done it on NSCs live environment six times since 2016,
> > solving the deadlocks each time.
> >
> > Regards,
> > --
> > Peter Bortas
> > Swedish National Supercomputer Centre
> >
> >> On Fri, Oct 19, 2018 at 3:04 PM Patrick Farrell  wrote:
> >>
> >>
> >> Marion,
> >>
> >> You note the deadlock reoccurs on server reboot, so you’re really stuck.  
> >> This is most likely due to recovery where operations from the clients are 
> >> replayed.
> >>
> >> If you’re fine with letting any pending I/O fail in order to get the 
> >> system back up, I would suggest a client side action: unmount (-f, and be 
> >> patient) and /or shut down all of your clients.  That will discard things 
> >> the clients are trying to replay, (causing pending I/O to fail).  Then 
> >> shut down your servers and start them up again.  With no clients, there’s 
> >> (almost) nothing to replay, and you probably won’t hit the issue on 
> >> startup.  (There’s also the abort_recovery option covered in the manual, 
> >> but I personally think this is easier.)
> >>
> >> There’s no guarantee this avoids your deadlock happening again, but it’s 
> >> highly likely it’ll at least get you running.
> >>
> >> If you need to save your pending I/O, you’ll have to install patched 
> >> software with a fix for this (sounds like WC has identified the bug) and 
> >> then reboot.
> >>
> >> Good luck!
> >> - Patrick
> >> 
> >> From: lustre-discuss  on behalf 
> >> of Marion Hakanson 
> >> Sent: Friday, October 19, 2018 1:32:10 AM
> >> To: lustre-discuss@lists.lustre.org
> >> Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
> >>
> >> This issue is really kicking our behinds:
> >> https://jira.whamcloud.com/browse/LU-11465
> >>
> >> While we're waiting for the issue to get some attention from Lustre 
> >> developers, are there suggestions on how we can recover our cluster from 
> >> this kind of deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  
> >> Rebooting the storage servers does not clear the hang-up, as upon reboot 
> >> the MDS quickly ends up with the same number of D-state threads (around 
> >> the same number as we have clients).  It seems to me like there is some 
> >> state stashed away in the filesystem which restores the deadlock as soon 
> >> as the MDS comes up.
> >>
> >> Thanks and regards,
> >>
> >> Marion
> >>
> >> ___
> >> lustre-discuss mailing list
> >> lustre-discuss@lists.lustre.org
> >> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


lustre_watch_recovery
Description: Binary data


Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Marion Hakanson
Thanks for the feedback.  You're both confirming what we've learned so far, 
that we had to unmount all the clients (which required rebooting most of them), 
then reboot all the storage servers, to get things unstuck until the problem 
recurred.

I tried abort_recovery on the clients last night, before rebooting the MDS, but 
that did not help.  Could well be I'm not using it right:

- look up the MDT in "lctl dl" list.
- run "lctl abort_recovery $mdt" on all clients
- reboot the MDS.

The MDS still reported recovering all 259 clients at boot time.

BTW, we have a separate MGS from the MDS.  Could it be we need to reboot both 
MDS & MGS to clear things?

Thanks and regards,

Marion


> On Oct 19, 2018, at 07:28, Peter Bortas  wrote:
> 
> That should fix it, but I'd like to advocate for using abort_recovery.
> Compared to unmounting thousands of clients abort_recovery is a quick
> operation that takes a few minutes to do. Wouldn't say it gets used a
> lot, but I've done it on NSCs live environment six times since 2016,
> solving the deadlocks each time.
> 
> Regards,
> -- 
> Peter Bortas
> Swedish National Supercomputer Centre
> 
>> On Fri, Oct 19, 2018 at 3:04 PM Patrick Farrell  wrote:
>> 
>> 
>> Marion,
>> 
>> You note the deadlock reoccurs on server reboot, so you’re really stuck.  
>> This is most likely due to recovery where operations from the clients are 
>> replayed.
>> 
>> If you’re fine with letting any pending I/O fail in order to get the system 
>> back up, I would suggest a client side action: unmount (-f, and be patient) 
>> and /or shut down all of your clients.  That will discard things the clients 
>> are trying to replay, (causing pending I/O to fail).  Then shut down your 
>> servers and start them up again.  With no clients, there’s (almost) nothing 
>> to replay, and you probably won’t hit the issue on startup.  (There’s also 
>> the abort_recovery option covered in the manual, but I personally think this 
>> is easier.)
>> 
>> There’s no guarantee this avoids your deadlock happening again, but it’s 
>> highly likely it’ll at least get you running.
>> 
>> If you need to save your pending I/O, you’ll have to install patched 
>> software with a fix for this (sounds like WC has identified the bug) and 
>> then reboot.
>> 
>> Good luck!
>> - Patrick
>> 
>> From: lustre-discuss  on behalf of 
>> Marion Hakanson 
>> Sent: Friday, October 19, 2018 1:32:10 AM
>> To: lustre-discuss@lists.lustre.org
>> Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
>> 
>> This issue is really kicking our behinds:
>> https://jira.whamcloud.com/browse/LU-11465
>> 
>> While we're waiting for the issue to get some attention from Lustre 
>> developers, are there suggestions on how we can recover our cluster from 
>> this kind of deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  
>> Rebooting the storage servers does not clear the hang-up, as upon reboot the 
>> MDS quickly ends up with the same number of D-state threads (around the same 
>> number as we have clients).  It seems to me like there is some state stashed 
>> away in the filesystem which restores the deadlock as soon as the MDS comes 
>> up.
>> 
>> Thanks and regards,
>> 
>> Marion
>> 
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Peter Bortas
That should fix it, but I'd like to advocate for using abort_recovery.
Compared to unmounting thousands of clients abort_recovery is a quick
operation that takes a few minutes to do. Wouldn't say it gets used a
lot, but I've done it on NSC's live environment six times since 2016,
solving the deadlocks each time.

Regards,
-- 
Peter Bortas
Swedish National Supercomputer Centre

On Fri, Oct 19, 2018 at 3:04 PM Patrick Farrell  wrote:
>
>
> Marion,
>
> You note the deadlock reoccurs on server reboot, so you’re really stuck.  
> This is most likely due to recovery where operations from the clients are 
> replayed.
>
> If you’re fine with letting any pending I/O fail in order to get the system 
> back up, I would suggest a client side action: unmount (-f, and be patient) 
> and /or shut down all of your clients.  That will discard things the clients 
> are trying to replay, (causing pending I/O to fail).  Then shut down your 
> servers and start them up again.  With no clients, there’s (almost) nothing 
> to replay, and you probably won’t hit the issue on startup.  (There’s also 
> the abort_recovery option covered in the manual, but I personally think this 
> is easier.)
>
> There’s no guarantee this avoids your deadlock happening again, but it’s 
> highly likely it’ll at least get you running.
>
> If you need to save your pending I/O, you’ll have to install patched software 
> with a fix for this (sounds like WC has identified the bug) and then reboot.
>
> Good luck!
> - Patrick
> 
> From: lustre-discuss  on behalf of 
> Marion Hakanson 
> Sent: Friday, October 19, 2018 1:32:10 AM
> To: lustre-discuss@lists.lustre.org
> Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5
>
> This issue is really kicking our behinds:
> https://jira.whamcloud.com/browse/LU-11465
>
> While we're waiting for the issue to get some attention from Lustre 
> developers, are there suggestions on how we can recover our cluster from this 
> kind of deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  Rebooting 
> the storage servers does not clear the hang-up, as upon reboot the MDS 
> quickly ends up with the same number of D-state threads (around the same 
> number as we have clients).  It seems to me like there is some state stashed 
> away in the filesystem which restores the deadlock as soon as the MDS comes 
> up.
>
> Thanks and regards,
>
> Marion
>
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

2018-10-19 Thread Patrick Farrell

Marion,

You note the deadlock reoccurs on server reboot, so you’re really stuck.  This 
is most likely due to recovery where operations from the clients are replayed.

If you’re fine with letting any pending I/O fail in order to get the system back up, 
I would suggest a client-side action: unmount (-f, and be patient) and/or shut down 
all of your clients.  That will discard the things the clients are trying to replay 
(causing pending I/O to fail).  Then shut down your servers and start them up again.  
With no clients, there’s (almost) nothing to replay, and you probably won’t hit the 
issue on startup.  (There’s also the abort_recovery option covered in the manual, 
but I personally think this is easier.)

There’s no guarantee this avoids your deadlock happening again, but it’s highly 
likely it’ll at least get you running.

If you need to save your pending I/O, you’ll have to install patched software 
with a fix for this (sounds like WC has identified the bug) and then reboot.

Good luck!
- Patrick

From: lustre-discuss  on behalf of 
Marion Hakanson 
Sent: Friday, October 19, 2018 1:32:10 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] LU-11465 OSS/MDS deadlock in 2.10.5

This issue is really kicking our behinds:
https://jira.whamcloud.com/browse/LU-11465

While we're waiting for the issue to get some attention from Lustre developers, 
are there suggestions on how we can recover our cluster from this kind of 
deadlocked, stuck-threads-on-the-MDS (or OSS) situation?  Rebooting the storage 
servers does not clear the hang-up, as upon reboot the MDS quickly ends up with 
the same number of D-state threads (around the same number as we have clients). 
 It seems to me like there is some state stashed away in the filesystem which 
restores the deadlock as soon as the MDS comes up.

Thanks and regards,

Marion
