Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-22 Thread Mike Mosley via lustre-discuss
Rick,

You were on the right track!

We were fortunate enough to get an expert from Cambridge Computing to take
a look at things and he managed to get us back into a normal state.

He remounted the MDTs with the *abort_recov* option and we were finally
able to get things going again.
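
For the archives, the remount amounted to something like the following (our
combined MGS/MDT device is /dev/sdb, as in the mkfs.lustre command quoted later
in this thread; the mount point shown is just a placeholder):

# mount -t lustre -o abort_recov /dev/sdb /mnt/lustre/mdt0

As we understand it, abort_recov simply skips the client recovery window on
mount (clients are evicted and then reconnect) rather than changing anything on
disk.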

Thanks to all who responded and special shout out to Brad at Cambridge
Computing for making time to help us get this fixed.

Mike




On Wed, Jun 21, 2023 at 4:32 PM Mohr, Rick  wrote:

> Mike,
>
> On the off chance that the recovery process is causing the issue, you
> could try mounting the mdt with the "abort_recov" option and see if the
> behavior changes.
>
> --Rick
>
>
>
> On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson"
> <lustre-discuss-boun...@lists.lustre.org on behalf of jeff.john...@aeoncomputing.com>
> wrote:
>
>
> Maybe someone else in the list can add clarity but I don't believe a
> recovery process on mount would keep the MDS read-only or trigger that
> trace. Something else may be going on.
>
>
> I would start from the ground up. Bring your servers up, unmounted. Ensure
> lnet is loaded and configured properly. Test lnet using ping or
> lnet_selftest from your MDS to all of your OSS nodes. Then mount your
> combined MGS/MDT volume on the MDS and see what happens.
>
>
>
>
> Is your MDS in a high-availability pair?
> What version of Lustre are you running?
>
>
>
>
> ...just a few things readers on the list might want to know.
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley <mike.mos...@charlotte.edu> wrote:
>
>
> Jeff,
>
>
> At this point we have the OSS servers shut down. We were coming back from a full
> outage and so we are trying to get the MDS up before starting to bring up
> the OSS servers.
>
>
>
>
> Mike
>
>
>
>
> On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:
>
>
> Mike,
>
>
> Have you made sure the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss
> <lustre-discuss@lists.lustre.org> wrote:
>
>
> Rick, 172.16.100.4 is the IB address of one of the OSS servers. I
> believe the mgt and mdt0 are the same target. My understanding is that we
> have a single instance of the MGT, which is on the first MDT server, i.e. it
> was created via a command similar to:
>
>
>
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
>
>
>
>
>
> Does that make sense?
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
> Which host is 172.16.100.4? Also, are the mgt and mdt0 on the same target
> or are they two separate targets just on the same host?
>
>
> --Rick
>
>
>
>
> On 6/21/23, 12:52 PM, "Mike Mosley"  mike.mos...@charlotte.edu> <_blank>   <_blank>>> wrote:
>
>
>
>
> Hi Rick,
>
>
>
>
> The MGS/MDS are combined. The output I posted is from the primary.
>
>
>
>
>
>
>
>
> Thanks,
>
>
>
>
>
>
>
>
> Mike
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
>
>
> Mike,
>
>
>
>
> It looks like the mds server is having a problem contacting the mgs
> server. I'm guessing the mgs is a separate host? I would start by looking
> for possible network problems that might explain the LNet timeouts. You can
> try using "lctl ping" to test the LNet connection between nodes, and you
> can also try regular "ping" between the IP addresses on the IB interfaces.
>
>
>
>
> --Rick
>
>
>
>
>
>
>
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss"  lustre-discuss-boun...@lists.lustre.org> <_blank>  lustre-discuss-boun...@lists.lustre.org  lustre-discuss-boun...@lists.lustre.org> <_blank>> <_blank>  lustre-discuss-boun...@lists.lustre.org  lustre-discuss-boun...@lists.lustre.org> <_blank>  lustre-discuss-boun...@lists.lustre.org  lustre-discuss-boun...@lists.lustre.org> <_blank>> <_blank>> on behalf of
> lustre-discuss@lists.lustre.org 
> <_blank>  <_blank>> <_blank>  lustre-discuss@lists.lustre.org 
> <_blank>  <_blank>> <_blank>>> wrote:
>
>
>
>
>
>
>
>
> Greetings,
>
>
>
>
>
>
>

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Rick,

Thanks, we are going to try some of these suggestions later this evening or
tomorrow. We are currently backing up the MDT (as described in the Lustre
manual). I will post further once we get there.
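
In case it helps anyone searching the archives later, the device-level variant
of that backup looks roughly like this (assuming the MDT device is /dev/sdb, as
in the mkfs.lustre command quoted below, and a made-up destination path; the
manual also describes a file-level tar/getfattr variant):

# dd if=/dev/sdb of=/backup/mdt0-backup.img bs=4M   # raw image of the unmounted MDT device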

Thanks for the suggestions.

Mike

On Wed, Jun 21, 2023 at 4:32 PM Mohr, Rick  wrote:

> Mike,
>
> On the off chance that the recovery process is causing the issue, you
> could try mounting the mdt with the "abort_recov" option and see if the
> behavior changes.
>
> --Rick
>
>
>
> On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson"
> <lustre-discuss-boun...@lists.lustre.org on behalf of jeff.john...@aeoncomputing.com>
> wrote:
>
>
> Maybe someone else in the list can add clarity but I don't believe a
> recovery process on mount would keep the MDS read-only or trigger that
> trace. Something else may be going on.
>
>
> I would start from the ground up. Bring your servers up, unmounted. Ensure
> lnet is loaded and configured properly. Test lnet using ping or
> lnet_selftest from your MDS to all of your OSS nodes. Then mount your
> combined MGS/MDT volume on the MDS and see what happens.
>
>
>
>
> Is your MDS in a high-availability pair?
> What version of Lustre are you running?
>
>
>
>
> ...just a few things readers on the list might want to know.
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley <mike.mos...@charlotte.edu> wrote:
>
>
> Jeff,
>
>
> At this point we have the OSS servers shut down. We were coming back from a full
> outage and so we are trying to get the MDS up before starting to bring up
> the OSS servers.
>
>
>
>
> Mike
>
>
>
>
> On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:
>
>
> Mike,
>
>
> Have you made sure the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>
>
>
>
> --Jeff
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss
> <lustre-discuss@lists.lustre.org> wrote:
>
>
> Rick, 172.16.100.4 is the IB address of one of the OSS servers. I
> believe the mgt and mdt0 are the same target. My understanding is that we
> have a single instance of the MGT, which is on the first MDT server, i.e. it
> was created via a command similar to:
>
>
>
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
>
>
>
>
>
> Does that make sense?
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
> Which host is 172.16.100.4? Also, are the mgt and mdt0 on the same target
> or are they two separate targets just on the same host?
>
>
> --Rick
>
>
>
>
> On 6/21/23, 12:52 PM, "Mike Mosley"  mike.mos...@charlotte.edu> <_blank>   <_blank>>> wrote:
>
>
>
>
> Hi Rick,
>
>
>
>
> The MGS/MDS are combined. The output I posted is from the primary.
>
>
>
>
>
>
>
>
> Thanks,
>
>
>
>
>
>
>
>
> Mike
>
>
>
>
>
>
>
>
> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
>
>
> Mike,
>
>
>
>
> It looks like the mds server is having a problem contacting the mgs
> server. I'm guessing the mgs is a separate host? I would start by looking
> for possible network problems that might explain the LNet timeouts. You can
> try using "lctl ping" to test the LNet connection between nodes, and you
> can also try regular "ping" between the IP addresses on the IB interfaces.
>
>
>
>
> --Rick
>
>
>
>
>
>
>
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss"  lustre-discuss-boun...@lists.lustre.org> <_blank>  lustre-discuss-boun...@lists.lustre.org  lustre-discuss-boun...@lists.lustre.org> <_blank>> <_blank>  lustre-discuss-boun...@lists.lustre.org  lustre-discuss-boun...@lists.lustre.org> <_blank>  lustre-discuss-boun...@lists.lustre.org  lustre-discuss-boun...@lists.lustre.org> <_blank>> <_blank>> on behalf of
> lustre-discuss@lists.lustre.org 
> <_blank>  <_blank>> <_blank>  lustre-discuss@lists.lustre.org 
> <_blank>  <_blank>> <_blank>>> wrote:
>
>
>
>
>
>
>
>
> Greetings,
>
>
>
>
>
>
>
>
> We have experienced some type of issue that is causing both of our MDS
> servers to only be able to mount the mdt device in read only mode. Here are
> some of the error 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mohr, Rick via lustre-discuss
Mike,

On the off chance that the recovery process is causing the issue, you could try 
mounting the mdt with the "abort_recov" option and see if the behavior changes.

--Rick



On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson"
<lustre-discuss-boun...@lists.lustre.org on behalf of jeff.john...@aeoncomputing.com> wrote:


Maybe someone else in the list can add clarity but I don't believe a recovery 
process on mount would keep the MDS read-only or trigger that trace. Something 
else may be going on. 


I would start from the ground up. Bring your servers up, unmounted. Ensure lnet 
is loaded and configured properly. Test lnet using ping or lnet_selftest from 
your MDS to all of your OSS nodes. Then mount your combined MGS/MDT volume on 
the MDS and see what happens. 




Is your MDS in a high-availability pair? 
What version of Lustre are you running? 




...just a few things readers on the list might want to know.




--Jeff








On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley <mike.mos...@charlotte.edu> wrote:


Jeff,


At this point we have the OSS servers shut down. We were coming back from a full
outage and so we are trying to get the MDS up before starting to bring up the OSS servers.




Mike




On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:


Mike,


Have you made sure the o2ib interfaces on all of your Lustre servers (MDS &
OSS) are functioning properly? Are you able to `lctl ping x.x.x.x@o2ib` 
successfully between MDS and OSS nodes?




--Jeff








On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:


Rick, 172.16.100.4 is the IB address of one of the OSS servers. I
believe the mgt and mdt0 are the same target. My understanding is that we have
a single instance of the MGT, which is on the first MDT server, i.e. it was
created via a command similar to:




# mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb 






Does that make sense?






On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:


Which host is 172.16.100.4? Also, are the mgt and mdt0 on the same target or 
are they two separate targets just on the same host?


--Rick




On 6/21/23, 12:52 PM, "Mike Mosley" mailto:mike.mos...@charlotte.edu> <_blank>  <_blank>>> wrote:




Hi Rick,




The MGS/MDS are combined. The output I posted is from the primary.








Thanks,








Mike








On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:




Mike,




It looks like the mds server is having a problem contacting the mgs server. I'm 
guessing the mgs is a separate host? I would start by looking for possible 
network problems that might explain the LNet timeouts. You can try using "lctl 
ping" to test the LNet connection between nodes, and you can also try regular 
"ping" between the IP addresses on the IB interfaces.




--Rick








On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via 
lustre-discuss" mailto:lustre-discuss-boun...@lists.lustre.org> <_blank> 
 <_blank>> <_blank> 
 <_blank> 
 <_blank>> <_blank>> on behalf 
of lustre-discuss@lists.lustre.org  
<_blank>  <_blank>> <_blank> 
 <_blank> 
 <_blank>> <_blank>>> wrote:








Greetings,








We have experienced some type of issue that is causing both of our MDS servers 
to only be able to mount the mdt device in read only mode. Here are some of the 
error messages we are seeing in the log files below. We lost our Lustre expert 
a while back and we are not sure how to proceed to troubleshoot this issue. Can 
anybody provide us guidance on how to proceed?
















Thanks,
















Mike
















Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more 
than 120 seconds.
Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1 
0x0086
Jun 20 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Jeff Johnson
Maybe someone else in the list can add clarity but I don't believe a
recovery process on mount would keep the MDS read-only or trigger that
trace. Something else may be going on.

I would start from the ground up. Bring your servers up, unmounted. Ensure
lnet is loaded and configured properly. Test lnet using ping or
lnet_selftest from your MDS to all of your OSS nodes. Then mount your
combined MGS/MDT volume on the MDS and see what happens.
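
A rough lnet_selftest sketch, with placeholder NIDs (substitute your own MDS
and OSS o2ib NIDs; the lnet_selftest module must be loaded on both ends):

# modprobe lnet_selftest
# export LST_SESSION=$$
# lst new_session rw_test
# lst add_group servers 172.16.100.4@o2ib     # an OSS NID
# lst add_group clients 172.16.100.1@o2ib     # the MDS NID (placeholder)
# lst add_batch bulk_rw
# lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
# lst run bulk_rw
# lst stat servers                            # watch throughput, Ctrl-C to stop
# lst end_session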

Is your MDS in a high-availability pair?
What version of Lustre are you running?

...just a few things readers on the list might want to know.

--Jeff


On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley 
wrote:

> Jeff,
>
> At this point we have the OSS servers shut down. We were coming back from a full
> outage and so we are trying to get the MDS up before starting to bring up
> the OSS servers.
>
> Mike
>
> On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <
> jeff.john...@aeoncomputing.com> wrote:
>
>> Mike,
>>
>> Have you made sure the o2ib interfaces on all of your Lustre servers
>> (MDS & OSS) are functioning properly? Are you able to `lctl ping
>> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>>
>> --Jeff
>>
>>
>> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <
>> lustre-discuss@lists.lustre.org> wrote:
>>
>>> Rick,
>>> 172.16.100.4 is the IB address of one of the OSS servers. I
>>> believe the mgt and mdt0 are the same target. My understanding is
>>> that we have a single instance of the MGT, which is on the first MDT server,
>>> i.e. it was created via a command similar to:
>>>
>>> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>>>
>>> Does that make sense?
>>>
>>> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick  wrote:
>>>
 Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same
 target or are they two separate targets just on the same host?

 --Rick


 On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:


 Hi Rick,


 The MGS/MDS are combined. The output I posted is from the primary.




 Thanks,




 Mike




 On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:


 Mike,


 It looks like the mds server is having a problem contacting the mgs
 server. I'm guessing the mgs is a separate host? I would start by looking
 for possible network problems that might explain the LNet timeouts. You can
 try using "lctl ping" to test the LNet connection between nodes, and you
 can also try regular "ping" between the IP addresses on the IB interfaces.


 --Rick




 On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
 lustre-discuss" >>> lustre-discuss-boun...@lists.lustre.org> <_blank> >>> lustre-discuss-boun...@lists.lustre.org >>> lustre-discuss-boun...@lists.lustre.org> <_blank>> on behalf of
 lustre-discuss@lists.lustre.org 
 <_blank> >> lustre-discuss@lists.lustre.org> <_blank>>> wrote:




 Greetings,




 We have experienced some type of issue that is causing both of our MDS
 servers to only be able to mount the mdt device in read only mode. Here are
 some of the error messages we are seeing in the log files below. We lost
 our Lustre expert a while back and we are not sure how to proceed to
 troubleshoot this issue. Can anybody provide us guidance on how to proceed?








 Thanks,








 Mike








 Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked
 for more than 120 seconds.
 Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
 /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123
 1 0x0086
 Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
 Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
 Jun 20 15:12:14 hyd-mds1 kernel: []
 schedule_timeout+0x221/0x2d0
 Jun 20 15:12:14 hyd-mds1 kernel: [] ?
 tracing_is_on+0x15/0x30
 Jun 20 15:12:14 hyd-mds1 kernel: [] ?
 tracing_record_cmdline+0x1d/0x120
 Jun 20 15:12:14 hyd-mds1 kernel: [] ?
 probe_sched_wakeup+0x2b/0xa0
 Jun 20 15:12:14 hyd-mds1 kernel: [] ?
 ttwu_do_wakeup+0xb5/0xe0
 Jun 20 15:12:14 hyd-mds1 kernel: []
 wait_for_completion+0xfd/0x140
 Jun 20 15:12:14 hyd-mds1 kernel: [] ?
 wake_up_state+0x20/0x20
 Jun 20 15:12:14 hyd-mds1 kernel: []
 llog_process_or_fork+0x244/0x450 [obdclass]
 Jun 20 15:12:14 hyd-mds1 kernel: []
 llog_process+0x14/0x20 [obdclass]
 Jun 20 15:12:14 hyd-mds1 kernel: []
 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Jeff,

At this point we have the OSS servers shut down. We were coming back from a full
outage and so we are trying to get the MDS up before starting to bring up
the OSS servers.

Mike

On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson 
wrote:

> Mike,
>
> Have you made sure the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>
> --Jeff
>
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
>> Rick,
>> 172.16.100.4 is the IB address of one of the OSS servers. I
>> believe the mgt and mdt0 are the same target. My understanding is that
>> we have a single instance of the MGT, which is on the first MDT server, i.e.
>> it was created via a command similar to:
>>
>> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>>
>> Does that make sense?
>>
>> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick  wrote:
>>
>>> Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same
>>> target or are they two separate targets just on the same host?
>>>
>>> --Rick
>>>
>>>
>>> On 6/21/23, 12:52 PM, "Mike Mosley" >> mike.mos...@charlotte.edu>> wrote:
>>>
>>>
>>> Hi Rick,
>>>
>>>
>>> The MGS/MDS are combined. The output I posted is from the primary.
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>>
>>> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>>>
>>>
>>> Mike,
>>>
>>>
>>> It looks like the mds server is having a problem contacting the mgs
>>> server. I'm guessing the mgs is a separate host? I would start by looking
>>> for possible network problems that might explain the LNet timeouts. You can
>>> try using "lctl ping" to test the LNet connection between nodes, and you
>>> can also try regular "ping" between the IP addresses on the IB interfaces.
>>>
>>>
>>> --Rick
>>>
>>>
>>>
>>>
>>> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
>>> lustre-discuss" >> lustre-discuss-boun...@lists.lustre.org> <_blank> >> lustre-discuss-boun...@lists.lustre.org >> lustre-discuss-boun...@lists.lustre.org> <_blank>> on behalf of
>>> lustre-discuss@lists.lustre.org 
>>> <_blank> > lustre-discuss@lists.lustre.org> <_blank>>> wrote:
>>>
>>>
>>>
>>>
>>> Greetings,
>>>
>>>
>>>
>>>
>>> We have experienced some type of issue that is causing both of our MDS
>>> servers to only be able to mount the mdt device in read only mode. Here are
>>> some of the error messages we are seeing in the log files below. We lost
>>> our Lustre expert a while back and we are not sure how to proceed to
>>> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Mike
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked
>>> for more than 120 seconds.
>>> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123
>>> 1 0x0086
>>> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> schedule_timeout+0x221/0x2d0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> tracing_is_on+0x15/0x30
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> tracing_record_cmdline+0x1d/0x120
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> probe_sched_wakeup+0x2b/0xa0
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> ttwu_do_wakeup+0xb5/0xe0
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> wait_for_completion+0xfd/0x140
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> wake_up_state+0x20/0x20
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> llog_process_or_fork+0x244/0x450 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> llog_process+0x14/0x20 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> class_config_parse_llog+0x125/0x350 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> mgc_process_cfg_log+0x790/0xc40 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> mgc_process_log+0x3dc/0x8f0 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> config_recover_log_add+0x13f/0x280 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> mgc_process_config+0x88b/0x13f0 [mgc]
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> lustre_process_log+0x2d8/0xad0 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> libcfs_debug_msg+0x57/0x80 [libcfs]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> lprocfs_counter_add+0xf9/0x160 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: []
>>> server_start_targets+0x13a4/0x2a20 [obdclass]
>>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>>> 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Jeff Johnson
Mike,

Have you made sure the o2ib interfaces on all of your Lustre servers
(MDS & OSS) are functioning properly? Are you able to `lctl ping
x.x.x.x@o2ib` successfully between MDS and OSS nodes?
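
For example, from the MDS (using the OSS address that shows up in your log
messages):

# lctl list_nids                 # confirm the local o2ib NID is configured
# lctl ping 172.16.100.4@o2ib    # prints the remote node's NIDs if LNet is healthy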

--Jeff


On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Rick,
> 172.16.100.4 is the IB address of one of the OSS servers. I
> believe the mgt and mdt0 are the same target. My understanding is that
> we have a single instance of the MGT, which is on the first MDT server, i.e.
> it was created via a command similar to:
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
> Does that make sense?
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick  wrote:
>
>> Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same
>> target or are they two separate targets just on the same host?
>>
>> --Rick
>>
>>
>> On 6/21/23, 12:52 PM, "Mike Mosley" > mike.mos...@charlotte.edu>> wrote:
>>
>>
>> Hi Rick,
>>
>>
>> The MGS/MDS are combined. The output I posted is from the primary.
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>>
>> Mike
>>
>>
>>
>>
>> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>>
>>
>> Mike,
>>
>>
>> It looks like the mds server is having a problem contacting the mgs
>> server. I'm guessing the mgs is a separate host? I would start by looking
>> for possible network problems that might explain the LNet timeouts. You can
>> try using "lctl ping" to test the LNet connection between nodes, and you
>> can also try regular "ping" between the IP addresses on the IB interfaces.
>>
>>
>> --Rick
>>
>>
>>
>>
>> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
>> lustre-discuss" > lustre-discuss-boun...@lists.lustre.org> <_blank> > lustre-discuss-boun...@lists.lustre.org > lustre-discuss-boun...@lists.lustre.org> <_blank>> on behalf of
>> lustre-discuss@lists.lustre.org 
>> <_blank>  lustre-discuss@lists.lustre.org> <_blank>>> wrote:
>>
>>
>>
>>
>> Greetings,
>>
>>
>>
>>
>> We have experienced some type of issue that is causing both of our MDS
>> servers to only be able to mount the mdt device in read only mode. Here are
>> some of the error messages we are seeing in the log files below. We lost
>> our Lustre expert a while back and we are not sure how to proceed to
>> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>>
>>
>>
>>
>>
>>
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>>
>>
>>
>> Mike
>>
>>
>>
>>
>>
>>
>>
>>
>> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
>> more than 120 seconds.
>> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1
>> 0x0086
>> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
>> Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> schedule_timeout+0x221/0x2d0
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> tracing_is_on+0x15/0x30
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> tracing_record_cmdline+0x1d/0x120
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> probe_sched_wakeup+0x2b/0xa0
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> ttwu_do_wakeup+0xb5/0xe0
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> wait_for_completion+0xfd/0x140
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> wake_up_state+0x20/0x20
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> llog_process_or_fork+0x244/0x450 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> llog_process+0x14/0x20 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> class_config_parse_llog+0x125/0x350 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> mgc_process_cfg_log+0x790/0xc40 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> mgc_process_log+0x3dc/0x8f0 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> config_recover_log_add+0x13f/0x280 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> mgc_process_config+0x88b/0x13f0 [mgc]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> lustre_process_log+0x2d8/0xad0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> libcfs_debug_msg+0x57/0x80 [libcfs]
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> lprocfs_counter_add+0xf9/0x160 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> server_start_targets+0x13a4/0x2a20 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> lustre_start_mgc+0x260/0x2510 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> server_fill_super+0x10cc/0x1890 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: []
>> lustre_fill_super+0x468/0x960 [obdclass]
>> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
>> lustre_common_put_super+0x270/0x270 [obdclass]
>> Jun 20 15:12:14 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Rick,
172.16.100.4 is the IB address of one of the OSS servers. I
believe the mgt and mdt0 are the same target. My understanding is that
we have a single instance of the MGT, which is on the first MDT server, i.e.
it was created via a command similar to:

# mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb

Does that make sense?
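
(If it is useful, I believe the target's stored configuration can be read back
without changing anything, e.g.:

# tunefs.lustre --dryrun /dev/sdb    # prints fsname, index, and flags; a combined target should show both MDT and MGS

but I have not run that here.)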

On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick  wrote:

> Which host is 172.16.100.4?  Also, are the mgt and mdt0 on the same target
> or are they two separate targets just on the same host?
>
> --Rick
>
>
> On 6/21/23, 12:52 PM, "Mike Mosley"  mike.mos...@charlotte.edu>> wrote:
>
>
> Hi Rick,
>
>
> The MGS/MDS are combined. The output I posted is from the primary.
>
>
>
>
> Thanks,
>
>
>
>
> Mike
>
>
>
>
> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
> Mike,
>
>
> It looks like the mds server is having a problem contacting the mgs
> server. I'm guessing the mgs is a separate host? I would start by looking
> for possible network problems that might explain the LNet timeouts. You can
> try using "lctl ping" to test the LNet connection between nodes, and you
> can also try regular "ping" between the IP addresses on the IB interfaces.
>
>
> --Rick
>
>
>
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss"  lustre-discuss-boun...@lists.lustre.org> <_blank>  lustre-discuss-boun...@lists.lustre.org  lustre-discuss-boun...@lists.lustre.org> <_blank>> on behalf of
> lustre-discuss@lists.lustre.org 
> <_blank>  <_blank>>> wrote:
>
>
>
>
> Greetings,
>
>
>
>
> We have experienced some type of issue that is causing both of our MDS
> servers to only be able to mount the mdt device in read only mode. Here are
> some of the error messages we are seeing in the log files below. We lost
> our Lustre expert a while back and we are not sure how to proceed to
> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>
>
>
>
>
>
>
>
> Thanks,
>
>
>
>
>
>
>
>
> Mike
>
>
>
>
>
>
>
>
> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
> more than 120 seconds.
> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1
> 0x0086
> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
> Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
> Jun 20 15:12:14 hyd-mds1 kernel: []
> schedule_timeout+0x221/0x2d0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_is_on+0x15/0x30
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_record_cmdline+0x1d/0x120
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> probe_sched_wakeup+0x2b/0xa0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> ttwu_do_wakeup+0xb5/0xe0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> wait_for_completion+0xfd/0x140
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> wake_up_state+0x20/0x20
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process_or_fork+0x244/0x450 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process+0x14/0x20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> class_config_parse_llog+0x125/0x350 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_cfg_log+0x790/0xc40 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_log+0x3dc/0x8f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> config_recover_log_add+0x13f/0x280 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_config+0x88b/0x13f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_process_log+0x2d8/0xad0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> libcfs_debug_msg+0x57/0x80 [libcfs]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lprocfs_counter_add+0xf9/0x160 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_start_targets+0x13a4/0x2a20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lustre_start_mgc+0x260/0x2510 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_fill_super+0x10cc/0x1890 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_fill_super+0x468/0x960 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lustre_common_put_super+0x270/0x270 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] mount_nodev+0x4f/0xb0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_mount+0x38/0x60 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] mount_fs+0x3e/0x1b0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> vfs_kern_mount+0x67/0x110
> Jun 20 15:12:14 hyd-mds1 kernel: [] do_mount+0x1ef/0xd00
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> __check_object_size+0x1ca/0x250
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> kmem_cache_alloc_trace+0x3c/0x200
> Jun 20 15:12:14 hyd-mds1 kernel: [] 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mike Mosley via lustre-discuss
Hi Rick,

The MGS/MDS are combined.   The output I posted is from the primary.

Thanks,

Mike

On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick  wrote:

> Mike,
>
> It looks like the mds server is having a problem contacting the mgs
> server.  I'm guessing the mgs is a separate host?  I would start by looking
> for possible network problems that might explain the LNet timeouts.  You
> can try using "lctl ping" to test the LNet connection between nodes, and
> you can also try regular "ping" between the IP addresses on the IB
> interfaces.
>
> --Rick
>
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss"  lustre-discuss-boun...@lists.lustre.org> on behalf of
> lustre-discuss@lists.lustre.org >
> wrote:
>
>
> Greetings,
>
>
> We have experienced some type of issue that is causing both of our MDS
> servers to only be able to mount the mdt device in read only mode. Here are
> some of the error messages we are seeing in the log files below. We lost
> our Lustre expert a while back and we are not sure how to proceed to
> troubleshoot this issue. Can anybody provide us guidance on how to proceed?
>
>
>
>
> Thanks,
>
>
>
>
> Mike
>
>
>
>
> Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
> more than 120 seconds.
> Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1
> 0x0086
> Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
> Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
> Jun 20 15:12:14 hyd-mds1 kernel: []
> schedule_timeout+0x221/0x2d0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_is_on+0x15/0x30
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> tracing_record_cmdline+0x1d/0x120
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> probe_sched_wakeup+0x2b/0xa0
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> ttwu_do_wakeup+0xb5/0xe0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> wait_for_completion+0xfd/0x140
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> wake_up_state+0x20/0x20
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process_or_fork+0x244/0x450 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> llog_process+0x14/0x20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> class_config_parse_llog+0x125/0x350 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_cfg_log+0x790/0xc40 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_log+0x3dc/0x8f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> config_recover_log_add+0x13f/0x280 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> mgc_process_config+0x88b/0x13f0 [mgc]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_process_log+0x2d8/0xad0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> libcfs_debug_msg+0x57/0x80 [libcfs]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lprocfs_counter_add+0xf9/0x160 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_start_targets+0x13a4/0x2a20 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lustre_start_mgc+0x260/0x2510 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> class_config_dump_handler+0x7e0/0x7e0 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> server_fill_super+0x10cc/0x1890 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_fill_super+0x468/0x960 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> lustre_common_put_super+0x270/0x270 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] mount_nodev+0x4f/0xb0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> lustre_mount+0x38/0x60 [obdclass]
> Jun 20 15:12:14 hyd-mds1 kernel: [] mount_fs+0x3e/0x1b0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> vfs_kern_mount+0x67/0x110
> Jun 20 15:12:14 hyd-mds1 kernel: [] do_mount+0x1ef/0xd00
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> __check_object_size+0x1ca/0x250
> Jun 20 15:12:14 hyd-mds1 kernel: [] ?
> kmem_cache_alloc_trace+0x3c/0x200
> Jun 20 15:12:14 hyd-mds1 kernel: [] SyS_mount+0x83/0xd0
> Jun 20 15:12:14 hyd-mds1 kernel: []
> system_call_fastpath+0x25/0x2a
> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for
> 172.16.100.4@o2ib: 9 seconds
> Jun 20 15:13:14 hyd-mds1 kernel: LNet:
> 4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous
> similar messages
> Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for
> more than 120 seconds.
> Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1
> 0x0086
>
>
>
>
>
>
> dumpe2fs seems to show that the file systems are clean i.e.
>
>
>
>
> dumpe2fs 1.45.6.wc1 (20-Mar-2020)
> Filesystem volume name: hydra-MDT
> Last mounted on: /
> Filesystem UUID: 3ae09231-7f2a-43b3-a4ee-7f36080b5a66
> Filesystem magic number: 0xEF53
> Filesystem revision #: 1 (dynamic)
> 

Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-21 Thread Mohr, Rick via lustre-discuss
Mike,

It looks like the mds server is having a problem contacting the mgs server.  
I'm guessing the mgs is a separate host?  I would start by looking for possible 
network problems that might explain the LNet timeouts.  You can try using "lctl 
ping" to test the LNet connection between nodes, and you can also try regular 
"ping" between the IP addresses on the IB interfaces.

--Rick


On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via 
lustre-discuss" mailto:lustre-discuss-boun...@lists.lustre.org> on behalf of 
lustre-discuss@lists.lustre.org > wrote:


Greetings,


We have experienced some type of issue that is causing both of our MDS servers 
to only be able to mount the mdt device in read only mode. Here are some of the 
error messages we are seeing in the log files below. We lost our Lustre expert 
a while back and we are not sure how to proceed to troubleshoot this issue. Can 
anybody provide us guidance on how to proceed?




Thanks,




Mike




Jun 20 15:12:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more 
than 120 seconds.
Jun 20 15:12:14 hyd-mds1 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 15:12:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1 
0x0086
Jun 20 15:12:14 hyd-mds1 kernel: Call Trace:
Jun 20 15:12:14 hyd-mds1 kernel: [] schedule+0x29/0x70
Jun 20 15:12:14 hyd-mds1 kernel: [] 
schedule_timeout+0x221/0x2d0
Jun 20 15:12:14 hyd-mds1 kernel: [] ? tracing_is_on+0x15/0x30
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
tracing_record_cmdline+0x1d/0x120
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
probe_sched_wakeup+0x2b/0xa0
Jun 20 15:12:14 hyd-mds1 kernel: [] ? ttwu_do_wakeup+0xb5/0xe0
Jun 20 15:12:14 hyd-mds1 kernel: [] 
wait_for_completion+0xfd/0x140
Jun 20 15:12:14 hyd-mds1 kernel: [] ? wake_up_state+0x20/0x20
Jun 20 15:12:14 hyd-mds1 kernel: [] 
llog_process_or_fork+0x244/0x450 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] llog_process+0x14/0x20 
[obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
class_config_parse_llog+0x125/0x350 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
mgc_process_cfg_log+0x790/0xc40 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
mgc_process_log+0x3dc/0x8f0 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
config_recover_log_add+0x13f/0x280 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
mgc_process_config+0x88b/0x13f0 [mgc]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
lustre_process_log+0x2d8/0xad0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
libcfs_debug_msg+0x57/0x80 [libcfs]
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
lprocfs_counter_add+0xf9/0x160 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
server_start_targets+0x13a4/0x2a20 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
lustre_start_mgc+0x260/0x2510 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
server_fill_super+0x10cc/0x1890 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] 
lustre_fill_super+0x468/0x960 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
lustre_common_put_super+0x270/0x270 [obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] mount_nodev+0x4f/0xb0
Jun 20 15:12:14 hyd-mds1 kernel: [] lustre_mount+0x38/0x60 
[obdclass]
Jun 20 15:12:14 hyd-mds1 kernel: [] mount_fs+0x3e/0x1b0
Jun 20 15:12:14 hyd-mds1 kernel: [] vfs_kern_mount+0x67/0x110
Jun 20 15:12:14 hyd-mds1 kernel: [] do_mount+0x1ef/0xd00
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
__check_object_size+0x1ca/0x250
Jun 20 15:12:14 hyd-mds1 kernel: [] ? 
kmem_cache_alloc_trace+0x3c/0x200
Jun 20 15:12:14 hyd-mds1 kernel: [] SyS_mount+0x83/0xd0
Jun 20 15:12:14 hyd-mds1 kernel: [] 
system_call_fastpath+0x25/0x2a
Jun 20 15:13:14 hyd-mds1 kernel: LNet: 
4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 
172.16.100.4@o2ib: 9 seconds
Jun 20 15:13:14 hyd-mds1 kernel: LNet: 
4458:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Skipped 239 previous similar 
messages
Jun 20 15:14:14 hyd-mds1 kernel: INFO: task mount.lustre:4123 blocked for more 
than 120 seconds.
Jun 20 15:14:14 hyd-mds1 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 20 15:14:14 hyd-mds1 kernel: mount.lustre D 9f27a3bc5230 0 4123 1 
0x0086






dumpe2fs seems to show that the file systems are clean i.e.




dumpe2fs 1.45.6.wc1 (20-Mar-2020)
Filesystem volume name: hydra-MDT
Last mounted on: /
Filesystem UUID: 3ae09231-7f2a-43b3-a4ee-7f36080b5a66
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype mmp 
flex_bg dirdata sparse_super large_file huge_file uninit_bg dir_nlink quota
Filesystem flags: signed_directory_hash 
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 2247671504
Block count: