Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-14 Thread Jonathan Buzzard

On 13/02/18 15:56, Buterbaugh, Kevin L wrote:

> Hi JAB,
>
> OK, let me try one more time to clarify.  I’m not naming the vendor …
> they’re a small maker of commodity storage and we’ve been using their
> stuff for years and, overall, it’s been very solid.  The problem in
> this specific case is that a major version firmware upgrade is
> required … if the controllers were only a minor version apart we
> could do it live.



That makes more sense, but still, do tell which vendor so I can avoid 
them. It's 2018; I expect never to need to take my storage down for *ANY* 
firmware upgrade *EVER* - period.


Any vendor that falls short of that goes on my naughty list, so that I can 
specifically check this is no longer the case before I ever purchase any of 
their kit.


JAB.

--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-13 Thread Buterbaugh, Kevin L
Hi JAB,

OK, let me try one more time to clarify.  I’m not naming the vendor … they’re a 
small maker of commodity storage and we’ve been using their stuff for years 
and, overall, it’s been very solid.  The problem in this specific case is that 
a major version firmware upgrade is required … if the controllers were only a 
minor version apart we could do it live.

And yes, we can upgrade our QLogic SAN switches firmware live … in fact, we’ve 
done that in the past.  Should’ve been more clear there … we just try to do 
that as infrequently as possible.

So the bottom line here is that we were unaware that “major version” firmware 
upgrades could not be done live on our storage, but we’ve got a plan to work 
around it this time.

Kevin

> On Feb 13, 2018, at 7:43 AM, Jonathan Buzzard  
> wrote:
> 
> On Fri, 2018-02-09 at 15:07 +, Buterbaugh, Kevin L wrote:
>> Hi All,
>> 
>> Since several people have made this same suggestion, let me respond
>> to that.  We did ask the vendor - twice - to do that.  Their response
>> boils down to, “No, the older version has bugs and we won’t send you
>> a controller with firmware that we know has bugs in it.”
>> 
>> We have not had a full cluster downtime since the summer of 2016 -
>> and then it was only a one-day downtime to allow the cleaning of our
>> core network switches after an electrical fire in our data center!
>> So the firmware on not only our storage arrays but our SAN switches
>> as well is a bit out of date, shall we say…
>> 
>> That is an issue we need to address internally … our users love us
>> not having regularly scheduled downtimes quarterly, yearly, or
>> whatever, but there is a cost to doing business that way...
>> 
> 
> What sort of storage arrays are you using that don't allow you to do a
> live update of the controller firmware? Heck these days even cheapy
> Dell MD3 series storage arrays allow you to do live drive firmware
> updates.
> 
> Similarly with SAN switches surely you have separate A/B fabrics and
> can upgrade them one at a time live.
> 
> In a properly designed system one should not need to schedule downtime
> for firmware updates. He says as he plans a firmware update on his
> routers for next Tuesday morning, with no scheduled downtime and no
> interruption to service.
> 
> JAB.
> 
> -- 
> Jonathan A. Buzzard Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
> 
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C16b7c1eca3d846afc65208d572e7b6f1%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C1%7C636541261898197334&sdata=fY66HEDEia55g2x18VETOmE755IH7lXAfoznAewCe5A%3D&reserved=0

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-13 Thread Jonathan Buzzard
On Fri, 2018-02-09 at 15:07 +, Buterbaugh, Kevin L wrote:
> Hi All,
> 
> Since several people have made this same suggestion, let me respond
> to that.  We did ask the vendor - twice - to do that.  Their response
> boils down to, “No, the older version has bugs and we won’t send you
> a controller with firmware that we know has bugs in it.”
> 
> We have not had a full cluster downtime since the summer of 2016 -
> and then it was only a one-day downtime to allow the cleaning of our
> core network switches after an electrical fire in our data center!
> So the firmware on not only our storage arrays but our SAN switches
> as well is a bit out of date, shall we say…
> 
> That is an issue we need to address internally … our users love us
> not having regularly scheduled downtimes quarterly, yearly, or
> whatever, but there is a cost to doing business that way...
> 

What sort of storage arrays are you using that don't allow you to do a
live update of the controller firmware? Heck these days even cheapy
Dell MD3 series storage arrays allow you to do live drive firmware
updates.

Similarly with SAN switches surely you have separate A/B fabrics and
can upgrade them one at a time live.

In a properly designed system one should not need to schedule downtime
for firmware updates. He says as he plans a firmware update on his
routers for next Tuesday morning, with no scheduled downtime and no
interruption to service.

JAB.

-- 
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-09 Thread Paul Ward
Not sure why it took over a day for my message to be sent out by the list?

If it’s the firmware you currently have, I would still prefer to have it sent 
to me; then I would be able to do a controller firmware update online during 
an at-risk period rather than a downtime. All the time you are running on one 
controller you are at risk!
Seems you have an alternative.

Paul Ward
Technical Solutions Infrastructure Architect
Natural History Museum
T: 02079426450
E: p.w...@nhm.ac.uk

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Buterbaugh, 
Kevin L
Sent: 09 February 2018 15:08
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] mmchdisk suspend / stop

Hi All,

Since several people have made this same suggestion, let me respond to that.  
We did ask the vendor - twice - to do that.  Their response boils down to, “No, 
the older version has bugs and we won’t send you a controller with firmware 
that we know has bugs in it.”

We have not had a full cluster downtime since the summer of 2016 - and then it 
was only a one-day downtime to allow the cleaning of our core network switches 
after an electrical fire in our data center!  So the firmware on not only our 
storage arrays but our SAN switches as well is a bit out of date, shall we 
say…

That is an issue we need to address internally … our users love us not having 
regularly scheduled downtimes quarterly, yearly, or whatever, but there is a 
cost to doing business that way...

Kevin


On Feb 8, 2018, at 10:46 AM, Paul Ward 
mailto:p.w...@nhm.ac.uk>> wrote:

We tend to get the maintenance company to downgrade the firmware to match what 
we have on our aging hardware before sending it to us.
I assume this isn’t an option?

Paul Ward
Technical Solutions Infrastructure Architect
Natural History Museum
T: 02079426450
E: p.w...@nhm.ac.uk<mailto:p.w...@nhm.ac.uk>

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-09 Thread Buterbaugh, Kevin L
Hi All,

Since several people have made this same suggestion, let me respond to that.  
We did ask the vendor - twice - to do that.  Their response boils down to, “No, 
the older version has bugs and we won’t send you a controller with firmware 
that we know has bugs in it.”

We have not had a full cluster downtime since the summer of 2016 - and then it 
was only a one-day downtime to allow the cleaning of our core network switches 
after an electrical fire in our data center!  So the firmware on not only our 
storage arrays but our SAN switches as well is a bit out of date, shall we 
say…

That is an issue we need to address internally … our users love us not having 
regularly scheduled downtimes quarterly, yearly, or whatever, but there is a 
cost to doing business that way...

Kevin

On Feb 8, 2018, at 10:46 AM, Paul Ward 
mailto:p.w...@nhm.ac.uk>> wrote:

We tend to get the maintenance company to downgrade the firmware to match what 
we have on our aging hardware before sending it to us.
I assume this isn’t an option?

Paul Ward
Technical Solutions Infrastructure Architect
Natural History Museum
T: 02079426450
E: p.w...@nhm.ac.uk

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-09 Thread Paul Ward
We tend to get the maintenance company to downgrade the firmware to match what 
we have on our aging hardware before sending it to us.
I assume this isn’t an option?

Paul Ward
Technical Solutions Infrastructure Architect
Natural History Museum
T: 02079426450
E: p.w...@nhm.ac.uk

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Buterbaugh, 
Kevin L
Sent: 08 February 2018 16:00
To: gpfsug main discussion list 
Subject: [gpfsug-discuss] mmchdisk suspend / stop

Hi All,

We are in a bit of a difficult situation right now with one of our non-IBM 
hardware vendors (I know, I know, I KNOW - buy IBM hardware!) and are 
looking for some advice on how to deal with this unfortunate situation.

We have a non-IBM FC storage array with dual-“redundant” controllers.  One of 
those controllers is dead and the vendor is sending us a replacement.  However, 
the replacement controller will have mis-matched firmware with the surviving 
controller and - long story short - the vendor says there is no way to resolve 
that without taking the storage array down for firmware upgrades.  Needless to 
say there’s more to that story than what I’ve included here, but I won’t bore 
everyone with unnecessary details.

The storage array has 5 NSDs on it, but fortunately they are part of our 
“capacity” pool … i.e. the only way a file lands there is if an mmapplypolicy 
scan moved it there because its *access* time is more than 90 days old.  
Filesystem data replication is set to one.

So … what I was wondering is whether I could use mmchdisk to either suspend or 
(preferably) stop those NSDs, do the firmware upgrade, and then resume the 
NSDs.  The problem I see is that suspend doesn’t stop I/O; it only prevents 
the allocation of new blocks … so, in theory, if a user suddenly decided to 
start using a file they hadn’t needed for 3 months then I’ve got a problem.  
Stopping all I/O to the disks is what I really want to do.  However, according 
to the mmchdisk man page, stop cannot be used on a filesystem with replication 
set to one.

There’s over 250 TB of data on those 5 NSDs, so restriping off of them or 
setting replication to two are not options.

It is very unlikely that anyone would try to access a file on those NSDs during 
the hour or so I’d need to do the firmware upgrades, but how would GPFS itself 
react to those (suspended) disks going away for a while?  I’m thinking I could 
be OK if there was just a way to actually stop them rather than suspend them.  
Any undocumented options to mmchdisk that I’m not aware of???

Are there other options - besides buying IBM hardware - that I am overlooking?  
Thanks...
—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
kevin.buterba...@vanderbilt.edu<mailto:kevin.buterba...@vanderbilt.edu> - 
(615)875-9633



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-08 Thread Edward Wahl

I'm with Richard on this one.  Sounds dubious to me.
Even older-style stuff could start a new controller in a 'failed' or 'service' 
state and push firmware back in the 20th century...  ;)

Ed


On Thu, 8 Feb 2018 16:23:33 +
"Sobey, Richard A"  wrote:

> Sorry I can’t help… the only thing going round and round my head right now is
> why on earth the existing controller cannot push the required firmware to the
> new one when it comes online. I’ve never heard of anything like it! Feel free
> to name and shame so I can avoid 😊
> 
> Richard
> 
> From: gpfsug-discuss-boun...@spectrumscale.org
> [mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Buterbaugh, Kevin L
> Sent: 08 February 2018 16:00
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] mmchdisk suspend / stop
> 
> Hi All,
> 
> We are in a bit of a difficult situation right now with one of our non-IBM
> hardware vendors (I know, I know, I KNOW - buy IBM hardware!) and are
> looking for some advice on how to deal with this unfortunate situation.
> 
> We have a non-IBM FC storage array with dual-“redundant” controllers.  One of
> those controllers is dead and the vendor is sending us a replacement.
> However, the replacement controller will have mis-matched firmware with the
> surviving controller and - long story short - the vendor says there is no way
> to resolve that without taking the storage array down for firmware upgrades.
> Needless to say there’s more to that story than what I’ve included here, but
> I won’t bore everyone with unnecessary details.
> 
> The storage array has 5 NSDs on it, but fortunately they are part of
> our “capacity” pool … i.e. the only way a file lands there is if an
> mmapplypolicy scan moved it there because its *access* time is more than
> 90 days old.  Filesystem data replication is set to one.
> 
> So … what I was wondering is whether I could use mmchdisk to either suspend
> or (preferably) stop those NSDs, do the firmware upgrade, and then resume
> the NSDs.  The problem I see is that suspend doesn’t stop I/O; it only
> prevents the allocation of new blocks … so, in theory, if a user suddenly
> decided to start using a file they hadn’t needed for 3 months then I’ve got
> a problem.  Stopping all I/O to the disks is what I really want to do.
> However, according to the mmchdisk man page, stop cannot be used on a
> filesystem with replication set to one.
> 
> There’s over 250 TB of data on those 5 NSDs, so restriping off of them or
> setting replication to two are not options.
> 
> It is very unlikely that anyone would try to access a file on those NSDs
> during the hour or so I’d need to do the firmware upgrades, but how would
> GPFS itself react to those (suspended) disks going away for a while?  I’m
> thinking I could be OK if there was just a way to actually stop them rather
> than suspend them.  Any undocumented options to mmchdisk that I’m not aware
> of???
> 
> Are there other options - besides buying IBM hardware - that I am
> overlooking?  Thanks... —
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and Education
> kevin.buterba...@vanderbilt.edu<mailto:kevin.buterba...@vanderbilt.edu> -
> (615)875-9633
> 
> 
> 



-- 

Ed Wahl
Ohio Supercomputer Center
614-292-9302
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-08 Thread valdis . kletnieks
On Thu, 08 Feb 2018 16:25:33 +, "Oesterlin, Robert" said:

> unmountOnDiskFail

> The unmountOnDiskFail specifies how the GPFS daemon responds when a disk
> failure is detected. The valid values of this parameter are yes, no, and meta.
> The  default value is no.

I suspect that the only relevant setting there is the default 'no' - it sounds 
like these 5 NSDs are just one storage pool in a much larger filesystem, and 
Kevin doesn't want the entire thing to unmount if GPFS notices that the NSDs 
went walkies.
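For anyone wanting to confirm that sort of layout, a rough sketch of how one 
might check which pool a set of NSDs belongs to and how full that pool is 
(the filesystem name "gpfs0" and pool name "capacity" are placeholders, not 
taken from the thread):

   # list the filesystem's disks, including their storage pool assignment
   mmlsdisk gpfs0 -L

   # show capacity usage broken down for just that pool
   mmdf gpfs0 -P capacity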


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-08 Thread Oesterlin, Robert
Check out the “unmountOnDiskFail” config parameter, perhaps?

https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1adm_tuningguide.htm

unmountOnDiskFail
The unmountOnDiskFail specifies how the GPFS daemon responds when a disk 
failure is detected. The valid values of this parameter are yes, no, and meta. 
The default value is no.

I have it set to “meta”, which prevents the file system from unmounting if an 
NSD fails and the metadata is still available. I have two replicas of metadata 
and one of data.
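For reference, a minimal sketch of inspecting and changing that parameter - 
the commands are the standard mmlsconfig/mmchconfig forms, but check your 
release's tuning guide for whether the change takes effect immediately:

   # show the current value
   mmlsconfig unmountOnDiskFail

   # set it to 'meta' cluster-wide
   mmchconfig unmountOnDiskFail=meta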

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

From:  on behalf of "Buterbaugh, 
Kevin L" 
Reply-To: gpfsug main discussion list 
Date: Thursday, February 8, 2018 at 10:15 AM
To: gpfsug main discussion list 
Subject: [EXTERNAL] [gpfsug-discuss] mmchdisk suspend / stop

So … what I was wondering is whether I could use mmchdisk to either suspend or 
(preferably) stop those NSDs, do the firmware upgrade, and then resume the 
NSDs.  The problem I see is that suspend doesn’t stop I/O; it only prevents 
the allocation of new blocks … so, in theory, if a user suddenly decided to 
start using a file they hadn’t needed for 3 months then I’ve got a problem.  
Stopping all I/O to the disks is what I really want to do.  However, according 
to the mmchdisk man page, stop cannot be used on a filesystem with replication 
set to one.
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] mmchdisk suspend / stop

2018-02-08 Thread Sobey, Richard A
Sorry I can’t help… the only thing going round and round my head right now is 
why on earth the existing controller cannot push the required firmware to the 
new one when it comes online. I’ve never heard of anything like it! Feel free 
to name and shame so I can avoid 😊

Richard

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Buterbaugh, 
Kevin L
Sent: 08 February 2018 16:00
To: gpfsug main discussion list 
Subject: [gpfsug-discuss] mmchdisk suspend / stop

Hi All,

We are in a bit of a difficult situation right now with one of our non-IBM 
hardware vendors (I know, I know, I KNOW - buy IBM hardware!) and are 
looking for some advice on how to deal with this unfortunate situation.

We have a non-IBM FC storage array with dual-“redundant” controllers.  One of 
those controllers is dead and the vendor is sending us a replacement.  However, 
the replacement controller will have mis-matched firmware with the surviving 
controller and - long story short - the vendor says there is no way to resolve 
that without taking the storage array down for firmware upgrades.  Needless to 
say there’s more to that story than what I’ve included here, but I won’t bore 
everyone with unnecessary details.

The storage array has 5 NSDs on it, but fortunately they are part of our 
“capacity” pool … i.e. the only way a file lands there is if an mmapplypolicy 
scan moved it there because its *access* time is more than 90 days old.  
Filesystem data replication is set to one.

So … what I was wondering is whether I could use mmchdisk to either suspend or 
(preferably) stop those NSDs, do the firmware upgrade, and then resume the 
NSDs.  The problem I see is that suspend doesn’t stop I/O; it only prevents 
the allocation of new blocks … so, in theory, if a user suddenly decided to 
start using a file they hadn’t needed for 3 months then I’ve got a problem.  
Stopping all I/O to the disks is what I really want to do.  However, according 
to the mmchdisk man page, stop cannot be used on a filesystem with replication 
set to one.

There’s over 250 TB of data on those 5 NSDs, so restriping off of them or 
setting replication to two are not options.

It is very unlikely that anyone would try to access a file on those NSDs during 
the hour or so I’d need to do the firmware upgrades, but how would GPFS itself 
react to those (suspended) disks going away for a while?  I’m thinking I could 
be OK if there was just a way to actually stop them rather than suspend them.  
Any undocumented options to mmchdisk that I’m not aware of???

Are there other options - besides buying IBM hardware - that I am overlooking?  
Thanks...
—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
kevin.buterba...@vanderbilt.edu<mailto:kevin.buterba...@vanderbilt.edu> - 
(615)875-9633



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] mmchdisk suspend / stop

2018-02-08 Thread Buterbaugh, Kevin L
Hi All,

We are in a bit of a difficult situation right now with one of our non-IBM 
hardware vendors (I know, I know, I KNOW - buy IBM hardware!) and are 
looking for some advice on how to deal with this unfortunate situation.

We have a non-IBM FC storage array with dual-“redundant” controllers.  One of 
those controllers is dead and the vendor is sending us a replacement.  However, 
the replacement controller will have mis-matched firmware with the surviving 
controller and - long story short - the vendor says there is no way to resolve 
that without taking the storage array down for firmware upgrades.  Needless to 
say there’s more to that story than what I’ve included here, but I won’t bore 
everyone with unnecessary details.

The storage array has 5 NSDs on it, but fortunately they are part of our 
“capacity” pool … i.e. the only way a file lands there is if an mmapplypolicy 
scan moved it there because its *access* time is more than 90 days old.  
Filesystem data replication is set to one.
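As an aside, a migration rule of roughly that shape in the GPFS policy 
language might look like the sketch below - the rule name and source pool 
'system' are assumptions; only the 'capacity' pool and the 90-day access-time 
threshold come from Kevin's description:

   /* move files not accessed for more than 90 days to the capacity pool */
   RULE 'age_out' MIGRATE FROM POOL 'system' TO POOL 'capacity'
        WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 90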

So … what I was wondering is whether I could use mmchdisk to either suspend or 
(preferably) stop those NSDs, do the firmware upgrade, and then resume the 
NSDs.  The problem I see is that suspend doesn’t stop I/O; it only prevents 
the allocation of new blocks … so, in theory, if a user suddenly decided to 
start using a file they hadn’t needed for 3 months then I’ve got a problem.  
Stopping all I/O to the disks is what I really want to do.  However, according 
to the mmchdisk man page, stop cannot be used on a filesystem with replication 
set to one.
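For concreteness, the sequence being contemplated would look something like 
the following - the filesystem and NSD names are placeholders, and as noted 
above the stop form is refused when data replication is one:

   # suspend: no new block allocations, but I/O to existing blocks continues
   mmchdisk gpfs0 suspend -d "nsd10;nsd11;nsd12;nsd13;nsd14"

   # ... perform the controller firmware upgrade ...

   # bring the disks back into normal use afterwards
   mmchdisk gpfs0 resume -d "nsd10;nsd11;nsd12;nsd13;nsd14"

   # the preferred-but-unavailable variant, which would halt all I/O:
   mmchdisk gpfs0 stop -d "nsd10;nsd11;nsd12;nsd13;nsd14"
   mmchdisk gpfs0 start -d "nsd10;nsd11;nsd12;nsd13;nsd14"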

There’s over 250 TB of data on those 5 NSDs, so restriping off of them or 
setting replication to two are not options.
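For reference, the restriping approach being ruled out would look roughly 
like this (same placeholder names as above; mmrestripefs -r migrates data 
off suspended disks):

   mmchdisk gpfs0 suspend -d "nsd10;nsd11;nsd12;nsd13;nsd14"
   mmrestripefs gpfs0 -r

With 250+ TB to move, it is easy to see why that is a non-starter here.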

It is very unlikely that anyone would try to access a file on those NSDs during 
the hour or so I’d need to do the firmware upgrades, but how would GPFS itself 
react to those (suspended) disks going away for a while?  I’m thinking I could 
be OK if there was just a way to actually stop them rather than suspend them.  
Any undocumented options to mmchdisk that I’m not aware of???

Are there other options - besides buying IBM hardware - that I am overlooking?  
Thanks...

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
kevin.buterba...@vanderbilt.edu - 
(615)875-9633



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss