Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-17 Thread Adrian Saul

I have SELinux disabled and the RPM post-upgrade scripts still run restorecon on 
/var/lib/ceph regardless.

In my case I chose to kill the restorecon processes to save outage time – it 
didn’t affect completion of the package upgrade.
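
For anyone in the same spot, a minimal sketch of that workaround (these exact 
commands are my addition, not something from the original mail):

   # confirm SELinux really is disabled, so skipping the relabel is safe
   getenforce
   # find the restorecon spawned by the ceph RPM post-upgrade scriptlet
   pgrep -a restorecon
   # stop it; the package transaction still completes
   pkill restorecon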


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mykola 
Dvornik
Sent: Friday, 15 July 2016 6:54 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

I would also advise people to mind SELinux if it is enabled on the OSD 
nodes.
The re-labeling has to be done as part of the upgrade, and it is a rather 
time-consuming process.


-----Original Message-----
From: Mart van Santen <m...@greenhost.nl>
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel
Date: Fri, 15 Jul 2016 10:48:40 +0200

[quoted message and earlier thread trimmed]

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-15 Thread Wido den Hollander

> On 15 July 2016 at 10:48, Mart van Santen wrote:
> 
> 
> 
> Hi Wido,
> 
> Thank you, we are currently in the same process, so this information is
> very useful. Can you share why you upgraded from Hammer directly to
> Jewel? Is there a reason to skip Infernalis? I wonder why you didn't
> do a hammer->infernalis->jewel upgrade, as that seems the logical path
> to me.
> 

LTS to LTS upgrades, that's why. We tested it at a small scale on a few VMs and 
afterwards did the production cluster.

We needed to go to Jewel due to some fixes for large clusters, and for RGW 
features (AWS4) and fixes.

Wido

> [rest of quoted message and earlier thread trimmed]

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-15 Thread Christian Balzer
Hello,

On Fri, 15 Jul 2016 10:48:40 +0200 Mart van Santen wrote:

> 
> Hi Wido,
> 
> Thank you, we are currently in the same process, so this information is
> very useful. Can you share why you upgraded from Hammer directly to
> Jewel? Is there a reason to skip Infernalis? I wonder why you didn't
> do a hammer->infernalis->jewel upgrade, as that seems the logical path
> to me.
> 
Hammer and Jewel are long-term (for various definitions of long term)
stable releases.
So an upgrade from Hammer to Jewel is the logical thing and the path most
people who don't want bleeding edge on their production clusters will take.

Infernalis stopped receiving any updates/bugfixes the moment Jewel was
released. 
So theoretically you might be upgrading into something that has known and
unfixed bugs when going via Infernalis. 

And then there's the whole thing of restarting all your MONs and OSDs with
all the potential fun that can entail (as well as likely being forced to
do this during late night/weekend maintenance windows).
From where I'm standing, upgrading to the latest Hammer and then to Jewel is
already one step too many; no need to add another one.

Lastly, given all the outstanding issues with 0.94.7 AND the latest Jewel,
I'm going to sit on the sidelines some more, especially since my staging
cluster HW just arrived.

Christian

> [quoted message and earlier thread trimmed]

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-15 Thread Mykola Dvornik
I would also advise people to mind SELinux if it is enabled on the
OSD nodes.
The re-labeling has to be done as part of the upgrade, and it is a
rather time-consuming process.
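
A minimal sketch of doing the relabeling ahead of the upgrade window to keep the
outage short (my assumption of how one could prepare, not part of the advice
above; it assumes the Ceph SELinux file contexts are already installed):

   # check the label currently applied to the OSD data
   ls -dZ /var/lib/ceph
   # relabel in advance, so the relabel run during the package upgrade
   # has little left to do
   restorecon -Rv /var/lib/ceph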

-----Original Message-----
From: Mart van Santen <m...@greenhost.nl>
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel
Date: Fri, 15 Jul 2016 10:48:40 +0200

[quoted message and earlier thread trimmed]

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-15 Thread Sean Redmond
Hi Mart,

I too have followed the upgrade from Hammer to Jewel; I think it is pretty
accepted to upgrade between LTS releases (H>J), skipping the 'stable'
releases (I) in the middle.

Thanks

On Fri, Jul 15, 2016 at 9:48 AM, Mart van Santen  wrote:

> [quoted message and earlier thread trimmed]

Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-15 Thread Mart van Santen

Hi Wido,

Thank you, we are currently in the same process, so this information is
very useful. Can you share why you upgraded from Hammer directly to
Jewel? Is there a reason to skip Infernalis? I wonder why you didn't
do a hammer->infernalis->jewel upgrade, as that seems the logical path
to me.

(we did indeed see the same errors "Failed to encode map eXXX with
expected crc" when upgrading to the latest Hammer)


Regards,

Mart







On 07/15/2016 03:08 AM, 席智勇 wrote:
> [quoted message and earlier thread trimmed]

-- 
Mart van Santen
Greenhost
E: m...@greenhost.nl
T: +31 20 4890444
W: https://greenhost.nl

A PGP signature can be attached to this e-mail,
you need PGP software to verify it. 
My public key is available in keyserver(s)
see: http://tinyurl.com/openpgp-manual

PGP Fingerprint: CA85 EB11 2B70 042D AF66  B29A 6437 01A1 10A3 D3A5



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-14 Thread 席智勇
good job, thank you for sharing, Wido~
it's very useful~

2016-07-14 14:33 GMT+08:00 Wido den Hollander :

> [quoted message trimmed]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-14 Thread Wido den Hollander
To add, the RGWs upgraded just fine as well.

No regions in use here (yet!), so that upgraded as it should.

Wido

> On 13 July 2016 at 16:56, Wido den Hollander wrote:
> [quoted message trimmed]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-13 Thread Luis Periquito
Thanks for sharing, Wido.

From your information you only talk about the MONs and OSDs. What about the
RGW nodes? You stated in the beginning that 99% is RGW...

On Wed, Jul 13, 2016 at 3:56 PM, Wido den Hollander  wrote:
> [quoted message trimmed]
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-13 Thread Wido den Hollander
Hello,

For the last 3 days I worked at a customer with an 1800-OSD cluster which had to be 
upgraded from Hammer 0.94.5 to Jewel 10.2.2.

The cluster in this case is 99% RGW, but also some RBD.

I wanted to share some of the things we encountered during this upgrade.

All 180 nodes are running CentOS 7.1 on an IPv6-only network.

** Hammer Upgrade **
At first we upgraded from 0.94.5 to 0.94.7. This went well, except for the fact 
that the monitors got spammed with these kinds of messages:

  "Failed to encode map eXXX with expected crc"

Some searching on the list brought me to:

  ceph tell osd.* injectargs -- --clog_to_monitors=false
  
 This reduced the load on the 5 monitors and made recovery succeed smoothly.
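
A small sketch of how this can be used around the upgrade window (re-enabling it
afterwards is my addition, not something mentioned in the original post):

   # mute OSD log messages sent to the monitors while OSD maps churn
   ceph tell osd.* injectargs -- --clog_to_monitors=false
   # ... upgrade and wait for recovery to settle ...
   # restore the default behaviour once the cluster is healthy again
   ceph tell osd.* injectargs -- --clog_to_monitors=true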
 
 ** Monitors to Jewel **
 The next step was to upgrade the monitors from Hammer to Jewel.
 
 Using Salt we upgraded the packages and afterwards it was simple:
 
   killall ceph-mon
   chown -R ceph:ceph /var/lib/ceph
   chown -R ceph:ceph /var/log/ceph

Now, a systemd quirk: 'systemctl start ceph.target' does not work; I had to 
manually enable the monitor and start it:

  systemctl enable ceph-mon@srv-zmb04-05.service
  systemctl start ceph-mon@srv-zmb04-05.service

Afterwards the monitors were running just fine.
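
For reference, a generalized form of those two commands, assuming (an assumption
on my part, not stated above) that the monitor name equals the node's short
hostname:

   systemctl enable ceph-mon@$(hostname -s).service
   systemctl start ceph-mon@$(hostname -s).service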

** OSDs to Jewel **
To upgrade the OSDs to Jewel we first used Salt to update the packages on 
all systems to 10.2.2; we then used a shell script which we ran on one node 
at a time.

The failure domain here is 'rack', so we executed this in one rack, then the 
next one, etc, etc.

Script can be found on Github: 
https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
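
For readers who do not want to follow the link, a rough sketch of what such a
per-node script typically looks like (this is my reconstruction, not the gist
above; it assumes systemd-managed OSDs and that setting 'noout' is acceptable
for the duration of the restart):

   # keep CRUSH from rebalancing while the OSDs on this node are down
   ceph osd set noout
   # stop the OSDs, fix ownership for the new 'ceph' user, start them again
   systemctl stop ceph-osd.target
   chown -R ceph:ceph /var/lib/ceph /var/log/ceph
   systemctl start ceph-osd.target
   # check status; repeat until all PGs are active+clean before the next node/rack
   ceph -s
   ceph osd unset noout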

Be aware that the chown can take a long, long, very long time!
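
If the chown is too slow for the maintenance window, one alternative (my
addition; please verify against the Jewel release notes before relying on it) is
to keep the daemons running as root for the time being and do the ownership
change later, by adding this to ceph.conf on the affected nodes:

   [osd]
   setuser match path = /var/lib/ceph/$type/$cluster-$id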

We ran into an issue where some OSDs crashed after starting, but after trying 
again they would start.

  "void FileStore::init_temp_collections()"
  
I reported this in the tracker as I'm not sure what is happening here: 
http://tracker.ceph.com/issues/16672
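
For clarity, retrying an OSD that died on startup amounts to something like this
(my sketch, assuming systemd-managed OSDs; <id> is a placeholder for the OSD
number):

   # see why the OSD died
   journalctl -u ceph-osd@<id> -n 100
   # start it again; in this case the second attempt came up fine
   systemctl start ceph-osd@<id>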

** New OSDs with Jewel **
We also had some new nodes which we wanted to add to the Jewel cluster.

Using Salt and ceph-disk we ran into a partprobe issue in combination with 
ceph-disk. There was already a Pull Request for the fix, but that was not 
included in Jewel 10.2.2.

We manually applied the PR and it fixed our issues: 
https://github.com/ceph/ceph/pull/9330
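
A sketch of one way to carry such a fix until the next point release (my
assumption of the workflow, not the exact commands used; the target path and -p
level depend on the distro's Python layout, so check with --dry-run first):

   # fetch the pull request as a patch and review it
   curl -L https://github.com/ceph/ceph/pull/9330.patch -o pr-9330.patch
   # test-apply against the installed ceph-disk module, then apply for real
   patch --dry-run -p3 -d /usr/lib/python2.7/site-packages < pr-9330.patch
   patch -p3 -d /usr/lib/python2.7/site-packages < pr-9330.patch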

Hope this helps other people with their upgrades to Jewel!

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com