[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2022-03-18 Thread James Page
** Changed in: ceph (Ubuntu)
 Assignee: James Page (james-page) => (unassigned)

** Changed in: ceph (Ubuntu)
   Status: Incomplete => Opinion

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1874939

Title:
  ceph-osd can't connect after upgrade to focal

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1874939/+subscriptions


-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2021-01-11 Thread Jay Ring
That sounds promising.

I replaced my node a while ago so I can't verify this one way or the
other, but it certainly sounds like it may be the problem, and it would
explain why James Page could not reproduce it on his fresh install.

One of the reasons I bothered confirming the bug report was so that
future searches for this error would lead to whatever solution was
eventually found.  Hopefully it will help them.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2021-01-11 Thread Matthias Hüther
I have the same issue.
That's why I've been testing a few things over the last few days:

Upgrade process:
Luminous -> Mimic -> Nautilus -> Octopus
(All Versions run under Bionic)


It doesn't matter whether I activate msgr2 or not. I always get the problem 
after upgrading to Octopus:
2021-01-11T09:46:33.674+ 7fb8cf2d1700  1 osd.0 194 tick checking mon for new map
2021-01-11T09:47:04.490+ 7fb8cf2d1700  1 osd.0 194 tick checking mon for new map
2021-01-11T09:47:34.514+ 7fb8cf2d1700  1 osd.0 194 tick checking mon for new map
2021-01-11T09:48:05.451+ 7fb8cf2d1700  1 osd.0 194 tick checking mon for new map

With a freshly installed Ceph Mimic, updated to Nautilus -> Octopus,
I don't get this problem.
The problem apparently only comes from the update process from Luminous to
Mimic, and it surfaces by Octopus at the latest.

Workaround: Execute command on one of the Ceph monitors:
ceph osd require-osd-release mimic
After that, the Octopus OSDs can connect again.

Perhaps it is a good idea to run the "ceph osd require-osd-release
[version]" command after every update.

e.g.:
After update Luminous -> Mimic
-> command: ceph osd require-osd-release mimic

After update Mimic -> Nautilus
-> command: ceph osd require-osd-release nautilus

After update Nautilus -> Octopus
-> command: ceph osd require-osd-release octopus

Apparently this is not done by the charms yet. Maybe the charms should
do that or it should be mentioned in the charm documentation. What do
you think about that?
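
A minimal sketch of how the state behind this workaround could be checked
before running it. It assumes the plain-text output of "ceph osd dump"
contains a line of the form "require_osd_release <name>"; the function names
and the RELEASES list are hypothetical helpers for illustration, not part of
Ceph or the charms.

```python
# Releases named in this thread, oldest first (illustrative, not exhaustive).
RELEASES = ["luminous", "mimic", "nautilus", "octopus"]


def current_required_release(osd_dump_text):
    """Return the value of the require_osd_release line, or None if absent."""
    for line in osd_dump_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "require_osd_release":
            return parts[1]
    return None


def release_flag_is_stale(osd_dump_text, running_release):
    """True when the flag names an older release than the one now running.

    running_release must be one of RELEASES. An absent or unknown flag is
    treated as stale so that it gets looked at by a human.
    """
    required = current_required_release(osd_dump_text)
    if required is None or required not in RELEASES:
        return True
    return RELEASES.index(required) < RELEASES.index(running_release)
```

If the flag turns out to be stale after an upgrade step, that is the point
at which the matching "ceph osd require-osd-release" command from the list
above would be run.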


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-12-09 Thread Trent Lloyd
This issue appears to be documented here:
https://docs.ceph.com/en/latest/releases/nautilus/#instructions

Complete the upgrade by disallowing pre-Nautilus OSDs and enabling all
new Nautilus-only functionality:

# ceph osd require-osd-release nautilus
Important: This step is mandatory. Failure to execute this step will make it
impossible for OSDs to communicate after msgrv2 is enabled.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-09-17 Thread madar
I am in the middle of a mimic -> nautilus -> octopus upgrade, and got
the same 'tick checking mon for new map' cycle from my 15.2.3 OSD
daemons. After

$ ceph osd require-osd-release mimic

octopus OSDs can connect to the cluster.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread Jay Ring
tail -f /var/log/ceph/ceph-osd.13.log
2020-05-22T17:27:43.909-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:28:14.825-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:28:44.838-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:29:14.914-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:29:45.718-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:30:16.515-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:30:46.539-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:31:16.543-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:31:46.671-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map
2020-05-22T17:32:16.792-0500 7f44708ca700  1 osd.13 46107 tick checking mon for new map


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread Jay Ring
/etc/ceph/ceph.conf
mon host = 192.168.120.1 192.168.120.2 192.168.120.3


ceph mon dump:
epoch 7
fsid 
last_changed 2020-05-16T23:16:32.234657-0500
created 2016-04-08T10:30:10.123758-0500
min_mon_release 15 (octopus)
0: [v2:192.168.120.1:3300/0,v1:192.168.120.1:6789/0] mon.temple-h1
1: [v2:192.168.120.2:3300/0,v1:192.168.120.2:6789/0] mon.temple-h2
2: [v2:192.168.120.3:3300/0,v1:192.168.120.3:6789/0] mon.temple-h3


netstat -ltup |grep ceph-mon:
tcp        0      0 temple-h1:3300          0.0.0.0:*               LISTEN      1722/ceph-mon
tcp        0      0 temple-h1:6789          0.0.0.0:*               LISTEN      1722/ceph-mon


I doubt this matters, but it might.  These drives were formatted with
ceph-disk, not ceph-volume.  They are, however, mounted in the right place,
and the block device is linked to the correct partition.

systemd has been ignoring enable/disable instructions for a while; I
don't know why.  I assume it is new detection code.
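
A small sketch of how the "ceph mon dump" output above could be checked for
monitors that advertise no v2 address. It only understands the bracketed
address form shown in this dump; the function names are hypothetical.

```python
import re

# Matches monmap lines such as:
#   0: [v2:192.168.120.1:3300/0,v1:192.168.120.1:6789/0] mon.temple-h1
MON_LINE = re.compile(r"^\d+:\s+\[(?P<addrs>[^\]]*)\]\s+(?P<name>\S+)$")


def mon_protocols(mon_dump_text):
    """Map each monitor name to the set of messenger protocols it advertises."""
    result = {}
    for line in mon_dump_text.splitlines():
        match = MON_LINE.match(line.strip())
        if not match:
            continue  # header lines such as "epoch 7" are skipped
        addrs = [a for a in match.group("addrs").split(",") if a]
        result[match.group("name")] = {a.split(":", 1)[0] for a in addrs}
    return result


def mons_missing_v2(mon_dump_text):
    """Return the sorted names of monitors with no v2 address in the monmap."""
    protocols = mon_protocols(mon_dump_text)
    return sorted(name for name, protos in protocols.items() if "v2" not in protos)
```

For the dump in this message every monitor lists both v2 and v1, so the
second function would return an empty list.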


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
For example, the test deployment I have uses:

mon_host = 10.5.0.8,10.5.0.5,10.5.0.19


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
To confirm:

tcp        0      0 10.5.0.8:3300           0.0.0.0:*               LISTEN      64045      27128      784/ceph-mon
tcp        0      0 10.5.0.8:6789           0.0.0.0:*               LISTEN      64045      27129      784/ceph-mon

3300 == v2
6789 == v1


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
https://docs.ceph.com/docs/master/rados/configuration/msgr2/#transitioning-from-v1-only-to-v2-plus-v1


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
Something was tickling my brain about upgrades that we dealt with in the
ceph charms a while back.

The MONs can run both v1 and v2 messenger ports; however, if a port is
specified in mon hosts in ceph.conf, it's possible that the v2 port is
disabled, which is why the OSD can't connect back to the cluster.

Please can impacted users provide details of mon hosts from their
ceph.conf files.
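
A rough sketch of the check being asked for: given the value of the mon host
setting, list the entries that carry an explicit port. It assumes IPv4
addresses or hostnames separated by commas or spaces; bracketed or
v1:/v2:-prefixed entries are skipped rather than interpreted, and the
function name is hypothetical.

```python
def entries_with_explicit_port(mon_host_value):
    """Return the mon host entries that end in an explicit ':port'."""
    entries = mon_host_value.replace(",", " ").split()
    flagged = []
    for entry in entries:
        # Address-vector forms such as [v2:...,v1:...] are out of scope here.
        if entry.startswith("[") or entry.startswith(("v1:", "v2:")):
            continue
        _host, sep, port = entry.rpartition(":")
        if sep and port.isdigit():
            flagged.append(entry)
    return flagged
```

An empty result means no entry pins a port, so that particular cause would
not apply to the deployment in question.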


Re: [Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
On Fri, May 22, 2020 at 11:25 AM Jay Ring <1874...@bugs.launchpad.net>
wrote:

> "However it should be possible to complete the do-release-upgrade to the
> point of requesting a reboot - don't - drop to the CLI and get all
> machines to this point and then:
>
>   restart the mons across all three machines
>   restart the mgrs across all three machines
>   restart the osds across all three machines"
>
> Yes, I believe this would work.
>
> However, that's not normally how I would do an upgrade.  Normally, I
> upgrade one machine, make sure it works, and then upgrade the next.  I
> have done it this way since I built the cluster back in Firefly.  When I
> did it this time, it destroyed every OSD on the node that I upgraded.
>

Although not best practice (upgrading a machine at a time, rather than mons,
mgrs and osds in groups), when I tried this earlier today it did actually work
- hence why I think I'm missing something about impacted deployments.

My testing did a fresh deploy of eoan with nautilus and then upgraded to
focal; maybe deployments which have been about for a while have different
on-disk state or characteristics which cause this issue.

I'm endeavouring to get to a point where we understand *why* this happens
in certain situations.

tl;dr I need more details about impacted deployments to be able to debug
this further.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread Jay Ring
"As a side note - even if there is a bug here (and it sounds like there
might be) I would recommend placing the mon and mgr daemons in LXD
containers on top of the machines hosting the osd's"

Yes.  I would strongly suggest doing this also.  That is how Ceph now
recommends it anyway.  However, older installs are not usually set up
this way.

And there is no warning that, if you aren't set up this way,
do-release-upgrade will destroy the node.

I would have been happy to make the change, I just didn't know it was
necessary.

Also, and not to complain, but if you are setting up this way, there is
no reason for the monitor package to be installed outside of the
container - and it should probably not be.

This would suggest to me that ceph-mon should "conflict" with ceph-osd
since they should never be installed in the same context/container/host.
This would force a user to remove either the monitor or the OSDs,
preventing a reboot from destroying the node.

In a perfect world, ceph-osd would notice that it is connecting to an
old monitor and politely disconnect without destroying all its OSDs.

For now, however, I suggest some sort of stop-gap measure that prevents
users from nuking their cluster without warning.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread Jay Ring
"However it should be possible to complete the do-release-upgrade to the
point of requesting a reboot - don't - drop to the CLI and get all
machines to this point and then:

  restart the mons across all three machines
  restart the mgrs across all three machines
  restart the osds across all three machines"

Yes, I believe this would work.

However, that's not normally how I would do an upgrade.  Normally, I
upgrade one machine, make sure it works, and then upgrade the next.  I
have done it this way since I built the cluster back in Firefly.  When I
did it this time, it destroyed every OSD on the node that I upgraded.

This was very unexpected and disappointing, to say the least.

I wanted to warn others and try to prevent it from happening to them.  I
accept some of the blame.  Part of it is on me, part of it is on Ceph,
part of it is on Ubuntu.


Re: [Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
Hi Christian

On Fri, May 22, 2020 at 8:10 AM Christian Huebner <
1874...@bugs.launchpad.net> wrote:

> I filed this bug specifically for hyperconverged environments. Upgrading
> monitor nodes first and then upgrading separate OSD nodes is probably
> doable, but in a hyperconverged environment you cannot separate them.
>

I appreciate that which is why I have endeavoured to reproduce your issue
on a hyperconverged deployment as well.


> I tried do-release-upgrade (a couple of times) without rebooting at the
> end, but found the monitors and OSDs were upgraded and deadlocked at the
> end.
> I tried shutting down all Ceph services first and then do-release-upgrade,
> which started my Ceph services and destroyed my cluster.
> I tried manually upgrading Ceph, which is thwarted by the dependencies,
> it's all or nothing.
>
> I finally accomplished the upgrade by marking all Ceph packages held,
> then digging myself through the dependency jungle to upgrade the
> packages in the right sequence. This was an absolute nightmare and took
> me more than an hour per node. Obviously this is not a production ready way
> to do so, but at least Ceph Octopus is running in 20.04 now.
>
> There are two asks here:
>
> Separate the dependencies so that ceph-mon, ceph-mgr and ceph-osd can be
> installed separately (with the appropriate dependencies), but in a way
> that upgrading ceph-mon does not try to upgrade ceph-osd also. There is
> no good reason why upgrade of ceph-mon should go down and back up the
> dependency tree and try to upgrade ceph-osd too. In fact, I would not
> want monitor packages on my OSD nodes and vice versa in a traditional
> cluster.
>

The various binary packages that the ceph source code produces are
strongly versioned against each other so that you can't end up with an
inappropriate/broken mix of binaries on disk at the same time.

Upgrading the ceph-mon package results in an upgrade of the ceph-osd
package because they both depend on ceph-base with a strong version
dependency of a matching binary version.

This is how we enforce a known good set of bits on disks - and is why the
package maintainer scripts don't do restarts of the daemons on upgrade so
that the restart process can be managed with appropriate upgrade ordering.


> And fix do-release-upgrades, so a Ceph cluster does not get restarted
> when the upgrade procedure ends. I can vouch for the services being
> restarted, I tried it several times, once even with the services shut
> down before do-release-upgrade was started.
>

If you shutdown services the postinst script starts
'ceph{-mon,osd,mgr}.target' so they would get started back up, but targets
and services won't get restarted - I tested and validated and checked the
installed maintainer scripts.

I think you'd have to disable and mask the targets *and* services to ensure
that the target start did not force daemons to start as well but I did not
observe any restart behaviour during my upgrade testing (other than due to
the reboot of the system).


>
> An upgrade procedure that breaks customer data should be fixed.
>

Agreed but the first step is reproduction of the issue so that we can
actually identify what the problem is.

I've followed what I think is the same process that you undertook but I've
not seen the same issue when running mixed version MON, MGR and OSD.

So there is something specific in your deployment that we've not captured
in this bug report yet.

Full details of a) /etc/ceph/ceph.conf and b) pool types and configurations
in use would be helpful.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread Christian Huebner
I filed this bug specifically for hyperconverged environments. Upgrading
monitor nodes first and then upgrading separate OSD nodes is probably
doable, but in a hyperconverged environment you cannot separate them.

I tried do-release-upgrade (a couple of times) without rebooting at the end, 
but found the monitors and OSDs were upgraded and deadlocked at the end.
I tried shutting down all Ceph services first and then do-release-upgrade,
which started my Ceph services and destroyed my cluster.
I tried manually upgrading Ceph, which is thwarted by the dependencies, it's 
all or nothing.

I finally accomplished the upgrade by marking all Ceph packages held,
then digging myself through the dependency jungle to upgrade the
packages in the right sequence. This was an absolute nightmare and took
me more than an hour per node. Obviously this is not a production ready way
to do so, but at least Ceph Octopus is running in 20.04 now.

There are two asks here:

Separate the dependencies so that ceph-mon, ceph-mgr and ceph-osd can be
installed separately (with the appropriate dependencies), but in a way
that upgrading ceph-mon does not try to upgrade ceph-osd also. There is
no good reason why upgrade of ceph-mon should go down and back up the
dependency tree and try to upgrade ceph-osd too. In fact, I would not
want monitor packages on my OSD nodes and vice versa in a traditional
cluster.

And fix do-release-upgrades, so a Ceph cluster does not get restarted
when the upgrade procedure ends. I can vouch for the services being
restarted, I tried it several times, once even with the services shut
down before do-release-upgrade was started.

An upgrade procedure that breaks customer data should be fixed.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
Other ideas - please could impacted users validate networking esp MTU
configuration between machines in their cluster before, during and post
upgrade.

Ceph can be very sensitive to MTU mismatches and just hang when stuff is
not quite right.
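
A minimal sketch of such an MTU comparison, assuming the per-host MTU values
have already been collected by some other means (the host names and the
dictionary shape are hypothetical). The 28-byte figure is the 20-byte IPv4
header without options plus the 8-byte ICMP header.

```python
from collections import Counter

IPV4_ICMP_OVERHEAD = 28  # 20-byte IPv4 header (no options) + 8-byte ICMP header


def largest_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_ICMP_OVERHEAD


def mtu_outliers(mtu_by_host):
    """Return hosts whose MTU differs from the most common MTU in the cluster."""
    if not mtu_by_host:
        return {}
    common_mtu, _count = Counter(mtu_by_host.values()).most_common(1)[0]
    return {host: mtu for host, mtu in mtu_by_host.items() if mtu != common_mtu}
```

The payload size is the kind of value that could be passed to a
don't-fragment ping between cluster machines (for example "ping -M do -s"
with Linux iputils) to confirm that full-size frames actually get through.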


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
Marking 'Incomplete' for now as unable to reproduce.

** Changed in: ceph (Ubuntu)
   Status: In Progress => Incomplete


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-22 Thread James Page
Testing phase 2 - three machine all-in-one deploy.

Deployed using eoan - mon,mgr and 1 x osd on each machine

Deployment seeded with pools and lightweight test data - rbd's in each
pool.

Each machine upgraded in turn (1,2 and then 0) using do-release-upgrade.

ceph versions checked throughout deployment - mixed versions observed.

OSD's booted OK after machine reboots post do-release-upgrade.

During upgrade process:

$ sudo ceph mon dump | grep min_mon_release
dumped monmap epoch 1
min_mon_release 14 (nautilus)

$ sudo ceph versions
{
    "mon": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 1,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 2
    },
    "mgr": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 1,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 2
    },
    "osd": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 1,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 2
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 3,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 6
    }
}

Post upgrade of last machine:

$ sudo ceph mon dump | grep min_mon_release
dumped monmap epoch 2
min_mon_release 15 (octopus)

$ sudo ceph versions
{
    "mon": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 3
    },
    "mds": {},
    "overall": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 9
    }
}
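
A short sketch of how "ceph versions" output like the above could be
summarised to see which daemon types are still running mixed releases. It
assumes version strings of the form shown here, with the release name as
the fifth whitespace-separated token; the function names are hypothetical.

```python
import json


def releases_by_daemon(versions_json):
    """Map daemon type -> {release name: count} from `ceph versions` JSON."""
    data = json.loads(versions_json)
    summary = {}
    for daemon_type, versions in data.items():
        if daemon_type == "overall":
            continue  # the per-type sections already carry the detail
        counts = {}
        for version_string, count in versions.items():
            # e.g. "ceph version 15.2.1 (9fd2...) octopus (stable)"
            tokens = version_string.split()
            release = tokens[4] if len(tokens) >= 5 else version_string
            counts[release] = counts.get(release, 0) + count
        summary[daemon_type] = counts
    return summary


def mixed_daemon_types(versions_json):
    """Return the sorted daemon types that report more than one release."""
    summary = releases_by_daemon(versions_json)
    return sorted(t for t, counts in summary.items() if len(counts) > 1)
```

Applied to the first output above this would list mgr, mon and osd as
mixed; applied to the post-upgrade output it would return an empty list.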


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-21 Thread James Page
As a side note - even if there is a bug here (and it sounds like there
might be) I would recommend placing the mon and mgr daemons in LXD
containers on top of the machines hosting the osd's - this will allow you
to manage them independently from an upgrade process for both ceph
upgrades and ubuntu release upgrades.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-21 Thread James Page
OK further fact discovery from my testing.

I have a 6 machine cluster deployed - three machines host mon,mgr and
three machines host osd.

Upgrading the mon,mgr cluster first followed by the three osd machine
using do-release-upgrade and allowing the tool to reboot the machine at
the end resulted in an upgraded and functioning cluster.

I also validated that the process of upgrading the packages does not
stop or restart the daemons - so they will run on the 14.2.x series from
eoan until either they are restarted OR the do-release-upgrade tool is
permitted to reboot the box.

I appreciate that the reporters of this bug are deploying all daemons on
all three machines which is different to what I have tested - I'll look
at that next.

However it should be possible to complete the do-release-upgrade to the
point of requesting a reboot - don't - drop to the CLI and get all
machines to this point and then:

  restart the mons across all three machines
  restart the mgrs across all three machines
  restart the osds across all three machines

validating health between each step.  I'm going to test this now.

This is in line with the upstream documented process for upgrading a ceph
cluster:

 https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from-mimic-or-nautilus

After this has been completed a reboot of each machine will be required
to complete the release upgrade.
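
The ordering described above (mons, then mgrs, then osds, validating health
between each step) can be sketched as follows. The restart and
cluster_healthy callbacks are hypothetical and supplied by the caller; the
sketch only enforces the order and stops if health does not come back.

```python
RESTART_ORDER = ["mon", "mgr", "osd"]


def rolling_restart(hosts, restart, cluster_healthy):
    """Restart daemons type by type across all hosts, checking health in between.

    restart(daemon_type, host) and cluster_healthy() are caller-supplied
    callbacks. Returns the (daemon_type, host) pairs in the order restarted.
    """
    done = []
    for daemon_type in RESTART_ORDER:
        for host in hosts:
            restart(daemon_type, host)
            done.append((daemon_type, host))
        # Validate health before moving on to the next daemon type.
        if not cluster_healthy():
            raise RuntimeError(
                f"cluster unhealthy after restarting {daemon_type} daemons"
            )
    return done
```

Keeping the callbacks abstract means the same ordering can be dry-run with
stubs before it is wired to whatever actually restarts the units.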


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-21 Thread Jay Ring
You may need more than one node to reproduce the problem.

I had a 3 node system.

I ran do-release-upgrade on node 1.

The OSDs on node 1 connected to the monitor quorum, which had
un-upgraded monitors on hosts 2 & 3.

The upgraded OSDs on node 1 immediately died and could not be revived.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-21 Thread James Page
ceph-mon eoan->focal upgrade testing

ceph-mon@`hostname` systemd units not restarted until reboot step of the
upgrade process on each node; mixed version cluster operated as expected
as each mon was upgraded.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-21 Thread James Page
working on reproduction for debug and triage.

** Changed in: ceph (Ubuntu)
   Status: Confirmed => In Progress

** Changed in: ceph (Ubuntu)
 Assignee: (unassigned) => James Page (james-page)


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-18 Thread Jay Ring
Just writing in to confirm this bug.

It's very serious.

Lost a whole node.  No real warning.  Extremely frustrating.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-18 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: ceph (Ubuntu)
   Status: New => Confirmed


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-05-05 Thread Christian Huebner
I accomplished the upgrade by marking all Ceph packages as held, then
working my way through the dependency jungle to upgrade the packages
afterwards. This is obviously not a production-ready way to do it, but
at least Ceph Octopus is now running on 20.04.

This really needs to be fixed.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-30 Thread Christian Huebner
One note on importance: if someone runs do-release-upgrade on a
converged Ceph node, it will destroy the node. So far I have not seen
any recovery procedure. The only reason I was able to redo the upgrade
quickly is that the node runs on snapshots and can therefore be recovered
after destruction. That cannot be expected of even the smaller-scale
clusters, which are going to be upgraded earliest.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-30 Thread Christian Huebner
I tried to do the upgrade by hand: disable all the services that cannot
be autostarted, then do the upgrade. (By the way, a manpage has been moved
from ceph-deploy to ceph-base, so the apt upgrade fails. do-release-upgrade
uses --force-overwrite for this, but that's not a clean solution. The
workaround is to uninstall ceph-deploy first and then do the upgrade, but
this should be fixed.)

I restarted all services manually in the correct order. mon and mgr work
fine, the OSDs do not.

The result is mostly the same. This time at least all OSDs came up, but
like before they hang in peering. I'll continue research on this. The
OSDs still log that they are waiting for a new monmap.

Although started from the 15.2.1 binary, they show up in the Ceph report as
14.2.8, probably because they have not been converted yet (which should
happen automatically when the OSDs connect to the monitors for the first
time). The next step is tracing the OSDs to see where they hang, but it is
probably still some futex deadlock.
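
A version mismatch like the one described above can be spotted with a small check. The following is only a sketch: it assumes the reported versions are available as a mapping of daemon type to version strings (similar in shape to what `ceph versions` prints), and the function names and sample data are hypothetical, not taken from this cluster.

```python
import re

# Pull the numeric version out of strings such as "ceph version 14.2.8 ...".
VERSION = re.compile(r"ceph version (\d+)\.(\d+)\.(\d+)")

def versions_by_daemon(report):
    """report maps daemon type -> {version string: count}.
    Returns daemon type -> set of (major, minor, patch) tuples."""
    found = {}
    for daemon, entries in report.items():
        parsed = set()
        for key in entries:
            m = VERSION.search(key)
            if m:
                parsed.add(tuple(int(part) for part in m.groups()))
        found[daemon] = parsed
    return found

def lagging_daemons(report, target):
    """Daemon types that still report any version below target."""
    return sorted(
        daemon
        for daemon, versions in versions_by_daemon(report).items()
        if any(v < target for v in versions)
    )

# Hypothetical input mirroring the situation described above:
# mons and mgrs report 15.2.1, OSDs still report 14.2.8.
report = {
    "mon": {"ceph version 15.2.1": 3},
    "mgr": {"ceph version 15.2.1": 3},
    "osd": {"ceph version 14.2.8": 4},
}
print(lagging_daemons(report, (15, 2, 1)))
```

The check only compares what the cluster reports; it says nothing about which binary a daemon was actually started from.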


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-30 Thread Christian Huebner
I just shut down Ceph on all four nodes completely, then ran
do-release-upgrade. Before the upgrade I verified that all Ceph services
were down so I would be able to start them in the correct order.

After the upgrade (without reboot!) I found that all Ceph services on
all Ceph nodes had been started and thus the upgrade of Ceph again
failed.

There needs to be either a warning that do-release-upgrade cannot be
used for Ceph upgrades, or do-release-upgrade needs to be fixed so Ceph
services are not restarted.


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-29 Thread Christian Huebner
I redid the whole upgrade:
* ran do-release-upgrade and finished without rebooting (all 4 nodes)
** so the ceph daemons should not have been restarted
* restarted all ceph mons sequentially
** verified I get octopus as min mon release
* restarted all ceph-mgrs sequentially
** verified that all ceph-mgr daemons are running
* restarted all OSDs
** OSDs show:
"2020-04-29T16:25:52.132-0700 7f43d2788700  1 osd.4 16945 tick checking mon for new map"
* All the logs are full of failed futex requests (connection timed out /
unfinished)
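
To see which OSDs are stuck waiting for a map, the log can be scanned for the message quoted above. This is a minimal sketch, assuming each log line follows the layout of that quoted line (timestamp, thread id, level, `osd.<id>`, epoch, message); the function name `waiting_osds` is made up for illustration.

```python
import re

# Layout assumed from the quoted line:
# <timestamp> <thread> <level> osd.<id> <epoch> <message>
LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+)\s+"
    r"osd\.(?P<osd>\d+)\s+(?P<epoch>\d+)\s+(?P<msg>.*)$"
)

WAITING = "tick checking mon for new map"

def waiting_osds(lines):
    """Return {osd_id: count} of 'checking mon for new map' ticks per OSD."""
    counts = {}
    for line in lines:
        m = LINE.match(line.strip())
        if m and m.group("msg") == WAITING:
            osd = int(m.group("osd"))
            counts[osd] = counts.get(osd, 0) + 1
    return counts

# The single line quoted in the list above.
sample = [
    "2020-04-29T16:25:52.132-0700 7f43d2788700  1 osd.4 16945 tick checking mon for new map",
]
print(waiting_osds(sample))
```

A count that keeps growing for the same OSD while it stays down would match the behavior reported here.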


[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-27 Thread Dan Hill
The same guidelines apply to hyper-converged architectures.

Package updates are not applied until their corresponding service
restarts. Ceph packaging does not automatically restart any services.
This is by design so you can safely install on a hyper-converged host,
and then control the order in which service updates are applied.
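
The ordering described above (all mons, then all mgrs, then OSDs) can be written down as a restart plan. This is only a sketch under assumptions: it presumes the mon and mgr instances are named after the hostname and that OSD units are `ceph-osd@<id>`; the host names and OSD ids are hypothetical. It builds the command strings and does not run anything.

```python
def cluster_restart_plan(hosts):
    """hosts maps hostname -> list of OSD ids on that host.

    Returns (hostname, command) pairs: every mon first, then every mgr,
    then the OSDs, so no OSD is restarted before the mons and mgrs."""
    plan = []
    for daemon in ("mon", "mgr"):
        for host in sorted(hosts):
            plan.append((host, f"systemctl restart ceph-{daemon}@{host}"))
    for host in sorted(hosts):
        for osd_id in sorted(hosts[host]):
            plan.append((host, f"systemctl restart ceph-osd@{osd_id}"))
    return plan

# Hypothetical two-node hyper-converged layout, for illustration only.
for host, command in cluster_restart_plan({"node1": [0, 1], "node2": [2]}):
    print(host, command)
```

In practice one would verify cluster health between phases before moving on to the next daemon type.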


Re: [Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-27 Thread Christian Huebner
This would work if all nodes had a single function only (mon, mgr, osd). I
tried everything to update the monitors first, but due to the dependencies
between the Ceph packages, the monitor and mgr daemons cannot simply be
updated separately from the OSDs. What I don't get, though, is that once all
three monitors and mgrs are updated, the OSDs do not fall back in line after
a reboot.
I will try to force the install of ceph-base, ceph-common and mon/mgr and
then force-upgrade the OSDs to test whether that will work. If not, at
least a workflow should be considered that allows upgrading
hyper-converged clusters, which are becoming more and more important for
edge sites.

On Fri, Apr 24, 2020 at 5:50 PM Dan Hill <1874...@bugs.launchpad.net>
wrote:

> Eoan packages Nautilus, while Focal packages Octopus:
>  ceph | 14.2.2-0ubuntu3  | eoan
>  ceph | 14.2.4-0ubuntu0.19.10.2  | eoan-security
>  ceph | 14.2.8-0ubuntu0.19.10.1  | eoan-updates
>  ceph | 15.2.1-0ubuntu1  | focal
>  ceph | 15.2.1-0ubuntu2  | focal-proposed
>
> When upgrading your cluster, make sure to follow the Octopus upgrade
> guidelines [0]. Specifically, the Mon and Mgr nodes must be upgraded and
> their services restarted before upgrading OSD nodes.
>
> [0] https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from-mimic-or-nautilus
> 
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1874939
>
> Title:
>   ceph-osd can't connect after upgrade to focal
>
> Status in ceph package in Ubuntu:
>   New
>
> Bug description:
>   Upon upgrading a Ceph node with do-release-upgrade from eoan to focal,
>   the OSD doesn't connect after the upgrade. I rolled back the change
>   (VBox snapshot) and tried again, same result. I also tried to hold
>   back the Ceph packages and upgrade after the fact, but again same
>   result.
>
>   Expected behavior: OSD connects to cluster after upgrade.
>
>   Actual behavior: OSD log shows endlessly repeated
>   'tick_without_osd_lock' messages. OSD will stay down from perspective
>   of the cluster.
>
>   Extract from debug log of OSD:
>
>   2020-04-24T16:25:35.811-0700 7fd70e83d700  5 osd.0 16499 heartbeat
> osd_stat(store_statfs(0x4499/0x4000/0x24000, data
> 0x14bb97877/0x1bb66, compress 0x0/0x0/0x0, omap 0x2bbf, meta
> 0x3fffd441), peers [] op hist [])
>   2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499
> check_full_status cur ratio 0.769796, physical ratio 0.769796, new state
> none
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>   2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499
> tick_without_osd_lock
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>   2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499
> tick_without_osd_lock
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>   2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0
> queries
>   2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499
> tick_without_osd_lock
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> start
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters --
> finish
>   2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick
> last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next
> 2020-04-25T15:54:43.601161-0700
>
>   This repeats over and over again.
>
>   strace of the process yields lots of unfinished futex access attempts:
>
>   [pid  2130] futex(0x55b17b8e216c,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054,
> tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY 
>   [pid  2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79
>   [pid  2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL
> 
>   [pid  2190] <... 

[Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

2020-04-24 Thread Dan Hill
Eoan packages Nautilus, while Focal packages Octopus:
 ceph | 14.2.2-0ubuntu3  | eoan
 ceph | 14.2.4-0ubuntu0.19.10.2  | eoan-security   
 ceph | 14.2.8-0ubuntu0.19.10.1  | eoan-updates
 ceph | 15.2.1-0ubuntu1  | focal   
 ceph | 15.2.1-0ubuntu2  | focal-proposed  

When upgrading your cluster, make sure to follow the Octopus upgrade
guidelines [0]. Specifically, the Mon and Mgr nodes must be upgraded and
their services restarted before upgrading OSD nodes.

[0] https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from-mimic-or-nautilus
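
The listing above can be turned into a quick check of whether a given series upgrade also crosses a Ceph release boundary. This is a minimal sketch: the rows are copied from the listing, while the helper names `parse_table` and `crosses_release` are hypothetical and only cover the two releases mentioned here.

```python
# Major version to release name, as stated above (14 = Nautilus, 15 = Octopus).
RELEASES = {14: "nautilus", 15: "octopus"}

# Rows copied from the listing above: package | version | pocket.
TABLE = """\
ceph | 14.2.2-0ubuntu3 | eoan
ceph | 14.2.4-0ubuntu0.19.10.2 | eoan-security
ceph | 14.2.8-0ubuntu0.19.10.1 | eoan-updates
ceph | 15.2.1-0ubuntu1 | focal
ceph | 15.2.1-0ubuntu2 | focal-proposed
"""

def parse_table(text):
    """Return {pocket: (upstream_version, release_name)} for each row."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _pkg, version, pocket = (field.strip() for field in line.split("|"))
        upstream = version.split("-", 1)[0]  # drop the Ubuntu revision
        major = int(upstream.split(".")[0])
        result[pocket] = (upstream, RELEASES.get(major, "unknown"))
    return result

def crosses_release(table, src, dst):
    """True when moving from pocket src to pocket dst changes the Ceph release."""
    return table[src][1] != table[dst][1]

pockets = parse_table(TABLE)
print(pockets["eoan-updates"])
print(crosses_release(pockets, "eoan-updates", "focal"))
```

When the check reports a release change, the upstream upgrade guidelines referenced in [0] apply, not just a routine package update.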
