Dear all,

On one of our clusters I started the upgrade process from 16.2.7 to 16.2.11.
The mon, mgr and crash daemons were upgraded quickly and without problems, but at
the first attempt to upgrade an OSD container the upgrade stopped, because the
OSD process is not able to start after the upgrade.

Does anyone have any hint on how to unblock the upgrade?
Some details below:
Regards,

Giuseppe

I started the upgrade process with the cephadm command:

“””
[root@naret-monitor01 ~]# ceph orch upgrade start --ceph-version 16.2.11
Initiating upgrade to quay.io/ceph/ceph:v16.2.11
“””
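
(For context, the progress can be followed with the usual orchestrator commands;
shown here only to make clear nothing custom is involved:)

“””
[root@naret-monitor01 ~]# ceph -W cephadm
[root@naret-monitor01 ~]# ceph orch upgrade status
“””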

After a short time:

“””
[root@naret-monitor01 ~]# ceph orch upgrade status
{
    "target_image": 
quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add<mailto:quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add>,
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
        "crash",
        "mon",
        "mgr"
    ],
    "progress": "64/2039 daemons upgraded",
    "message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host 
naret-osd01 failed.",
    "is_paused": true
}
“””

The ceph health command reports:

“””
[root@naret-monitor01 ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1 osds down; Degraded data redundancy: 
2654362/6721382840 objects degraded (0.039%), 14 pgs degraded, 14 pgs 
undersized; Upgrading daemon osd.4 on host naret-osd01 failed.
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon osd.22 on naret-osd01 is in error state
[WRN] OSD_DOWN: 1 osds down
    osd.4 (root=default,host=naret-osd01) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 2654362/6721382840 objects 
degraded (0.039%), 14 pgs degraded, 14 pgs undersized
    pg 28.88 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1373,1337,1508,852,2147483647,483]
    pg 28.528 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1063,793,2147483647,931,338,1777]
    pg 28.594 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1208,891,1651,364,2147483647,53]
    pg 28.8b4 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [521,1273,1238,138,1539,2147483647]
    pg 28.a90 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [237,1665,1836,2147483647,192,1410]
    pg 28.ad6 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [870,466,350,885,1601,2147483647]
    pg 28.b34 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [920,1596,2147483647,115,201,941]
    pg 28.c14 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1389,424,2147483647,268,1646,632]
    pg 28.dba is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1099,561,2147483647,1806,1874,1145]
    pg 28.ee2 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1621,1904,1044,2147483647,1545,722]
    pg 29.163 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1883,2147483647,1509,1697,1187,235]
    pg 29.1c1 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [122,1226,962,1254,1215,2147483647]
    pg 29.254 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1782,1839,1545,412,196,2147483647]
    pg 29.2a1 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [370,2147483647,575,1423,1755,446]
[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host naret-osd01 
failed.
    Upgrade daemon: osd.4: cephadm exited with an error code: 1, 
stderr:Redeploy daemon osd.4 ...
Non-zero exit code 1 from systemctl start ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4
systemctl: stderr Job for ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service failed because a timeout was exceeded.
systemctl: stderr See "systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service" and "journalctl -xe" for details.
Traceback (most recent call last):
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 9248, in <module>
    main()
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 9236, in main
    r = ctx.func(ctx)
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 1990, in _default_image
    return func(ctx)
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 5041, in command_deploy
    ports=daemon_ports)
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 2952, in deploy_daemon
    c, osd_fsid=osd_fsid, ports=ports)
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 3197, in deploy_daemon_units
    call_throws(ctx, ['systemctl', 'start', unit_name])
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 1657, in call_throws
    raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: systemctl start ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4: Job for ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service failed because a timeout was exceeded.
See "systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service" and "journalctl -xe" for details.
“””
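
If it helps with the diagnosis, I can gather more logs for osd.4 directly on the
host; I assume the standard cephadm/systemd tooling below is the right way to do
that (only the fsid is specific to our cluster):

“””
[root@naret-osd01 ~]# cephadm ls | grep -A 5 '"osd.4"'
[root@naret-osd01 ~]# cephadm logs --name osd.4 -- -n 200
[root@naret-osd01 ~]# journalctl -u ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service --since "-1h"
“””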

On the OSD server we have:

“””
[root@naret-osd01 ~]# uname -a
Linux naret-osd01 4.18.0-425.10.1.el8_7.x86_64 #1 SMP Wed Dec 14 16:00:01 EST 
2022 x86_64 x86_64 x86_64 GNU/Linux
[root@naret-osd01 ~]# podman -v
podman version 4.2.0
[root@naret-osd01 ~]# ceph -v
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
[root@naret-osd01 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.7 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.7 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL=https://www.redhat.com/
DOCUMENTATION_URL=https://access.redhat.com/documentation/red_hat_enterprise_linux/8/
BUG_REPORT_URL=https://bugzilla.redhat.com/

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"
“””
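
To rule out an image-pull problem on this host, I assume I could also verify the
target image manually (just an idea, not tried yet):

“””
[root@naret-osd01 ~]# podman images | grep ceph
[root@naret-osd01 ~]# podman pull quay.io/ceph/ceph:v16.2.11
“””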

Systemctl says:

“””
systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service
…
● ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service - Ceph osd.4 for 63334166-d991-11eb-99de-40a6b72108d0
   Loaded: loaded (/etc/systemd/system/ceph-63334166-d991-11eb-99de-40a6b72108d0@.service; enabled; vendor preset: disabled)
   Active: failed (Result: timeout) since Mon 2023-03-27 15:34:29 CEST; 6min ago
  Process: 730621 ExecStopPost=/bin/rm -f /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-cid (code=exited, status=0/SUCCESS)
  Process: 730209 ExecStopPost=/bin/bash /var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/osd.4/unit.poststop (code=exited, status=0/SUCCESS)
  Process: 710355 ExecStartPre=/bin/rm -f /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-cid (code=exited, status=0/SUCCESS)
Main PID: 23025 (code=exited, status=0/SUCCESS)
    Tasks: 62 (limit: 1647878)
   Memory: 961.8M
   CGroup: 
/system.slice/system-ceph\x2d63334166\x2dd991\x2d11eb\x2d99de\x2d40a6b72108d0.slice/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service
           
├─libpod-payload-b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e
           │ └─754976 /usr/bin/ceph-osd -n osd.4 -f --setuser ceph --setgroup 
ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug
           └─runtime
             └─754965 /usr/bin/conmon --api-version 1 -c 
b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e -u 
b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e -r 
/usr/bin/runc -b 
/var/lib/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata
 -p 
/run/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata/pidfile
 -n ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4 --exit-dir 
/run/libpod/exits --full-attach -l journald --log-level warning --runtime-arg 
--log-format=json --runtime-arg --log 
--runtime-arg=/run/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata/oci-log
 --conmon-pidfile 
/run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid
 --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg 
/var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg 
/run/containers/storage --exit-command-arg --log-level --exit-command-arg 
warning --exit-command-arg --cgroup-manager --exit-command-arg systemd 
--exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg 
--network-config-dir --exit-command-arg --exit-command-arg --network-backend 
--exit-command-arg cni --exit-command-arg --volumepath --exit-command-arg 
/var/lib/containers/storage/volumes --exit-command-arg --runtime 
--exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg 
overlay --exit-command-arg --storage-opt --exit-command-arg 
overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend 
--exit-command-arg file --exit-command-arg container --exit-command-arg cleanup 
--exit-command-arg --rm --exit-command-arg 
b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e

Mar 27 15:36:56 naret-osd01 
ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 
2023-03-27T13:36:56.886+0000 7f52e6ae1700  1 osd.4 pg_epoch: 821628 
pg[28.dbas2( v 821618'4657799 (819107'4647770,821618'4657799] 
local-lis/les=749842/749843 n=239290 ec=130297/130290 lis/c=821623/749842 
les/c/f=821624/749843/0 sis=821628 pruub=7.751406670s) 
[1099,561,4,1806,1874,1145]p1099(0) r=2 lpr=821628 pi=[749842,821628)/1 
crt=821618'4657799 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.039081573s@ 
mbc={} ps=[4~6]] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 
ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 
2023-03-27T13:36:56.886+0000 7f52e8ae5700  1 osd.4 pg_epoch: 821628 
pg[29.163s1( v 821572'139334 (776804'129273,821572'139334] 
local-lis/les=749851/749852 n=65683 ec=130801/130801 lis/c=821623/749851 
les/c/f=821624/749852/0 sis=821628 pruub=8.023463249s) 
[1883,4,1509,1697,1187,235]p1883(0) r=1 lpr=821628 pi=[749851,821628)/1 
crt=821572'139334 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.311203003s@ 
mbc={}] start_peering_interval up [1883,4,1509,1697,1187,235] -> 
[1883,4,1509,1697,1187,235], acting [1883,2147483647,1509,1697,1187,235] -> 
[1883,4,1509,1697,1187,235], acting_primary 1883(0) -> 1883, up_primary 1883(0) 
-> 1883, role -1 -> 1, features acting 4540138297136906239 upacting 
4540138297136906239
Mar 27 15:36:56 naret-osd01 
ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 
2023-03-27T13:36:56.886+0000 7f52e72e2700  1 osd.4 pg_epoch: 821628 
pg[29.2a1s1( v 821500'140649 (776804'130601,821500'140649] 
local-lis/les=749849/749850 n=65848 ec=130801/130801 lis/c=821623/749849 
les/c/f=821624/749850/0 sis=821628 pruub=7.845988274s) 
[370,4,575,1423,1755,446]p370(0) r=1 lpr=821628 pi=[749849,821628)/1 
crt=821500'140649 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.133728981s@ 
mbc={}] start_peering_interval up [370,4,575,1423,1755,446] -> 
[370,4,575,1423,1755,446], acting [370,2147483647,575,1423,1755,446] -> 
[370,4,575,1423,1755,446], acting_primary 370(0) -> 370, up_primary 370(0) -> 
370, role -1 -> 1, features acting 4540138297136906239 upacting 
4540138297136906239
Mar 27 15:36:56 naret-osd01 
ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 
2023-03-27T13:36:56.887+0000 7f52e8ae5700  1 osd.4 pg_epoch: 821628 
pg[29.163s1( v 821572'139334 (776804'129273,821572'139334] 
local-lis/les=749851/749852 n=65683 ec=130801/130801 lis/c=821623/749851 
les/c/f=821624/749852/0 sis=821628 pruub=8.023443222s) 
[1883,4,1509,1697,1187,235]p1883(0) r=1 lpr=821628 pi=[749851,821628)/1 
crt=821572'139334 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.311203003s@ 
mbc={}] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 
ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 
2023-03-27T13:36:56.887+0000 7f52e72e2700  1 osd.4 pg_epoch: 821628 
pg[29.2a1s1( v 821500'140649 (776804'130601,821500'140649] 
local-lis/les=749849/749850 n=65848 ec=130801/130801 lis/c=821623/749849 
les/c/f=821624/749850/0 sis=821628 pruub=7.845966339s) 
[370,4,575,1423,1755,446]p370(0) r=1 lpr=821628 pi=[749849,821628)/1 
crt=821500'140649 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.133728981s@ 
mbc={}] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 
ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 
2023-03-27T13:36:56.887+0000 7f52e72e2700  1 osd.4 pg_epoch: 821628 
pg[28.8b4s5( v 821618'2906095 (817032'2896088,821618'2906095] 
local-lis/les=749842/749843 n=239377 ec=130295/130290 lis/c=821623/749842 
les/c/f=821624/749843/0 sis=821628 pruub=8.158309937s) 
[521,1273,1238,138,1539,4]p521(0) r=5 lpr=821628 pi=[749842,821628)/1 
crt=821618'2906095 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.446221352s@ 
mbc={} ps=[4~6]] start_peering_interval up [521,1273,1238,138,1539,4] -> 
[521,1273,1238,138,1539,4], acting [521,1273,1238,138,1539,2147483647] -> 
[521,1273,1238,138,1539,4], acting_primary 521(0) -> 521, up_primary 521(0) -> 
521, role -1 -> 5, features acting 4540138297136906239 upacting 
4540138297136906239
Mar 27 15:36:56 naret-osd01 
ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 
2023-03-27T13:36:56.887+0000 7f52e72e2700  1 osd.4 pg_epoch: 821628 
pg[28.8b4s5( v 821618'2906095 (817032'2896088,821618'2906095] 
local-lis/les=749842/749843 n=239377 ec=130295/130290 lis/c=821623/749842 
les/c/f=821624/749843/0 sis=821628 pruub=8.158291817s) 
[521,1273,1238,138,1539,4]p521(0) r=5 lpr=821628 pi=[749842,821628)/1 
crt=821618'2906095 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.446221352s@ 
mbc={} ps=[4~6]] state<Start>: transitioning to Stray
Mar 27 15:39:36 naret-osd01 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service: Start request repeated too quickly.
Mar 27 15:39:36 naret-osd01 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service: Failed with result 'timeout'.
Mar 27 15:39:36 naret-osd01 systemd[1]: Failed to start Ceph osd.4 for 
63334166-d991-11eb-99de-40a6b72108d0.
“””
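
One idea I was considering, but have not tried yet, is to give the unit more time
to start via a systemd drop-in and then let cephadm retry the daemon and resume
the upgrade. A rough sketch (the 600s value is just a guess on my side):

“””
# on naret-osd01: raise the start timeout for this cluster's daemon units on this host
[root@naret-osd01 ~]# mkdir -p /etc/systemd/system/ceph-63334166-d991-11eb-99de-40a6b72108d0@.service.d
[root@naret-osd01 ~]# cat > /etc/systemd/system/ceph-63334166-d991-11eb-99de-40a6b72108d0@.service.d/timeout.conf <<EOF
[Service]
TimeoutStartSec=600
EOF
[root@naret-osd01 ~]# systemctl daemon-reload
[root@naret-osd01 ~]# systemctl reset-failed ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service

# from a monitor: redeploy the failed daemon and resume the upgrade
[root@naret-monitor01 ~]# ceph orch daemon redeploy osd.4
[root@naret-monitor01 ~]# ceph orch upgrade resume
“””

Would that be a safe way to unblock things, or is there a better approach?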

