Re: [ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)
On 21-7-2017 12:45, Fulvio Galeazzi wrote:
> Hallo David, all,
>    sorry for hi-jacking the thread but I am seeing the same issue,
> although on 10.2.7/10.2.9...

Then this is a problem that has nothing to do with my changes to
ceph-disk, since those only went into HEAD and thus end up in Luminous.
Which is fortunate, since I know nothing about systemd and all its
magic.

Not to say anything about the previously reported problem. But that
also went away when ceph-disk was used differently.

--WjW

> [...]
Re: [ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)
Hallo again,
    replying to my own message to provide some more info and ask one
more question. Not sure I mentioned it, but I am on CentOS 7.3.

I tried to insert a sleep in ExecStartPre in
/usr/lib/systemd/system/ceph-osd@.service, but apparently all ceph-osd
instances are started (and retried) at the same time. I finally noticed
that a simple "ceph-disk activate" is sufficient to recover the OSD.

Questions:
= why am I not able to restart the OSD via "systemctl restart
  ceph-osd@##.service", whereas "ceph-disk activate" magically works?
= (off-topic) I also see systemd complaining about OSD## which at some
  point existed on the host but were later reassigned to another one.
  I tried "systemctl stop/disable ceph-osd@##", but those seem to
  reappear at boot... any idea how to fix this?

I could easily take care of the "OSD not activating at boot" with
something simple in rc.local, but I wonder whether someone is aware of
a cleaner solution.

  Thanks!

			Fulvio

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
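The sleep experiment described above can also be done without editing the packaged /usr/lib/systemd/system/ceph-osd@.service, which a package update would overwrite: systemd merges drop-in overrides from /etc/systemd/system. A minimal sketch (the file name and the 10-second delay are illustrative choices, not values from the thread):

```
# /etc/systemd/system/ceph-osd@.service.d/10-delay.conf
[Service]
ExecStartPre=/usr/bin/sleep 10
```

After creating the file, run `systemctl daemon-reload`. Because ExecStartPre is a list-type directive, the drop-in line is appended to the unit's existing ones, so ceph-osd-prestart.sh still runs. Note, though, that this delays every instance by the same amount, so it would not by itself stagger the simultaneous starts observed above.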
[ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)
Hallo David, all,
    sorry for hi-jacking the thread but I am seeing the same issue,
although on 10.2.7/10.2.9...

Note that I am using disks taken from a SAN, so the GUIDs in my case
are those relevant to MPATH.

As per other messages in this thread, I modified:

- /usr/lib/systemd/system/ceph-osd.target
  adding to the [Unit] stanza:
      Before=ceph.target

- /usr/lib/udev/rules.d/60-ceph-by-parttypeuuid.rules
  added at the end of this line:
      ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_TYPE}=="?*", ENV{ID_PART_ENTRY_UUID}=="?*", SYMLINK+="disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}"
  the string:
      , SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}"

df shows (I picked a problematic partition and one which mounted OK):

  /dev/mapper/3600a0980005de737095a56c510cd1   3878873588  142004  3878731584  1%  /var/lib/ceph/osd/cephba1-27
  /dev/mapper/3600a0980005ddf751e2558e2bac7p1  7779931116  202720  7779728396  1%  /var/lib/ceph/tmp/mnt.XL7WkY

Yet, for both, the GUIDs seem correct:

=== /dev/mapper/3600a0980005de737095a56c510cd
  Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: B01E2E0D-9903-4F23-A5FD-FC1C1CB458C3
  Partition size: 7761536991 sectors (3.6 TiB)
  Partition name: 'ceph data'
  Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: E1B3970A-FABF-4AC0-8B6A-F7526989FF36
  Partition size: 4096 sectors (19.5 GiB)
  Partition name: 'ceph journal'

=== /dev/mapper/3600a0980005ddf751e2558e2bac7
  Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: 93A91EBF-A531-4002-A49F-B24F27E962DD
  Partition size: 15564036063 sectors (7.2 TiB)
  Partition name: 'ceph data'
  Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: 2AF9B162-3398-49BD-B6EF-5D284C4A930B
  Partition size: 4096 sectors (19.5 GiB)
  Partition name: 'ceph journal'

I rather suspect some sort of race condition, possibly causing us to
hit some timeout within systemctl...
(please read the end of this message). I am led to think this because
the OSDs which are successfully mounted after each reboot are a
"random" subset of the configured ones (total ~40): also, after two or
three mounts /var/lib/ceph/mnt..., ceph-osd apparently gives up.

The only workaround I found to get things going is re-running
ceph-ansible, but it takes so long...

Have you any idea as to what is going on here? Has anybody seen (and
solved) the same issue?

  Thanks!

			Fulvio

[root@r3srv07.ba1 ~]# cat /var/lib/ceph/tmp/mnt.XL7WkY/whoami
143
[root@r3srv07.ba1 ~]# umount /var/lib/ceph/tmp/mnt.XL7WkY
[root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
● ceph-osd@143.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2017-07-21 11:02:23 CEST; 1h 35min ago
  Process: 40466 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 40217 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 40466 (code=exited, status=1/FAILURE)

Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service: main process exited, code=exited, status=1/FAILURE
Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: Unit ceph-osd@143.service entered failed state.
Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service failed.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service holdoff time over, scheduling restart.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: start request repeated too quickly for ceph-osd@143.service
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Failed to start Ceph object storage daemon.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Unit ceph-osd@143.service entered failed state.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service failed.
[root@r3srv07.ba1 ~]# systemctl restart ceph-osd@143.service
[root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
● ceph-osd@143.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2017-07-21 12:38:11 CEST; 1s ago
  Process: 74658 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 74644 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 74658 (code=exited, status=1/FAILURE)

Jul 21 12:38:11 r3srv07.ba1.box.garr systemd[1]: Unit ceph-osd@143.service entered failed state.
Jul 21 12:38:11 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service failed.
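When chasing this kind of activation race it helps to know exactly which /dev/disk/by-parttypeuuid symlink the udev rule quoted earlier is supposed to create for each partition, so it can be checked after boot. The sketch below is an illustrative helper, not Ceph code: it composes the link name from the two GUIDs gdisk prints above (the partition type GUID and the partition unique GUID). The lowercasing reflects how udev/blkid normally report ID_PART_ENTRY_TYPE and ID_PART_ENTRY_UUID for GPT, which is an assumption worth verifying on your system with `udevadm info`.

```python
# GPT partition-type GUIDs seen in the gdisk output above
# (the multipath variants of Ceph's "ceph data" / "ceph journal").
CEPH_MPATH_TYPES = {
    "4FBD7E29-8AE0-4982-BF9D-5A8D867AF560": "ceph data (mpath)",
    "45B0969E-8AE0-4982-BF9D-5A8D867AF560": "ceph journal (mpath)",
}

def by_parttypeuuid_link(type_guid: str, unique_guid: str) -> str:
    """Symlink name the 60-ceph-by-parttypeuuid.rules rule composes:
    disk/by-parttypeuuid/<ID_PART_ENTRY_TYPE>.<ID_PART_ENTRY_UUID>."""
    return ("/dev/disk/by-parttypeuuid/"
            f"{type_guid.lower()}.{unique_guid.lower()}")

# Data partition of the OSD stuck in /var/lib/ceph/tmp above:
link = by_parttypeuuid_link(
    "4FBD7E29-8AE0-4982-BF9D-5A8D867AF560",  # 'ceph data' type GUID
    "93A91EBF-A531-4002-A49F-B24F27E962DD",  # unique GUID (7.2 TiB disk)
)
print(link)
```

If that path is missing after a reboot while the partition itself is present, the udev rule did not fire for the device, which would point at the rule/multipath interaction rather than at ceph-osd.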