Re: [ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)
On 21-7-2017 12:45, Fulvio Galeazzi wrote:
> Hallo David, all,
>    sorry for hi-jacking the thread but I am seeing the same issue,
> although on 10.2.7/10.2.9...

Then this is a problem that has nothing to do with my changes to
ceph-disk, since those only went into HEAD and thus end up in Luminous.
Which is fortunate, since I know nothing about systemd and all its
magic.

Not to say anything about the previously reported problem. But that
also went away when ceph-disk was used differently.

--WjW

> [...]
Re: [ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)
Hallo again,
    replying to my own message to provide some more info and ask one
more question. Not sure I mentioned it, but I am on CentOS 7.3.

I tried to insert a sleep in ExecStartPre in
/usr/lib/systemd/system/ceph-osd@.service, but apparently all ceph-osd
instances are started (and retried) at the same time. I finally noticed
that a simple "ceph-disk activate" is sufficient to recover the OSD.

Questions:
= why am I not able to restart the OSD via "systemctl restart
  ceph-osd@##.service", whereas "ceph-disk activate" magically works?
= (off-topic) I also see systemd complaining about OSD## which at some
  point existed on the host but were later reassigned to another one.
  I tried "systemctl stop/disable ceph-osd@##", but those seem to
  reappear at boot... any idea how to fix this?

I could easily take care of the "OSD not activating at boot" with
something simple in rc.local, but I wonder whether someone is aware of
a cleaner solution.

  Thanks!

			Fulvio

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
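The sleep experiment described above can also be done without editing the packaged /usr/lib/systemd/system/ceph-osd@.service, which a package update would overwrite: systemd merges drop-in overrides from /etc/systemd/system. A minimal sketch (the file name and the 10-second delay are illustrative choices, not values from the thread):

```
# /etc/systemd/system/ceph-osd@.service.d/10-delay.conf
[Service]
ExecStartPre=/usr/bin/sleep 10
```

After creating the file, run `systemctl daemon-reload`. Because ExecStartPre is a list-type directive, the drop-in line is appended to the unit's existing ones, so ceph-osd-prestart.sh still runs. Note, though, that this delays every instance by the same amount, so it would not by itself stagger the simultaneous starts observed above.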
[ceph-users] Disk activation issue on 10.2.9, too (Re: v11.2.0 Disk activation issue while booting)
Hallo David, all,
    sorry for hi-jacking the thread but I am seeing the same issue,
although on 10.2.7/10.2.9...

Note that I am using disks taken from a SAN, so the GUIDs in my case
are those relevant to MPATH.

As per other messages in this thread, I modified:

- /usr/lib/systemd/system/ceph-osd.target
  adding to the [Unit] stanza:
      Before=ceph.target

- /usr/lib/udev/rules.d/60-ceph-by-parttypeuuid.rules
  added at the end of this line:
      ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_TYPE}=="?*", ENV{ID_PART_ENTRY_UUID}=="?*", SYMLINK+="disk/by-parttypeuuid/$env{ID_PART_ENTRY_TYPE}.$env{ID_PART_ENTRY_UUID}"
  the string:
      , SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}"

df shows (I picked a problematic partition and one which mounted OK):

  /dev/mapper/3600a0980005de737095a56c510cd1   3878873588  142004  3878731584  1%  /var/lib/ceph/osd/cephba1-27
  /dev/mapper/3600a0980005ddf751e2558e2bac7p1  7779931116  202720  7779728396  1%  /var/lib/ceph/tmp/mnt.XL7WkY

Yet, for both, the GUIDs seem correct:

=== /dev/mapper/3600a0980005de737095a56c510cd
  Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: B01E2E0D-9903-4F23-A5FD-FC1C1CB458C3
  Partition size: 7761536991 sectors (3.6 TiB)
  Partition name: 'ceph data'
  Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: E1B3970A-FABF-4AC0-8B6A-F7526989FF36
  Partition size: 4096 sectors (19.5 GiB)
  Partition name: 'ceph journal'

=== /dev/mapper/3600a0980005ddf751e2558e2bac7
  Partition GUID code: 4FBD7E29-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: 93A91EBF-A531-4002-A49F-B24F27E962DD
  Partition size: 15564036063 sectors (7.2 TiB)
  Partition name: 'ceph data'
  Partition GUID code: 45B0969E-8AE0-4982-BF9D-5A8D867AF560 (Unknown)
  Partition unique GUID: 2AF9B162-3398-49BD-B6EF-5D284C4A930B
  Partition size: 4096 sectors (19.5 GiB)
  Partition name: 'ceph journal'

I rather suspect some sort of race condition, possibly causing us to
hit some timeout within systemctl...
(please read the end of this message). I am led to think this because
the OSDs which are successfully mounted after each reboot are a
"random" subset of the configured ones (total ~40): also, after two or
three mounts /var/lib/ceph/mnt..., ceph-osd apparently gives up.

The only workaround I found to get things going is re-running
ceph-ansible, but it takes so long...

Have you any idea as to what is going on here? Has anybody seen (and
solved) the same issue?

  Thanks!

			Fulvio

[root@r3srv07.ba1 ~]# cat /var/lib/ceph/tmp/mnt.XL7WkY/whoami
143
[root@r3srv07.ba1 ~]# umount /var/lib/ceph/tmp/mnt.XL7WkY
[root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
● ceph-osd@143.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Fri 2017-07-21 11:02:23 CEST; 1h 35min ago
  Process: 40466 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 40217 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 40466 (code=exited, status=1/FAILURE)

Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service: main process exited, code=exited, status=1/FAILURE
Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: Unit ceph-osd@143.service entered failed state.
Jul 21 11:02:03 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service failed.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service holdoff time over, scheduling restart.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: start request repeated too quickly for ceph-osd@143.service
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Failed to start Ceph object storage daemon.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: Unit ceph-osd@143.service entered failed state.
Jul 21 11:02:23 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service failed.
[root@r3srv07.ba1 ~]# systemctl restart ceph-osd@143.service
[root@r3srv07.ba1 ~]# systemctl status ceph-osd@143.service
● ceph-osd@143.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Fri 2017-07-21 12:38:11 CEST; 1s ago
  Process: 74658 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
  Process: 74644 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
 Main PID: 74658 (code=exited, status=1/FAILURE)

Jul 21 12:38:11 r3srv07.ba1.box.garr systemd[1]: Unit ceph-osd@143.service entered failed state.
Jul 21 12:38:11 r3srv07.ba1.box.garr systemd[1]: ceph-osd@143.service failed.
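When chasing this kind of activation race it helps to know exactly which /dev/disk/by-parttypeuuid symlink the udev rule quoted earlier is supposed to create for each partition, so it can be checked after boot. The sketch below is an illustrative helper, not Ceph code: it composes the link name from the two GUIDs gdisk prints above (the partition type GUID and the partition unique GUID). The lowercasing reflects how udev/blkid normally report ID_PART_ENTRY_TYPE and ID_PART_ENTRY_UUID for GPT, which is an assumption worth verifying on your system with `udevadm info`.

```python
# GPT partition-type GUIDs seen in the gdisk output above
# (the multipath variants of Ceph's "ceph data" / "ceph journal").
CEPH_MPATH_TYPES = {
    "4FBD7E29-8AE0-4982-BF9D-5A8D867AF560": "ceph data (mpath)",
    "45B0969E-8AE0-4982-BF9D-5A8D867AF560": "ceph journal (mpath)",
}

def by_parttypeuuid_link(type_guid: str, unique_guid: str) -> str:
    """Symlink name the 60-ceph-by-parttypeuuid.rules rule composes:
    disk/by-parttypeuuid/<ID_PART_ENTRY_TYPE>.<ID_PART_ENTRY_UUID>."""
    return ("/dev/disk/by-parttypeuuid/"
            f"{type_guid.lower()}.{unique_guid.lower()}")

# Data partition of the OSD stuck in /var/lib/ceph/tmp above:
link = by_parttypeuuid_link(
    "4FBD7E29-8AE0-4982-BF9D-5A8D867AF560",  # 'ceph data' type GUID
    "93A91EBF-A531-4002-A49F-B24F27E962DD",  # unique GUID (7.2 TiB disk)
)
print(link)
```

If that path is missing after a reboot while the partition itself is present, the udev rule did not fire for the device, which would point at the rule/multipath interaction rather than at ceph-osd.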