Please find the below output.

cn1.chn8be1c1.cdn ~# ceph osd metadata 0
{
    "id": 0,
    "arch": "x86_64",
    "back_addr": "[v2:10.50.12.41:6883/12650,v1:10.50.12.41:6887/12650]",
    "back_iface": "dss-private",
    "bluefs": "1",
    "bluefs_single_shared_device": "1",
    "bluestore_bdev_access_mode": "blk",
    "bluestore_bdev_block_size": "4096",
    "bluestore_bdev_dev_node": "/dev/dm-23",
    "bluestore_bdev_driver": "KernelDevice",
    "bluestore_bdev_partition_path": "/dev/dm-23",
    "bluestore_bdev_rotational": "1",
    "bluestore_bdev_size": "4000749453312",
    "bluestore_bdev_support_discard": "0",
    "bluestore_bdev_type": "hdd",
    "ceph_release": "nautilus",
    "ceph_version": "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)",
    "ceph_version_short": "14.2.2",
    "cpu": "Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz",
    "default_device_class": "hdd",
    "device_ids": "sdb=HP_LOGICAL_VOLUME_PDNLL0CRH4208F",
    "devices": "sdb",
    "distro": "centos",
    "distro_description": "CentOS Linux 7 (Core)",
    "distro_version": "7",
    "front_addr": "[v2:10.50.11.41:6882/12650,v1:10.50.11.41:6887/12650]",
    "front_iface": "dss-client",
    "hb_back_addr": "[v2:10.50.12.41:6888/12650,v1:10.50.12.41:6890/12650]",
    "hb_front_addr": "[v2:10.50.11.41:6889/12650,v1:10.50.11.41:6890/12650]",
    "hostname": "cn1.chn8be1c1.cdn",
    "journal_rotational": "1",
    "kernel_description": "#1 SMP Thu Nov 8 23:39:32 UTC 2018",
    "kernel_version": "3.10.0-957.el7.x86_64",
    "mem_swap_kb": "0",
    "mem_total_kb": "272036636",
    "network_numa_unknown_ifaces": "dss-client,dss-private",
    "objectstore_numa_unknown_devices": "sdb",
    "os": "Linux",
    "osd_data": "/var/lib/ceph/osd/ceph-0",
    "osd_objectstore": "bluestore",
    "rotational": "1"
}

cn1.chn8be1c1.cdn ~# cat /var/lib/ceph/osd/ceph-0/fsid
a1ea2ea3-984d-4c91-86cf-29f452f5a952
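A quick way to cross-check this on every OSD of a node is to compare the
on-disk fsid file with the uuid the monitors recorded in the osdmap. A
minimal sketch, assuming the default /var/lib/ceph/osd/ceph-<id> layout and
that the uuid is the last field of each "osd.N ..." line in 'ceph osd dump'
(both are assumptions, not confirmed in this thread):

    # compare each local OSD's on-disk fsid with the uuid in the osdmap
    for dir in /var/lib/ceph/osd/ceph-*; do
        id=${dir##*-}
        disk_fsid=$(cat "$dir/fsid")
        map_uuid=$(ceph osd dump | awk -v o="osd.$id" '$1 == o { print $NF }')
        [ "$disk_fsid" != "$map_uuid" ] && \
            echo "osd.$id mismatch: disk=$disk_fsid osdmap=$map_uuid"
    done

Any osd printed here would be refused at boot with the same "clashes with
existing osd: different fsid" message quoted below.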
On Sun, Nov 10, 2019 at 12:54 PM huang jun <hjwsm1...@gmail.com> wrote:
> The same problem:
> 2019-11-10 05:26:33.215 7fbfafeef700  7 mon.cn1@0(leader).osd e1819
> preprocess_boot from osd.0 v2:10.50.11.41:6814/2022032 clashes with
> existing osd: different fsid (ours:
> ccfdbd54-fcd2-467f-ab7b-c152b7e422fb ; theirs:
> a1ea2ea3-984d-4c91-86cf-29f452f5a952)
> maybe the osd uuid is wrong.
> what is the output of 'ceph osd metadata 0' and 'cat
> /var/lib/ceph/osd/ceph-0/fsid'?
>
> nokia ceph <nokiacephus...@gmail.com> wrote on Sun, Nov 10, 2019 at 2:47 PM:
> >
> > Hi,
> >
> > Yes, the cluster is still unrecovered. We are not able to bring even
> > osd.0 up yet.
> >
> > osd logs: https://pastebin.com/4WrpgrH5
> >
> > Mon logs: https://drive.google.com/open?id=1_HqK2d52Cgaps203WnZ0mCfvxdcjcBoE
> >
> > # ceph daemon /var/run/ceph/ceph-mon.cn1.asok config show | grep debug_mon
> >     "debug_mon": "20/20",
> >     "debug_monc": "0/0",
> >
> > # date; systemctl restart ceph-osd@0.service; date
> > Sun Nov 10 05:25:54 UTC 2019
> > Sun Nov 10 05:25:55 UTC 2019
> >
> > cn1.chn8be1c1.cdn ~# systemctl status ceph-osd@0.service
> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
> >            └─90-ExecStart_NUMA.conf
> >    Active: active (running) since Sun 2019-11-10 05:25:55 UTC; 8s ago
> >   Process: 2022026 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
> >  Main PID: 2022032 (ceph-osd)
> >    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
> >            └─2022032 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
> >
> > Nov 10 05:25:55 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object storage daemon osd.0...
> > Nov 10 05:25:55 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object storage daemon osd.0.
> > Nov 10 05:26:03 cn1.chn8be1c1.cdn numactl[2022032]: 2019-11-10 05:26:03.131 7fbef7bb5d80 -1 osd.0 1795 log_to_monitors {default=true}
> > Nov 10 05:26:03 cn1.chn8be1c1.cdn numactl[2022032]: 2019-11-10 05:26:03.372 7fbeea1c0700 -1 osd.0 1795 set_numa_affinity unable to identify public interface 'dss-client' numa node: (2) No such file or directory
> > Hint: Some lines were ellipsized, use -l to show in full.
> >
> > # ceph tell mon.cn1 injectargs '--debug-mon 1/5'
> > injectargs:
> >
> > cn1.chn8be1c1.cdn ~# ceph daemon /var/run/ceph/ceph-mon.cn1.asok config show | grep debug_mon
> >     "debug_mon": "1/5",
> >     "debug_monc": "0/0",
> >
> > On Sun, Nov 10, 2019 at 11:05 AM huang jun <hjwsm1...@gmail.com> wrote:
> >>
> >> Good, please send me the mon and osd.0 logs.
> >> Is the cluster still unrecovered?
> >>
> >> nokia ceph <nokiacephus...@gmail.com> wrote on Sun, Nov 10, 2019 at 1:24 PM:
> >> >
> >> > Hi Huang,
> >> >
> >> > Yes, the node 10.50.11.45 is the fifth node, the one that was replaced.
> >> > Yes, I have set debug_mon to 20 and it is still running with that value.
> >> > If you want, I will send you the mon logs once again after restarting osd.0.
> >> >
> >> > On Sun, Nov 10, 2019 at 10:17 AM huang jun <hjwsm1...@gmail.com> wrote:
> >> >>
> >> >> The mon log shows that all the mismatched-fsid osds are from node 10.50.11.45;
> >> >> maybe that is the fifth node?
> >> >> BTW, I didn't find the osd.0 boot message in ceph-mon.log.
> >> >> Did you set debug_mon=20 first and then restart the osd.0 process, and make
> >> >> sure that osd.0 was restarted?
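For reference, the capture sequence used above (raise debug_mon, restart the
osd, then lower it again) can be run as one block. A minimal sketch, assuming
mon.cn1 is the leader and the default log path /var/log/ceph/ceph-mon.cn1.log,
neither of which is guaranteed:

    ceph tell mon.cn1 injectargs '--debug-mon 20/20'
    date; systemctl restart ceph-osd@0.service; date
    sleep 30    # give osd.0 time to send its boot message to the leader
    grep 'preprocess_boot from osd.0' /var/log/ceph/ceph-mon.cn1.log | tail -n 5
    ceph tell mon.cn1 injectargs '--debug-mon 1/5'    # restore the usual level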
> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Sun, Nov 10, 2019 at 12:31 PM:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Please find the ceph osd tree output in the pastebin https://pastebin.com/Gn93rE6w
> >> >> >
> >> >> > On Fri, Nov 8, 2019 at 7:58 PM huang jun <hjwsm1...@gmail.com> wrote:
> >> >> >>
> >> >> >> can you post your 'ceph osd tree' in pastebin?
> >> >> >> do you mean the osds reporting the fsid mismatch are from the old removed node?
> >> >> >>
> >> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 10:21 PM:
> >> >> >> >
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > The fifth node in the cluster was affected by a hardware failure, and
> >> >> >> > hence the node was replaced in the ceph cluster. But we were not able
> >> >> >> > to replace it properly, so we uninstalled ceph on all the nodes,
> >> >> >> > deleted the pools, zapped the osd's and recreated them as a new ceph
> >> >> >> > cluster. But we are not sure where the references to the old (failed)
> >> >> >> > fifth node's osd fsid's are still coming from. Is this creating the
> >> >> >> > problem? Because I am seeing that the OSD's on the fifth node are
> >> >> >> > showing up in the ceph status, whereas the other nodes' osd's are
> >> >> >> > showing down.
> >> >> >> >
> >> >> >> > On Fri, Nov 8, 2019 at 7:25 PM huang jun <hjwsm1...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> I saw many lines like this:
> >> >> >> >>
> >> >> >> >> mon.cn1@0(leader).osd e1805 preprocess_boot from osd.112
> >> >> >> >> v2:10.50.11.45:6822/158344 clashes with existing osd: different fsid
> >> >> >> >> (ours: 85908622-31bd-4728-9be3-f1f6ca44ed98 ; theirs:
> >> >> >> >> 127fdc44-c17e-42ee-bcd4-d577c0ef4479)
> >> >> >> >>
> >> >> >> >> the osd boot will be ignored if the fsid mismatches.
> >> >> >> >> what did you do before this happened?
> >> >> >> >>
> >> >> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 8:29 PM:
> >> >> >> >> >
> >> >> >> >> > Hi,
> >> >> >> >> >
> >> >> >> >> > Please find osd.0, which was restarted after debug_mon was increased to 20.
> >> >> >> >> >
> >> >> >> >> > cn1.chn8be1c1.cdn ~# date; systemctl restart ceph-osd@0.service
> >> >> >> >> > Fri Nov 8 12:25:05 UTC 2019
> >> >> >> >> >
> >> >> >> >> > cn1.chn8be1c1.cdn ~# systemctl status ceph-osd@0.service -l
> >> >> >> >> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
> >> >> >> >> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
> >> >> >> >> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
> >> >> >> >> >            └─90-ExecStart_NUMA.conf
> >> >> >> >> >    Active: active (running) since Fri 2019-11-08 12:25:06 UTC; 29s ago
> >> >> >> >> >   Process: 298505 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
> >> >> >> >> >  Main PID: 298512 (ceph-osd)
> >> >> >> >> >    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
> >> >> >> >> >            └─298512 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
> >> >> >> >> >
> >> >> >> >> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object storage daemon osd.0...
> >> >> >> >> > Nov 08 12:25:06 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object storage daemon osd.0.
> >> >> >> >> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08 12:25:11.538 7f8515323d80 -1 osd.0 1795 log_to_monitors {default=true}
> >> >> >> >> > Nov 08 12:25:11 cn1.chn8be1c1.cdn numactl[298512]: 2019-11-08 12:25:11.689 7f850792e700 -1 osd.0 1795 set_numa_affinity unable to identify public interface 'dss-client' numa node: (2) No such file or directory
> >> >> >> >> >
> >> >> >> >> > On Fri, Nov 8, 2019 at 4:48 PM huang jun <hjwsm1...@gmail.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> Is osd.0 still in the down state after the restart? If so, maybe
> >> >> >> >> >> the problem is in the mon.
> >> >> >> >> >> Can you set the leader mon's debug_mon=20, restart one of the
> >> >> >> >> >> down-state osds, and then attach the mon log file?
> >> >> >> >> >>
> >> >> >> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 6:38 PM:
> >> >> >> >> >> >
> >> >> >> >> >> > Hi,
> >> >> >> >> >> >
> >> >> >> >> >> > Below is the status of the OSD after restart.
> >> >> >> >> >> >
> >> >> >> >> >> > # systemctl status ceph-osd@0.service
> >> >> >> >> >> > ● ceph-osd@0.service - Ceph object storage daemon osd.0
> >> >> >> >> >> >    Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
> >> >> >> >> >> >   Drop-In: /etc/systemd/system/ceph-osd@.service.d
> >> >> >> >> >> >            └─90-ExecStart_NUMA.conf
> >> >> >> >> >> >    Active: active (running) since Fri 2019-11-08 10:32:51 UTC; 1min 1s ago
> >> >> >> >> >> >   Process: 219213 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
> >> >> >> >> >> >  Main PID: 219218 (ceph-osd)
> >> >> >> >> >> >    CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@0.service
> >> >> >> >> >> >            └─219218 /usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
> >> >> >> >> >> >
> >> >> >> >> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Starting Ceph object storage daemon osd.0...
> >> >> >> >> >> > Nov 08 10:32:51 cn1.chn8be1c1.cdn systemd[1]: Started Ceph object storage daemon osd.0.
> >> >> >> >> >> > Nov 08 10:33:03 cn1.chn8be1c1.cdn numactl[219218]: 2019-11-08 10:33:03.785 7f9adeed4d80 -1 osd.0 1795 log_to_monitors {default=true}
> >> >> >> >> >> > Nov 08 10:33:05 cn1.chn8be1c1.cdn numactl[219218]: 2019-11-08 10:33:05.474 7f9ad14df700 -1 osd.0 1795 set_numa_affinity unable to identify public interface 'dss-client' numa n...r directory
> >> >> >> >> >> > Hint: Some lines were ellipsized, use -l to show in full.
> >> >> >> >> >> >
> >> >> >> >> >> > And I have attached the logs captured while this restart was
> >> >> >> >> >> > initiated in a file in this mail.
> >> >> >> >> >> >
> >> >> >> >> >> > On Fri, Nov 8, 2019 at 3:59 PM huang jun <hjwsm1...@gmail.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> try to restart some of the down osds in 'ceph osd tree', and
> >> >> >> >> >> >> see what happens?
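One possible way out of a stale-uuid clash like the osd.112 example above,
sketched here only as an option and not something recommended in this thread:
make the monitors forget the old registration and re-register the id with the
fsid that is actually on disk. This assumes the OSD's data is intact, that its
keyring is at the default path, and that losing the old CRUSH entry for that
id is acceptable; try it on a single osd first.

    systemctl stop ceph-osd@112.service
    # drop the mon's stale record for the id (uuid, auth key, crush entry)
    ceph osd purge 112 --yes-i-really-mean-it
    # re-register the same id with the fsid the on-disk OSD actually has
    ceph osd new $(cat /var/lib/ceph/osd/ceph-112/fsid) 112
    # re-add the daemon's auth key from its local keyring
    ceph auth add osd.112 osd 'allow *' mon 'allow profile osd' \
        -i /var/lib/ceph/osd/ceph-112/keyring
    systemctl start ceph-osd@112.service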
> >> >> >> >> >> >> nokia ceph <nokiacephus...@gmail.com> wrote on Fri, Nov 8, 2019 at 6:24 PM:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Adding my official mail id
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > ---------- Forwarded message ---------
> >> >> >> >> >> >> > From: nokia ceph <nokiacephus...@gmail.com>
> >> >> >> >> >> >> > Date: Fri, Nov 8, 2019 at 3:57 PM
> >> >> >> >> >> >> > Subject: OSD's not coming up in Nautilus
> >> >> >> >> >> >> > To: Ceph Users <ceph-users@lists.ceph.com>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Hi Team,
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > There is a 5 node ceph cluster which we have upgraded from
> >> >> >> >> >> >> > Luminous to Nautilus, and everything was going well until
> >> >> >> >> >> >> > yesterday, when we noticed that the ceph osd's are marked down
> >> >> >> >> >> >> > and not recognized by the monitors as running, even though the
> >> >> >> >> >> >> > osd processes are running.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > We noticed that the admin keyring and the mon keyring are
> >> >> >> >> >> >> > missing on the nodes, so we recreated them with the below commands.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds allow
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > In the logs we find the below lines.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 2019-11-08 09:01:50.525 7ff61722b700  0 log_channel(audit) log [DBG] : from='client.? 10.50.11.44:0/2398064782' entity='client.admin' cmd=[{"prefix": "df", "format": "json"}]: dispatch
> >> >> >> >> >> >> > 2019-11-08 09:02:37.686 7ff61722b700  0 log_channel(cluster) log [INF] : mon.cn1 calling monitor election
> >> >> >> >> >> >> > 2019-11-08 09:02:37.686 7ff61722b700  1 mon.cn1@0(electing).elector(31157) init, last seen epoch 31157, mid-election, bumping
> >> >> >> >> >> >> > 2019-11-08 09:02:37.688 7ff61722b700 -1 mon.cn1@0(electing) e3 failed to get devid for : udev_device_new_from_subsystem_sysname failed on ''
> >> >> >> >> >> >> > 2019-11-08 09:02:37.770 7ff61722b700  0 log_channel(cluster) log [INF] : mon.cn1 is new leader, mons cn1,cn2,cn3,cn4,cn5 in quorum (ranks 0,1,2,3,4)
> >> >> >> >> >> >> > 2019-11-08 09:02:37.857 7ff613a24700  0 log_channel(cluster) log [DBG] : monmap e3: 5 mons at {cn1=[v2:10.50.11.41:3300/0,v1:10.50.11.41:6789/0],cn2=[v2:10.50.11.42:3300/0,v1:10.50.11.42:6789/0],cn3=[v2:10.50.11.43:3300/0,v1:10.50.11.43:6789/0],cn4=[v2:10.50.11.44:3300/0,v1:10.50.11.44:6789/0],cn5=[v2:10.50.11.45:3300/0,v1:10.50.11.45:6789/0]}
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > # ceph mon dump
> >> >> >> >> >> >> > dumped monmap epoch 3
> >> >> >> >> >> >> > epoch 3
> >> >> >> >> >> >> > fsid 9dbf207a-561c-48ba-892d-3e79b86be12f
> >> >> >> >> >> >> > last_changed 2019-09-03 07:53:39.031174
> >> >> >> >> >> >> > created 2019-08-23 18:30:55.970279
> >> >> >> >> >> >> > min_mon_release 14 (nautilus)
> >> >> >> >> >> >> > 0: [v2:10.50.11.41:3300/0,v1:10.50.11.41:6789/0] mon.cn1
> >> >> >> >> >> >> > 1: [v2:10.50.11.42:3300/0,v1:10.50.11.42:6789/0] mon.cn2
> >> >> >> >> >> >> > 2: [v2:10.50.11.43:3300/0,v1:10.50.11.43:6789/0] mon.cn3
> >> >> >> >> >> >> > 3: [v2:10.50.11.44:3300/0,v1:10.50.11.44:6789/0] mon.cn4
> >> >> >> >> >> >> > 4: [v2:10.50.11.45:3300/0,v1:10.50.11.45:6789/0] mon.cn5
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > # ceph -s
> >> >> >> >> >> >> >   cluster:
> >> >> >> >> >> >> >     id:     9dbf207a-561c-48ba-892d-3e79b86be12f
> >> >> >> >> >> >> >     health: HEALTH_WARN
> >> >> >> >> >> >> >             85 osds down
> >> >> >> >> >> >> >             3 hosts (72 osds) down
> >> >> >> >> >> >> >             1 nearfull osd(s)
> >> >> >> >> >> >> >             1 pool(s) nearfull
> >> >> >> >> >> >> >             Reduced data availability: 2048 pgs inactive
> >> >> >> >> >> >> >             too few PGs per OSD (17 < min 30)
> >> >> >> >> >> >> >             1/5 mons down, quorum cn2,cn3,cn4,cn5
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >   services:
> >> >> >> >> >> >> >     mon: 5 daemons, quorum cn2,cn3,cn4,cn5 (age 57s), out of quorum: cn1
> >> >> >> >> >> >> >     mgr: cn1(active, since 73m), standbys: cn2, cn3, cn4, cn5
> >> >> >> >> >> >> >     osd: 120 osds: 35 up, 120 in; 909 remapped pgs
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >   data:
> >> >> >> >> >> >> >     pools:   1 pools, 2048 pgs
> >> >> >> >> >> >> >     objects: 0 objects, 0 B
> >> >> >> >> >> >> >     usage:   176 TiB used, 260 TiB / 437 TiB avail
> >> >> >> >> >> >> >     pgs:     100.000% pgs unknown
> >> >> >> >> >> >> >              2048 unknown
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > The osd logs show the below lines.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80  0 _get_class not permitted to load kvs
> >> >> >> >> >> >> > 2019-11-08 09:05:33.332 7fd1a36eed80  0 _get_class not permitted to load lua
> >> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 _get_class not permitted to load sdk
> >> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush map has features 432629308056666112, adjusting msgr requires for clients
> >> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush map has features 432629308056666112 was 8705, adjusting msgr requires for mons
> >> >> >> >> >> >> > 2019-11-08 09:05:33.337 7fd1a36eed80  0 osd.0 1795 crush map has features 1009090060360105984, adjusting msgr requires for osds
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Please let us know what might be the issue. There seem to be
> >> >> >> >> >> >> > no network issues on any of the servers' public and private
> >> >> >> >> >> >> > interfaces.
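One detail worth flagging in the report above: 'ceph-authtool --gen-key'
generates a brand-new random key, so keyrings recreated this way cannot match
the keys the monitors still hold, and anything using them will fail to
authenticate. A minimal sketch for checking and recovering the admin key,
assuming one mon's own keyring at its default path is still usable (the mon
name and paths here are assumptions):

    # fetch the key the cluster actually expects, authenticating as mon.
    ceph -n mon. -k /var/lib/ceph/mon/ceph-cn2/keyring auth get client.admin
    # compare with the freshly generated file
    grep 'key = ' /etc/ceph/ceph.client.admin.keyring
    # if they differ, replace the generated file with the cluster's copy
    ceph -n mon. -k /var/lib/ceph/mon/ceph-cn2/keyring auth get client.admin \
        -o /etc/ceph/ceph.client.admin.keyring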
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com