Re: [ceph-users] cluster is not stable

2019-03-15 Thread Janne Johansson
On Thu, 14 Mar 2019 at 17:00, Zhenshi Zhou wrote: > I think I've found the root cause of the monmap containing no > features. When I moved the servers from one place to another, I modified > the monmap once. If this was the empty cluster that you refused to redo from scratch, then I feel it

Re: [ceph-users] cluster is not stable

2019-03-14 Thread Zhenshi Zhou
Hi huang, I think I've found the root cause of the monmap containing no features. When I moved the servers from one place to another, I modified the monmap once. However, the monmap is not the same on all mons: I modified the monmap on one of the mons, and created it from scratch on the other two

Re: [ceph-users] cluster is not stable

2019-03-14 Thread Zhenshi Zhou
Hi, I'll try that command soon. It's a new cluster installed with mimic. I'm not sure of the exact reason, but as far as I can tell, two things may have caused this issue. One is that I moved these servers from another datacenter to this one, following the steps in [1]. The other is that I created a bridge using the

Re: [ceph-users] cluster is not stable

2019-03-14 Thread huang jun
You can try those commands, but you may need to find the root cause of why the current monmap contains no features at all. Did you upgrade the cluster from luminous to mimic, or is it a new cluster installed with mimic? Zhenshi Zhou wrote on Thu, 14 Mar 2019 at 2:37 PM: > > Hi huang, > > It's a pre-production

Re: [ceph-users] cluster is not stable

2019-03-14 Thread Zhenshi Zhou
Hi huang, It's a pre-production environment. If everything is fine, I'll use it for production. My cluster is version mimic; should I set all the features you listed in the command? Thanks huang jun wrote on Thu, 14 Mar 2019 at 2:11 PM: > sorry, the script should be > for f in kraken luminous mimic

Re: [ceph-users] cluster is not stable

2019-03-14 Thread huang jun
Sorry, the script should be
    for f in kraken luminous mimic osdmap-prune; do
        ceph mon feature set $f --yes-i-really-mean-it
    done
huang jun wrote on Thu, 14 Mar 2019 at 2:04 PM: > > ok, if this is a **test environment**, you can try > for f in 'kraken,luminous,mimic,osdmap-prune'; do > ceph mon feature set
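The difference between the two loop forms posted in this thread can be checked without touching the cluster. This is only a minimal sketch: it echoes the command instead of running it, and the helper name show_feature_cmds is made up for illustration.

```shell
# Hypothetical helper: print (do not run) the command for each loop iteration.
show_feature_cmds() {
    for f in "$@"; do
        echo "ceph mon feature set $f --yes-i-really-mean-it"
    done
}

# Quoted, comma-separated: the shell sees one word, so the loop runs once
# with the whole string as the feature name.
show_feature_cmds 'kraken,luminous,mimic,osdmap-prune'

# Space-separated: four words, so one iteration per feature.
show_feature_cmds kraken luminous mimic osdmap-prune
```

The first call prints a single command with the comma-joined string as its argument; the second prints one command per feature, which appears to be what the corrected script intends.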

Re: [ceph-users] cluster is not stable

2019-03-14 Thread huang jun
OK, if this is a **test environment**, you can try
    for f in 'kraken,luminous,mimic,osdmap-prune'; do
        ceph mon feature set $f --yes-i-really-mean-it
    done
If it is a production environment, you should evaluate the risk first, and maybe set up a test cluster to try it on first. Zhenshi Zhou

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
# ceph mon feature ls
all features
    supported: [kraken,luminous,mimic,osdmap-prune]
    persistent: [kraken,luminous,mimic,osdmap-prune]
on current monmap (epoch 2)
    persistent: [none]
    required: [none]
huang jun wrote on Thu, 14 Mar 2019 at 1:50 PM: > what's the output of 'ceph mon
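The check being made by eye here (is the persistent feature list of the current monmap empty?) could be scripted roughly as follows. This is a sketch under two assumptions: the command prints one field per line, with the "on current monmap" heading on its own line, and the helper name current_persistent is hypothetical.

```shell
# Hypothetical helper: read `ceph mon feature ls` output on stdin and print
# the persistent feature list recorded in the current monmap.
# Assumes one field per line, as the command normally prints it.
current_persistent() {
    awk '
        /on current monmap/ { in_current = 1; next }
        in_current && /persistent:/ {
            sub(/.*persistent:[ \t]*/, "")
            print
            exit
        }
    '
}

# Usage against a live cluster (not run here):
#   ceph mon feature ls | current_persistent
```

For the output quoted in this message the helper would print [none], which matches the symptom discussed in the thread.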

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
What's the output of 'ceph mon feature ls'? From the code, maybe the mon features do not contain luminous:
    void OSD::send_beacon(const ceph::coarse_mono_clock::time_point& now)
    {
      const auto& monmap = monc->monmap;
      // send beacon to mon even if we are just connected, and the

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
Hi, One of the logs shows that the beacon is not being sent, as below:
    2019-03-14 12:41:15.722 7f3c27684700 10 osd.5 17032 tick_without_osd_lock
    2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
    2019-03-14 12:41:15.722 7f3c27684700 20 osd.5 17032

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
An osd will not send beacons to the mon if it is not in the ACTIVE state, so you could set debug_osd=20 on one osd to see what is going on. Zhenshi Zhou wrote on Thu, 14 Mar 2019 at 11:07 AM: > > What's more, I find that the osds don't send beacons all the time, some osds > send beacons > for a period of time and then

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
Hi, I set the config on every osd and checked whether all osds send beacons to the monitors. The result shows that only some of the osds send beacons, and the monitor receives every beacon that the osds do send. But why do some osds not send beacons? huang jun wrote on Wed, 13 Mar 2019 at 11:02 PM: > sorry for

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
Sorry for not making it clear: you may need to set osd_beacon_report_interval = 5 and debug_ms=1 on one of your osds and then restart the osd process, then check the osd log with 'grep beacon /var/log/ceph/ceph-osd.$id.log' to make sure the osd sends beacons to the mon. If the osd sends beacons to the mon, you should
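The log check suggested above could be wrapped in a small function so it is easy to repeat for several osd ids. A minimal sketch, assuming the log path quoted in this message; the helper name count_beacon_lines and its optional directory argument are made up for illustration.

```shell
# Hypothetical helper: count the lines mentioning "beacon" in one osd log.
# $1 is the osd id; $2 optionally overrides the log directory, which
# defaults to the path quoted in the message above.
count_beacon_lines() {
    id=$1
    logdir=${2:-/var/log/ceph}
    grep -c beacon "$logdir/ceph-osd.$id.log"
}

# Usage (not run here): count_beacon_lines 5
```

A count that stays at zero after the restart would suggest that this osd is not sending beacons at all.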

Re: [ceph-users] cluster is not stable

2019-03-13 Thread Zhenshi Zhou
And now new errors are showing up. [image: image.png] Zhenshi Zhou wrote on Wed, 13 Mar 2019 at 2:58 PM: > Hi, > > I hadn't set osd_beacon_report_interval, so it must have been at the default value. > I have now set osd_beacon_report_interval to 60 and debug_mon to 10. > > Attached is the leader monitor log, the

Re: [ceph-users] cluster is not stable

2019-03-13 Thread huang jun
Can you get the value of the osd_beacon_report_interval option? The default is 300; you can set it to 60, or maybe turn on debug_ms=1 and debug_mon=10 to get more information. Zhenshi Zhou wrote on Wed, 13 Mar 2019 at 1:20 PM: > > Hi, > > The servers are connected to the same switch. > I can ping from any one of the servers

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi, The servers are connected to the same switch. I can ping from any one of the servers to the other servers without any packet loss, and the average round trip time is under 0.1 ms. Thanks Ashley Merrick wrote on Wed, 13 Mar 2019 at 12:06 PM: > Can you ping all your OSD servers from all your mons, and ping your

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Ashley Merrick
Can you ping all your OSD servers from all your mons, and ping your mons from all your OSD servers? I’ve seen this where a route wasn’t working in one direction, so it made OSDs flap when it used that mon to check availability: On Wed, 13 Mar 2019 at 11:50 AM, Zhenshi Zhou wrote: > After checking
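The two-way reachability check described above could be scripted along these lines and run once on every mon and every OSD server. It is only a sketch: the helper name unreachable_hosts and the host names in the usage comment are placeholders, not names from this cluster.

```shell
# Hypothetical helper: ping each host given as an argument and print the
# ones that do not answer. Returns non-zero if any host is unreachable.
unreachable_hosts() {
    rc=0
    for h in "$@"; do
        if ! ping -c 3 "$h" >/dev/null 2>&1; then
            echo "$h"
            rc=1
        fi
    done
    return $rc
}

# Usage (placeholder host names, not run here):
#   unreachable_hosts mon1 mon2 mon3 osd-host1 osd-host2
```

Because a route can fail in one direction only, the check is only meaningful when it is run from both sides.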

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
After checking the network and syslog/dmesg, I think it's not a network or hardware issue. Some osds are now being marked down every 15 minutes. Here is ceph.log:
    2019-03-13 11:06:26.290701 mon.ceph-mon1 mon.0 10.39.0.34:6789/0 6756 : cluster [INF] Cluster is now healthy
    2019-03-13

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Hi Kevin, I'm sure firewalld is disabled on each host. The network is not the problem: the servers are connected to the same switch and the connection is good while the osds are marked down. There was no interruption or delay. I restarted the leader monitor daemon and it seems to return

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Kevin Olbrich
Are you sure that firewalld is stopped and disabled? It looked exactly like that when I missed one host in a test cluster. Kevin On Tue, 12 Mar 2019 at 09:31, Zhenshi Zhou wrote: > Hi, > > I deployed a ceph cluster with good performance. But the logs > indicate that the cluster is not as

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Zhenshi Zhou
Yep, I think it may be a network issue as well. I'll check the connections. Thanks Eugen :) Eugen Block wrote on Tue, 12 Mar 2019 at 4:35 PM: > Hi, > > my first guess would be a network issue. Double-check your connections > and make sure the network setup works as expected. Check syslogs, > dmesg, switches

Re: [ceph-users] cluster is not stable

2019-03-12 Thread Eugen Block
Hi, my first guess would be a network issue. Double-check your connections and make sure the network setup works as expected. Check syslogs, dmesg, switches etc. for hints that a network interruption may have occurred. Regards, Eugen Quoting Zhenshi Zhou: Hi, I deployed a ceph