Re: [ceph-users] Bluestore 32bit max_object_size limit

2019-01-18 Thread KEVIN MICHAEL HRPCEK


On 1/18/19 7:26 AM, Igor Fedotov wrote:

Hi Kevin,

On 1/17/2019 10:50 PM, KEVIN MICHAEL HRPCEK wrote:
Hey,

I recall reading about this somewhere but I can't find it in the docs or list 
archive, and confirmation from a dev or someone who knows for sure would be 
nice. What I recall is that BlueStore has a maximum 4GB object size limit that 
comes from the design of BlueStore, not from the osd_max_object_size setting. 
The BlueStore source seems to suggest this by defining OBJECT_MAX_SIZE as a 
32-bit maximum, returning an error if osd_max_object_size is > OBJECT_MAX_SIZE, 
and refusing to write the data if offset+length >= OBJECT_MAX_SIZE. So it seems 
the per-object size on an OSD can't exceed 32 bits, which is 4GB, like FAT32. 
Am I correct, or am I reading all this wrong?

You're correct, BlueStore doesn't support objects larger than 
OBJECT_MAX_SIZE (i.e. 4GB).

Thanks for confirming that!


If BlueStore has a hard 4GB object limit, using radosstriper to break up an 
object would work, but does using an EC pool, which breaks the object into 
shards smaller than OBJECT_MAX_SIZE, have the same effect as radosstriper for 
getting around the 4GB limit? We use rados directly and would like to move to 
BlueStore, but we have some large objects <= 13G that may need attention if 
this 4GB limit does exist and an EC pool doesn't get around it.
Theoretically, object splitting via EC might help. But I'm not sure whether one 
needs to set osd_max_object_size greater than 4GB to permit 13GB objects in an 
EC pool. If that is needed, then the osd_max_object_size <= OBJECT_MAX_SIZE 
constraint is violated and BlueStore won't start.
In my experience I had to increase osd_max_object_size from the 128M default 
(which it changed to a couple of versions ago) to ~20G to be able to write our 
largest objects with some margin. Do you think there is another way to handle 
osd_max_object_size > OBJECT_MAX_SIZE so that BlueStore will start, and EC 
pools or striping can be used to write objects that are larger than 
OBJECT_MAX_SIZE, as long as each stripe/shard ends up smaller than 
OBJECT_MAX_SIZE after striping or being placed in an EC pool?
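
For a rough sense of the shard sizes involved, here is a back-of-the-envelope
sketch in C++ (not from this thread; the k=6, m=2 EC profile is an assumption)
showing how the data-chunk count k of an EC profile determines the per-OSD
shard size that BlueStore actually has to store:

#include <cstdint>
#include <cstdio>

int main() {
  // BlueStore's hard per-object limit (OBJECT_MAX_SIZE, just under 4 GiB).
  const uint64_t object_max_size = 0xffffffffULL;
  // Example logical object size from this thread: 13 GiB.
  const uint64_t object_size = 13ULL << 30;
  // Assumed EC profile: k data chunks + m coding chunks.
  const unsigned k = 6, m = 2;

  // Each data shard stores roughly object_size / k bytes (this ignores
  // padding up to the EC stripe width).
  const uint64_t shard_size = (object_size + k - 1) / k;

  std::printf("13 GiB object in a %u+%u EC pool -> ~%llu MiB per shard "
              "(BlueStore limit ~%llu MiB)\n",
              k, m,
              (unsigned long long)(shard_size >> 20),
              (unsigned long long)(object_max_size >> 20));
  std::printf("shard under the BlueStore limit: %s\n",
              shard_size < object_max_size ? "yes" : "no");
  return 0;
}

With k=6, the 13 GiB object works out to roughly 2.2 GiB per shard, well under
OBJECT_MAX_SIZE; the open question above is whether the osd_max_object_size
check applies to the logical object or only to the per-OSD shard.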



https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395

  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex
         << OBJECT_MAX_SIZE << "; BlueStore has hard limit of 0x"
         << OBJECT_MAX_SIZE << "." << std::dec << dendl;
    return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }
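
To see that limit from the client side, here is a rough librados sketch (the
pool and object names are placeholders I made up) that attempts a small write
at an offset just past 4 GiB. Depending on the osd_max_object_size in effect
the OSD may refuse it even earlier, but per the code above BlueStore itself
would answer -E2BIG:

#include <rados/librados.hpp>
#include <cerrno>
#include <iostream>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");            // assumes a client.admin keyring is available
  cluster.conf_read_file(nullptr);  // default ceph.conf search path
  if (cluster.connect() < 0) {
    std::cerr << "could not connect to cluster" << std::endl;
    return 1;
  }

  librados::IoCtx io;
  if (cluster.ioctx_create("testpool", io) < 0) {  // "testpool" is a placeholder
    std::cerr << "pool not found" << std::endl;
    return 1;
  }

  librados::bufferlist bl;
  bl.append(std::string(4096, 'x'));

  // 4 KiB write at offset 4 GiB: offset + length >= OBJECT_MAX_SIZE, so if
  // the op gets as far as BlueStore it is rejected with -E2BIG (the OSD may
  // reject it even earlier against osd_max_object_size).
  int r = io.write("big-object", bl, bl.length(), 4ULL << 30);
  std::cout << "write returned " << r
            << (r == -E2BIG ? " (E2BIG)" : "") << std::endl;

  cluster.shutdown();
  return 0;
}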

Thanks!
Kevin


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison





Thanks,

Igor



[ceph-users] Bluestore 32bit max_object_size limit

2019-01-17 Thread KEVIN MICHAEL HRPCEK
Hey,

I recall reading about this somewhere but I can't find it in the docs or list 
archive, and confirmation from a dev or someone who knows for sure would be 
nice. What I recall is that BlueStore has a maximum 4GB object size limit that 
comes from the design of BlueStore, not from the osd_max_object_size setting. 
The BlueStore source seems to suggest this by defining OBJECT_MAX_SIZE as a 
32-bit maximum, returning an error if osd_max_object_size is > OBJECT_MAX_SIZE, 
and refusing to write the data if offset+length >= OBJECT_MAX_SIZE. So it seems 
the per-object size on an OSD can't exceed 32 bits, which is 4GB, like FAT32. 
Am I correct, or am I reading all this wrong?

If BlueStore has a hard 4GB object limit, using radosstriper to break up an 
object would work, but does using an EC pool, which breaks the object into 
shards smaller than OBJECT_MAX_SIZE, have the same effect as radosstriper for 
getting around the 4GB limit? We use rados directly and would like to move to 
BlueStore, but we have some large objects <= 13G that may need attention if 
this 4GB limit does exist and an EC pool doesn't get around it.


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L88
#define OBJECT_MAX_SIZE 0xffffffff // 32 bits

https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L4395

  // sanity check(s)
  auto osd_max_object_size =
    cct->_conf.get_val<Option::size_t>("osd_max_object_size");
  if (osd_max_object_size >= (size_t)OBJECT_MAX_SIZE) {
    derr << __func__ << " osd_max_object_size >= 0x" << std::hex
         << OBJECT_MAX_SIZE << "; BlueStore has hard limit of 0x"
         << OBJECT_MAX_SIZE << "." << std::dec << dendl;
    return -EINVAL;
  }


https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L12331
  if (offset + length >= OBJECT_MAX_SIZE) {
    r = -E2BIG;
  } else {
    _assign_nid(txc, o);
    r = _do_write(txc, c, o, offset, length, bl, fadvise_flags);
    txc->write_onode(o);
  }

Thanks!
Kevin


--
Kevin Hrpcek
NASA SNPP Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison


Re: [ceph-users] Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

2018-09-26 Thread KEVIN MICHAEL HRPCEK
Hey, don't lose hope. I just went through two 3-5 day outages after a mimic 
upgrade with no data loss. I'd recommend looking through the thread about it to 
see how close it is to your issue. From my point of view there seem to be some 
similarities. 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029649.html

At a similar point of desperation with my cluster I would shut all ceph 
processes down and bring them up in order. Doing this had my cluster almost 
healthy a few times until it fell over again due to mon issues. So solving any 
mon issues is the first priority. It seems like you may also benefit from 
setting mon_osd_cache_size to a very large number if you have enough memory on 
your mon servers.

I'll hop on the irc today.

Kevin

On 09/25/2018 05:53 PM, by morphin wrote:

After trying too many things with a lot of help on IRC, my pool
health is still in ERROR and I think I can't recover from this.
https://paste.ubuntu.com/p/HbsFnfkYDT/
In the end, 2 of the 3 mons crashed and started at the same time, and the pool
went offline. Recovery takes more than 12 hours and it is way too slow.
Somehow recovery seems not to be working.

If I can reach my data I will re-create the pool easily.
If I run the ceph-objectstore-tool script to regenerate the mon store.db, can I
access the RBD pool again?
by morphin wrote on Tue, 25 Sep 2018 at 20:03:



Hi,

Cluster is still down :(

Up to now we have managed to stabilize the OSDs. 118 of 160 OSDs are
stable and the cluster is still in the process of settling. Thanks to
Be-El in the ceph IRC channel, who helped a lot to make the
flapping OSDs stable.

What we have learned so far is that this was caused by the unexpected death of
2 of the 3 monitor servers, and that when they come back, if they do not start
one by one (each after joining the cluster), this can happen. The cluster can
be unhealthy and it can take countless hours to come back.

Right now here is our status:
ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
health detail: https://paste.ubuntu.com/p/w4gccnqZjR/

Since the OSD disks are NL-SAS, it can take up to 24 hours for the cluster to
come back online. What's more, we have been told that we would be extremely
lucky if all the data is rescued.

Most unhappily, our strategy is just to sit and wait :(. As soon as the
peering and activating count drops to 300-500 pgs we will restart the
stopped OSDs one by one, and for each OSD we will wait for the cluster to
settle down. The amount of data stored on the OSDs is 33TB. Our main
concern is to export our rbd pool data to a backup space. Then
we will start again with a clean pool.

I hope an expert can validate our analysis. Any help or advice
would be greatly appreciated.
by morphin wrote on Tue, 25 Sep 2018 at 15:08:



Reducing the recovery parameter values did not change much.
There are a lot of OSDs still marked down.

I don't know what I need to do after this point.

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1


ceph -s
  cluster:
id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
health: HEALTH_ERR
42 osds down
1 host (6 osds) down
61/8948582 objects unfound (0.001%)
Reduced data availability: 3837 pgs inactive, 1822 pgs
down, 1900 pgs peering, 6 pgs stale
Possible data damage: 18 pgs recovery_unfound
Degraded data redundancy: 457246/17897164 objects degraded
(2.555%), 213 pgs degraded, 209 pgs undersized
2554 slow requests are blocked > 32 sec
3273 slow ops, oldest one blocked for 1453 sec, daemons
[osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
have slow ops.

  services:
mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3,
SRV-SEKUARK4
osd: 168 osds: 118 up, 160 in

  data:
pools:   1 pools, 4096 pgs
objects: 8.95 M objects, 17 TiB
usage:   33 TiB used, 553 TiB / 586 TiB avail
pgs: 93.677% pgs not active
 457246/17897164 objects degraded (2.555%)
 61/8948582 objects unfound (0.001%)
 1676 down
 1372 peering
 528  stale+peering
 164  active+undersized+degraded
 145  stale+down
 73   activating
 40   active+clean
 29   stale+activating
 17   active+recovery_unfound+undersized+degraded
 16   stale+active+clean
 16   stale+active+undersized+degraded
 9activating+undersized+degraded
 3active+recovery_wait+degraded
 2activating+undersized
 2activating+degraded
 1creating+down
 1stale+active+recovery_unfound+undersized+degraded
 1stale+active+clean+scrubbing+deep
 1

Re: [ceph-users] Mimic upgrade failure

2018-09-24 Thread KEVIN MICHAEL HRPCEK
The cluster is healthy and stable. I'll leave a summary for the archive in case 
anyone else has a similar problem.

centos 7.5
ceph mimic 13.2.1
3 mon/mgr/mds hosts, 862 osd (41 hosts)

This was all triggered by an unexpected ~1 min network blip on our 10Gbit 
switch. The ceph cluster lost connectivity to each other and obviously tried to 
remap everything once connectivity returned and tons of OSDs were being marked 
down. This was made worse by the OSDs trying to use large amounts of memory 
while recovering and ending up swapping and hanging, and by me having to ipmi reset hosts.  
All of this caused a lot of osd map changes and the mons will have stored all 
of them without trimming due to the unhealthy PGs. I was able to get almost all 
PGs active and clean on a few occasions but the cluster would fall over again 
after about 2 hours with cephx auth errors or OSDs trying to mark each other 
down (the mons seemed to not be rotating cephx auth keys). Setting 
'osd_heartbeat_interval = 30' helped a bit, but I eventually disabled process 
cephx auth with 'auth_cluster_required = none'. Setting that stopped the OSDs 
from falling over after 2 hours. From the beginning of this the MONs were 
running 100% on the ms_dispatch thread and constantly reelecting a leader every 
minute and not holding a consistent quorum with paxos lease_timeouts in the 
logs. The ms_dispatch was reading through the 
/var/lib/ceph/mon/mon-$hostname/store.db/*.sst constantly and strace showed 
this taking anywhere from 60 seconds to a couple minutes. This was almost all 
cpu user time and not much iowait. I think what was happening is that the mons 
failed health checks due to spending so much time constantly reading through 
the db and that held up other mon tasks which caused constant reelections.

We eventually reduced the MON reelections by finding that the average ms_dispatch 
sst read time on the rank 0 mon was 65 seconds and setting 'mon_lease = 75' so 
that the paxos lease would last longer than ms_dispatch running at 100%.  I also 
greatly increased the rocksdb_cache_size and leveldb_cache_size on the mons to 
be big enough to cache the entire db, but that didn't seem to make much 
difference initially. After working with Sage, he set the mon_osd_cache_size = 
20 (default 10). The huge mon_osd_cache_size let the mons cache all osd 
maps on the first read and the ms_dispatch thread was able to use this cache 
instead of spinning 100% on rereading them every minute. This stopped the 
constant elections because the mon stopped failing health checks and was able 
to complete other tasks. Lastly, there were some self-inflicted OSD corruptions 
from the ipmi resets that needed to be dealt with to get all PGs active+clean, 
and the cephx change was rolled back to operate normally.

Sage, thanks again for your assistance with this.

Kevin

tl;dr Cache as much as you can.
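
For reference, a rough sketch of what the mon-side settings described above
could look like in ceph.conf; the cache values below are illustrative guesses
rather than the exact numbers used here, so size them to the memory available
on your mon hosts:

[mon]
# paxos lease longer than the ~65s ms_dispatch sst scans seen on the rank 0 mon
mon_lease = 75
# cache far more osdmaps than the default of 10 so ms_dispatch stops rereading
# the store every minute (illustrative value)
mon_osd_cache_size = 500000
# large enough to hold the mon store.db in memory (illustrative, 4 GiB here)
rocksdb_cache_size = 4294967296
leveldb_cache_size = 4294967296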



On 09/24/2018 09:24 AM, Sage Weil wrote:

Hi Kevin,

Do you have an update on the state of the cluster?

I've opened a ticket http://tracker.ceph.com/issues/36163 to track the
likely root cause we identified, and have a PR open at
https://github.com/ceph/ceph/pull/24247

Thanks!
sage


On Thu, 20 Sep 2018, Sage Weil wrote:


On Thu, 20 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


Top results when both were taken with ms_dispatch at 100%. The mon one
changes alot so I've included 3 snapshots of those. I'll update
mon_osd_cache_size.

After disabling auth_cluster_required and a cluster reboot I am having
less problems keeping OSDs in the cluster since they seem to not be
having auth problems around the 2 hour uptime mark. The mons still have
their problems but 859/861 OSDs are up with 2 crashing. I found a brief
mention on a forum or somewhere that the mons will only trim their
storedb when the cluster is healthy. If that's true do you think it is
likely that once all osds are healthy and unset some no* cluster flags
the mons will be able to trim their db and the result will be that
ms_dispatch no longer takes to long to churn through the db? Our primary
theory here is that ms_dispatch is taking too long and the mons reach a
timeout and then reelect in a nonstop cycle.



It's the PGs that need to all get healthy (active+clean) before the
osdmaps get trimmed.  Other health warnings (e.g. about noout being set)
aren't related.



ceph-mon
34.24%  34.24%  libpthread-2.17.so    [.] pthread_rwlock_rdlock
+   34.00%  34.00%  libceph-common.so.0   [.] crush_hash32_3



If this is the -g output you need to hit enter on lines like this to see
the call graph...  Or you can do 'perf record -g -p <pid>' and then 'perf
report --stdio' (or similar) to dump it all to a file, fully expanded.

Thanks!
sage



+5.01% 5.01%  libceph-common.so.0   [.] ceph::decode >, 
std::less, mempool::pool_allocator<(mempool::pool_index_t)15, 
std::pair > >, 
std::_Select1st > > >, std::less > >, 
std::_Select1st > > >, std::less::copy
+0.79% 0.79%  libceph-c

Re: [ceph-users] Mimic upgrade failure

2018-09-20 Thread KEVIN MICHAEL HRPCEK
Select1st > > >,
std::less, mempoo
2.92%  libceph-common.so.0   [.] ceph::buffer::ptr::release
2.65%  libceph-common.so.0   [.]
ceph::buffer::list::iterator_impl::advance
2.57%  ceph-mon  [.] std::_Rb_tree > >, std::_Select1st > > >,
std::less, mempoo
2.27%  libceph-common.so.0   [.] ceph::buffer::ptr::ptr
1.99%  libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
1.93%  libc-2.17.so  [.] __memcpy_ssse3_back
1.91%  libceph-common.so.0   [.] ceph::buffer::ptr::append
1.87%  libceph-common.so.0   [.] crush_hash32_3@plt
1.84%  libceph-common.so.0   [.]
ceph::buffer::list::iterator_impl::copy
1.75%  libtcmalloc.so.4.4.5  [.]
tcmalloc::CentralFreeList::FetchFromOneSpans
1.63%  libceph-common.so.0   [.] ceph::encode >,
std::less, mempool::pool_allocator<(mempool::pool_index_t)15,
std::pair > >
1.57%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
1.55%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
1.47%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length
1.33%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
1.09%  libceph-common.so.0   [.] ceph::decode >,
denc_traits >, void> >
1.07%  libtcmalloc.so.4.4.5  [.] operator new[]
1.02%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
1.01%  libtcmalloc.so.4.4.5  [.] tc_posix_memalign
0.85%  ceph-mon  [.] ceph::buffer::ptr::release@plt
0.76%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out@plt
0.74%  libceph-common.so.0   [.] crc32_iscsi_00

strace
munmap(0x7f2eda736000, 2463941) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) =
429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst",
{st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
close(429)  = 0
munmap(0x7f2ea8c97000, 2468005) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) =
429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst",
{st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
close(429)  = 0
munmap(0x7f2ee21dc000, 2472343) = 0

Kevin


On 09/19/2018 06:50 AM, Sage Weil wrote:

On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


Sage,

Unfortunately the mon election problem came back yesterday and it makes
it really hard to get a cluster to stay healthy. A brief unexpected
network outage occurred and sent the cluster into a frenzy and when I
had it 95% healthy the mons started their nonstop reelections. In the
previous logs I sent were you able to identify why the mons are
constantly electing? The elections seem to be triggered by the below
paxos message but do you know which lease timeout is being reached or
why the lease isn't renewed instead of calling for an election?

One thing I tried was to shutdown the entire cluster and bring up only
the mon and mgr. The mons weren't able to hold their quorum with no osds
running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a
time.



This is odd... with no other daemons running I'm not sure what would be
eating up the CPU.  Can you run a 'perf top -p `pidof ceph-mon`' (or
similar) on the machine to see what the process is doing?  You might need
to install ceph-mon-dbg or ceph-debuginfo to get better symbols.



2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos
active c 133382665..133383355) lease_timeout -- calling new election



A workaround is probably to increase the lease timeout.  Try setting
mon_lease = 15 (default is 5... could also go higher than 15) in the
ceph.conf for all of the mons.  This is a bit of a band-aid but should
help you keep the mons in quorum until we sort out what is going on.

sage





Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it is coming up with on
the failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:



Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier
after
everything was on mimic since the random m

Re: [ceph-users] Mimic upgrade failure

2018-09-20 Thread KEVIN MICHAEL HRPCEK
0.19   [.] std::_Rb_tree_increment
   1.93%  libc-2.17.so  [.] __memcpy_ssse3_back
   1.91%  libceph-common.so.0   [.] ceph::buffer::ptr::append
   1.87%  libceph-common.so.0   [.] crush_hash32_3@plt
   1.84%  libceph-common.so.0   [.] 
ceph::buffer::list::iterator_impl::copy
   1.75%  libtcmalloc.so.4.4.5  [.] tcmalloc::CentralFreeList::FetchFromOneSpans
   1.63%  libceph-common.so.0   [.] ceph::encode >, std::less, 
mempool::pool_allocator<(mempool::pool_index_t)15, std::pair > >
   1.57%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out
   1.55%  libstdc++.so.6.0.19   [.] std::_Rb_tree_insert_and_rebalance
   1.47%  libceph-common.so.0   [.] ceph::buffer::ptr::raw_length
   1.33%  libtcmalloc.so.4.4.5  [.] tc_deletearray_nothrow
   1.09%  libceph-common.so.0   [.] ceph::decode >, 
denc_traits >, void> >
   1.07%  libtcmalloc.so.4.4.5  [.] operator new[]
   1.02%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
   1.01%  libtcmalloc.so.4.4.5  [.] tc_posix_memalign
   0.85%  ceph-mon  [.] ceph::buffer::ptr::release@plt
   0.76%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out@plt
   0.74%  libceph-common.so.0   [.] crc32_iscsi_00

strace
munmap(0x7f2eda736000, 2463941) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", 
{st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
close(429)  = 0
munmap(0x7f2ea8c97000, 2468005) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", 
{st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
close(429)  = 0
munmap(0x7f2ee21dc000, 2472343) = 0

Kevin


On 09/19/2018 06:50 AM, Sage Weil wrote:

On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


Sage,

Unfortunately the mon election problem came back yesterday and it makes
it really hard to get a cluster to stay healthy. A brief unexpected
network outage occurred and sent the cluster into a frenzy and when I
had it 95% healthy the mons started their nonstop reelections. In the
previous logs I sent were you able to identify why the mons are
constantly electing? The elections seem to be triggered by the below
paxos message but do you know which lease timeout is being reached or
why the lease isn't renewed instead of calling for an election?

One thing I tried was to shutdown the entire cluster and bring up only
the mon and mgr. The mons weren't able to hold their quorum with no osds
running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a
time.



This is odd... with no other daemons running I'm not sure what would be
eating up the CPU.  Can you run a 'perf top -p `pidof ceph-mon`' (or
similar) on the machine to see what the process is doing?  You might need
to install ceph-mon-dbg or ceph-debuginfo to get better symbols.



2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos
active c 133382665..133383355) lease_timeout -- calling new election



A workaround is probably to increase the lease timeout.  Try setting
mon_lease = 15 (default is 5... could also go higher than 15) in the
ceph.conf for all of the mons.  This is a bit of a band-aid but should
help you keep the mons in quorum until we sort out what is going on.

sage





Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it is coming up with on
the failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:



Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be playing a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of l

Re: [ceph-users] Mimic upgrade failure

2018-09-19 Thread KEVIN MICHAEL HRPCEK
t;, void> >
   1.07%  libtcmalloc.so.4.4.5  [.] operator new[]
   1.02%  libceph-common.so.0   [.] ceph::buffer::list::iterator::copy
   1.01%  libtcmalloc.so.4.4.5  [.] tc_posix_memalign
   0.85%  ceph-mon  [.] ceph::buffer::ptr::release@plt
   0.76%  libceph-common.so.0   [.] ceph::buffer::ptr::copy_out@plt
   0.74%  libceph-common.so.0   [.] crc32_iscsi_00

strace
munmap(0x7f2eda736000, 2463941) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299339.sst", 
{st_mode=S_IFREG|0644, st_size=1658656, ...}) = 0
mmap(NULL, 1658656, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eea87e000
close(429)  = 0
munmap(0x7f2ea8c97000, 2468005) = 0
open("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", O_RDONLY) = 429
stat("/var/lib/ceph/mon/ceph-sephmon1/store.db/26299338.sst", 
{st_mode=S_IFREG|0644, st_size=2484001, ...}) = 0
mmap(NULL, 2484001, PROT_READ, MAP_SHARED, 429, 0) = 0x7f2eda74b000
close(429)  = 0
munmap(0x7f2ee21dc000, 2472343) = 0

Kevin


On 09/19/2018 06:50 AM, Sage Weil wrote:

On Wed, 19 Sep 2018, KEVIN MICHAEL HRPCEK wrote:


Sage,

Unfortunately the mon election problem came back yesterday and it makes
it really hard to get a cluster to stay healthy. A brief unexpected
network outage occurred and sent the cluster into a frenzy and when I
had it 95% healthy the mons started their nonstop reelections. In the
previous logs I sent were you able to identify why the mons are
constantly electing? The elections seem to be triggered by the below
paxos message but do you know which lease timeout is being reached or
why the lease isn't renewed instead of calling for an election?

One thing I tried was to shutdown the entire cluster and bring up only
the mon and mgr. The mons weren't able to hold their quorum with no osds
running and the ceph-mon ms_dispatch thread runs at 100% for > 60s at a
time.



This is odd... with no other daemons running I'm not sure what would be
eating up the CPU.  Can you run a 'perf top -p `pidof ceph-mon`' (or
similar) on the machine to see what the process is doing?  You might need
to install ceph-mon-dbg or ceph-debuginfo to get better symbols.



2018-09-19 03:56:21.729 7f4344ec1700 1 mon.sephmon2@1(peon).paxos(paxos
active c 133382665..133383355) lease_timeout -- calling new election



A workaround is probably to increase the lease timeout.  Try setting
mon_lease = 15 (default is 5... could also go higher than 15) in the
ceph.conf for all of the mons.  This is a bit of a band-aid but should
help you keep the mons in quorum until we sort out what is going on.

sage





Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it is coming up with on
the failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:



Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be playing a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of luminous and mimic
did not play well together for some reason. Maybe it has to do with the scale
of my cluster, 871 osd, or maybe I've missed some tuning as my cluster
has scaled to this size.

Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:


Nothing too crazy for non default settings. Some of those osd settings were
in place while I was testing recovery speeds and need to be brought back
closer to defaults. I was setting nodown before but it seems to mask the
problem. While it's good to stop the osdmap changes, OSDs would come up, get
marked up, but at some point go down again (but the process is still
running) and still stay up in the map. Then when I'd unset nodown the
cluster would immediately mark 250+ osd down again and I'd be back where I
started.

This morning I went ahead and finished the osd upgrades to mimic to remove
that variabl

Re: [ceph-users] Mimic upgrade failure

2018-09-18 Thread KEVIN MICHAEL HRPCEK
Sage,

Unfortunately the mon election problem came back yesterday and it makes it 
really hard to get a cluster to stay healthy. A brief unexpected network outage 
occurred and sent the cluster into a frenzy and when I had it 95% healthy the 
mons started their nonstop reelections. In the previous logs I sent were you 
able to identify why the mons are constantly electing? The elections seem to be 
triggered by the below paxos message but do you know which lease timeout is 
being reached or why the lease isn't renewed instead of calling for an election?

One thing I tried was to shutdown the entire cluster and bring up only the mon 
and mgr. The mons weren't able to hold their quorum with no osds running and 
the ceph-mon ms_dispatch thread runs at 100% for > 60s at a time.

2018-09-19 03:56:21.729 7f4344ec1700  1 mon.sephmon2@1(peon).paxos(paxos active 
c 133382665..133383355) lease_timeout -- calling new election

Thanks
Kevin

On 09/10/2018 07:06 AM, Sage Weil wrote:

I took a look at the mon log you sent.  A few things I noticed:

- The frequent mon elections seem to get only 2/3 mons about half of the
time.
- The messages coming in are mostly osd_failure, and half of those seem to
be recoveries (cancellation of the failure message).

It does smell a bit like a networking issue, or some tunable that relates
to the messaging layer.  It might be worth looking at an OSD log for an
osd that reported a failure and seeing what error code it is coming up with on
the failed ping connection?  That might provide a useful hint (e.g.,
ECONNREFUSED vs EMFILE or something).

I'd also confirm that with nodown set the mon quorum stabilizes...

sage




On Mon, 10 Sep 2018, Kevin Hrpcek wrote:



Update for the list archive.

I went ahead and finished the mimic upgrade with the osds in a fluctuating
state of up and down. The cluster did start to normalize a lot easier after
everything was on mimic since the random mass OSD heartbeat failures stopped
and the constant mon election problem went away. I'm still battling with the
cluster reacting poorly to host reboots or small map changes, but I feel like
my current pg:osd ratio may be playing a factor in that since we are 2x normal
pg count while migrating data to new EC pools.

I'm not sure of the root cause but it seems like the mix of luminous and mimic
did not play well together for some reason. Maybe it has to do with the scale
of my cluster, 871 osd, or maybe I've missed some tuning as my cluster
has scaled to this size.

Kevin


On 09/09/2018 12:49 PM, Kevin Hrpcek wrote:


Nothing too crazy for non default settings. Some of those osd settings were
in place while I was testing recovery speeds and need to be brought back
closer to defaults. I was setting nodown before but it seems to mask the
problem. While it's good to stop the osdmap changes, OSDs would come up, get
marked up, but at some point go down again (but the process is still
running) and still stay up in the map. Then when I'd unset nodown the
cluster would immediately mark 250+ osd down again and I'd be back where I
started.

This morning I went ahead and finished the osd upgrades to mimic to remove
that variable. I've looked for networking problems but haven't found any. 2
of the mons are on the same switch. I've also tried combinations of shutting
down a mon to see if a single one was the problem, but they keep electing no
matter the mix of them that are up. Part of it feels like a networking
problem but I haven't been able to find a culprit yet as everything was
working normally before starting the upgrade. Other than the constant mon
elections, yesterday I had the cluster 95% healthy 3 or 4 times, but it
doesn't last long since at some point the OSDs start trying to fail each
other through their heartbeats.
2018-09-09 17:37:29.079 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
prepare_failure osd.39 10.1.9.2:6802/168438 from osd.49 10.1.9.3:6884/317908
is reporting failure:1
2018-09-09 17:37:29.079 7eff774f5700  0 log_channel(cluster) log [DBG] :
osd.39 10.1.9.2:6802/168438 reported failed by osd.49 10.1.9.3:6884/317908
2018-09-09 17:37:29.083 7eff774f5700  1 mon.sephmon1@0(leader).osd e991282
prepare_failure osd.93 10.1.9.9:6853/287469 from osd.372
10.1.9.13:6801/275806 is reporting failure:1

I'm working on getting things mostly good again with everything on mimic and
will see if it behaves better.

Thanks for your input on this David.


[global]
mon_initial_members = sephmon1, sephmon2, sephmon3
mon_host = 10.1.9.201,10.1.9.202,10.1.9.203
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 10.1.0.0/16
osd backfill full ratio = 0.92
osd failsafe nearfull ratio = 0.90
osd max object size = 21474836480
mon max pg per osd = 350

[mon]
mon warn on legacy crush tunables = false
mon pg warn max per osd = 300
mon osd down out subtree limit = host
mon osd nearfull ratio = 0.90
mon osd full ratio = 0.97
mon health