Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-28 Thread Dan van der Ster
Hi Wido,

Are your mons using rocksdb or still leveldb?

Are your mon stores trimming back to a small size after HEALTH_OK was restored?

One v12.2.2 cluster here just started showing the "is using a lot of
disk space" warning on one of our mons. In fact, all three mons are now
using >16GB. I tried compacting and resyncing an empty mon, but those
don't trim anything -- there really is 16GB of data in the mon store for
this healthy cluster.
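For anyone else hitting this, the compaction attempt described above can
be reproduced with something like the following (the mon ID "a" and the
default cluster name are placeholders):

```shell
# Ask a running mon to compact its backing store (rocksdb or leveldb)
ceph tell mon.a compact

# Then check the on-disk store size on the mon host
du -sh /var/lib/ceph/mon/ceph-a/store.db
```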

(The mons on this cluster were using ~560MB before updating to
Luminous back in December.)

Any thoughts?

Cheers, Dan


On Sat, Feb 3, 2018 at 4:50 PM, Wido den Hollander  wrote:
> Hi,
>
> I just wanted to inform people about the fact that Monitor databases can
> grow quite big when you have a large cluster which is performing a very long
> rebalance.
>
> I'm posting this on ceph-users and ceph-large as it applies to both, but
> you'll see this sooner on a cluster with a lot of OSDs.
>
> Some information:
>
> - Version: Luminous 12.2.2
> - Number of OSDs: 2175
> - Data used: ~2PB
>
> We are in the middle of migrating from FileStore to BlueStore and this is
> causing a lot of PGs to backfill at the moment:
>
>  33488 active+clean
>  4802  active+undersized+degraded+remapped+backfill_wait
>  1670  active+remapped+backfill_wait
>  263   active+undersized+degraded+remapped+backfilling
>  250   active+recovery_wait+degraded
>  54    active+recovery_wait+degraded+remapped
>  27    active+remapped+backfilling
>  13    active+recovery_wait+undersized+degraded+remapped
>  2     active+recovering+degraded
>
> This has been running for a few days now and it has caused this warning:
>
> MON_DISK_BIG mons
> srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a
> lot of disk space
> mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>
> This is to be expected, as MONs do not trim their store while one or more
> PGs are not active+clean.
>
> In this case we expected this and the MONs are each running on a 1TB Intel
> DC-series SSD to make sure we do not run out of space before the backfill
> finishes.
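As a side note, the 15360 MB threshold in the warning above is the
mon_data_size_warn setting and can be raised for the duration of a long
rebalance. A sketch (the 64 GB value is just an example, and injectargs
changes are not persistent across mon restarts):

```shell
# Check how big each mon's store currently is (default cluster name assumed)
du -sh /var/lib/ceph/mon/ceph-*/store.db

# Temporarily raise the MON_DISK_BIG threshold to 64 GB (value is in bytes)
ceph tell mon.* injectargs '--mon_data_size_warn=68719476736'
```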
>
> The cluster is spread out over racks and in CRUSH we replicate over racks.
> Rack by rack we are wiping/destroying the OSDs and bringing them back as
> BlueStore OSDs and letting the backfill handle everything.
>
> In between we wait for the cluster to become HEALTH_OK (all PGs
> active+clean) so that the Monitors can trim their database before we start
> with the next rack.
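The per-OSD step in such a rack-by-rack FileStore-to-BlueStore migration
can be sketched roughly as follows with the Luminous-era tooling (OSD id
and device are placeholders; treat this as an outline, not a tested
procedure -- in particular, check that your ceph-volume version supports
the flags shown):

```shell
ID=123           # OSD being converted (placeholder)
DEV=/dev/sdb     # its data device (placeholder)

systemctl stop ceph-osd@$ID                     # stop the FileStore OSD
ceph osd destroy $ID --yes-i-really-mean-it     # mark destroyed, keep the id
ceph-volume lvm zap $DEV --destroy              # wipe the old FileStore data
ceph-volume lvm create --bluestore --data $DEV --osd-id $ID  # recreate as BlueStore

# then wait for HEALTH_OK (all PGs active+clean) before the next OSD/rack
```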
>
> I just want to warn and inform people about this. Under normal circumstances
> a MON database isn't that big, but if you have a very long period of
> backfills/recoveries and also have a large number of OSDs you'll see the DB
> grow quite big.
>
> This has improved significantly going to Jewel and Luminous, but it is still
> something to watch out for.
>
> Make sure your MONs have enough free space to handle this!
>
> Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-09 Thread Anthony D'Atri
Thanks, Wido -- words to live by.

I had all kinds of problems with mon DBs not compacting under Firefly, which 
really pointed out the benefit of having ample space on the mons -- and the 
necessity of having those DBs live on something faster than an LFF HDD.

I've had this happen when using ceph-gentle-reweight to slowly bring in a large 
population of new OSDs.  Breaking that into phases helps a bunch, or setting a 
large -i interval.
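ceph-gentle-reweight (from CERN's ceph-scripts repository) essentially
automates a loop like the following hand-rolled sketch; the OSD id, target
weight, and step size are made-up example values:

```shell
OSD=123; TARGET=5.46; STEP=0.5   # example values
CUR=0
while awk -v c=$CUR -v t=$TARGET 'BEGIN{exit !(c < t)}'; do
  # raise the CRUSH weight by one step, capped at the target
  CUR=$(awk -v c=$CUR -v s=$STEP -v t=$TARGET 'BEGIN{n=c+s; if (n>t) n=t; print n}')
  ceph osd crush reweight osd.$OSD $CUR
  # let the resulting backfill finish before taking the next step
  while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
done
```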



Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-09 Thread Matthew Vernon
On 05/02/18 15:54, Wes Dillingham wrote:
> Good data point on not trimming when non active+clean PGs are present.
> So am I reading this correct? It grew to 32GB? Did it end up growing
> beyond that, what was the max?

The largest Mon store size I've seen (in a 3000-OSD cluster) was about 66GB.

Regards,

Matthew




Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-05 Thread Wido den Hollander



On 02/05/2018 04:54 PM, Wes Dillingham wrote:
> Good data point on not trimming when non active+clean PGs are present.
> So am I reading this correct? It grew to 32GB? Did it end up growing
> beyond that, what was the max? Also is only ~18 PGs per OSD a reasonable
> amount of PGs per OSD? I would think about quadruple that would be
> ideal. Is this an artifact of a steadily growing cluster or a design choice?




The backfills are still busy and the MONs are at 39GB right now. Still 
have plenty of space left.


Regarding the PGs, it's a long story, but two-sided:

1. This is an archive running on Atom 8-core CPUs to keep power 
consumption low, so we went low on the number of PGs.
2. The system is still growing, and after adding OSDs recently we haven't 
increased the number of PGs yet.
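For context on the "~18 PGs per OSD" figure in Wes's question: it follows
from the PG totals earlier in the thread. A quick sanity check (the
replica count of 3 is an assumption based on the rack-level replication
described):

```shell
TOTAL_PGS=40569   # sum of the PG state counts Wido posted
OSDS=2175
REPLICAS=3        # assumed from the rack-level replication rule

# PGs per OSD ignoring replication, and PG copies actually hosted per OSD
awk -v p=$TOTAL_PGS -v o=$OSDS -v r=$REPLICAS \
  'BEGIN{printf "PGs/OSD: %.1f, PG copies/OSD: %.1f\n", p/o, p*r/o}'
# -> PGs/OSD: 18.7, PG copies/OSD: 56.0
```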


On Sat, Feb 3, 2018 at 10:50 AM, Wido den Hollander wrote:

> [snip -- full quote of the original message]



Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-05 Thread Wes Dillingham
Good data point on not trimming when non active+clean PGs are present. So
am I reading this correct? It grew to 32GB? Did it end up growing beyond
that, what was the max? Also is only ~18 PGs per OSD a reasonable amount of
PGs per OSD? I would think about quadruple that would be ideal. Is this an
artifact of a steadily growing cluster or a design choice?

On Sat, Feb 3, 2018 at 10:50 AM, Wido den Hollander wrote:

> [snip -- full quote of the original message]



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204


Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-03 Thread Sage Weil
On Sat, 3 Feb 2018, Wido den Hollander wrote:
> [snip]
> 
> Make sure your MONs have enough free space to handle this!

Yes!

Just a side note that Joao has an elegant fix for this that allows the mon 
to trim most of the space-consuming full osdmaps. It's still a work in 
progress but is likely to be backported to Luminous.
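The full osdmaps in question can be seen piling up by comparing the oldest
and newest osdmap epochs the mons are holding; a sketch (field names as
they appear in the ceph report output on Luminous; jq assumed available):

```shell
# The gap between these two epochs is roughly the number of full osdmaps
# the mons still retain; it only trims back once the cluster is HEALTH_OK
ceph report 2>/dev/null | jq '.osdmap_first_committed, .osdmap_last_committed'
```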

sage