[ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-17 Thread Thomas Byrne - UKRI STFC
Hi all, as far as I understand, the monitor stores will grow while the cluster is not HEALTH_OK, as they hold onto all cluster maps. Is this true for all HEALTH_WARN reasons? Our cluster recently went into HEALTH_WARN due to a few weeks of backfilling onto new hardware pushing the monitors' data stores over
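For context, this is straightforward to keep an eye on: the monitor's store lives under its data directory, and the size warning has a configurable threshold. A minimal sketch, assuming the default mon data path (adjust the mon ID for your deployment):

    # Rough check of a monitor's store size, run on the mon host
    du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db

    # The size warning fires once the store passes mon_data_size_warn
    # (15 GiB by default); it can be raised in ceph.conf under [mon]
    # if the growth is expected and temporary, e.g.:
    #   mon_data_size_warn = 21474836480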

Re: [ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-17 Thread Thomas Byrne - UKRI STFC
On 05/17/2018 04:37 PM, Thomas Byrne - UKRI STFC wrote: > Hi all, as far as I understand, the monitor stores will grow while not HEALTH_OK as they

Re: [ceph-users] A question about HEALTH_WARN and monitors holding onto cluster maps

2018-05-21 Thread Thomas Byrne - UKRI STFC
it with its full weight and having everything move at once. On Thu, May 17, 2018 at 12:56 PM Thomas Byrne - UKRI STFC <tom.by...@stfc.ac.uk> wrote: That seems like a sane way to do it, thanks for the clarification Wido. As a follow-up, do you

Re: [ceph-users] Balancing cluster with large disks - 10TB HHD

2019-01-02 Thread Thomas Byrne - UKRI STFC
Assuming I understand it correctly: "pg_upmap_items 6.0 [40,20]" refers to replacing (upmapping?) osd.40 with osd.20 in the acting set of the placement group '6.0'. Assuming it's a 3 replica PG, the other two OSDs in the set remain unchanged from the CRUSH calculation. "pg_upmap_items 6.6
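For anyone wanting to poke at these by hand, the upmap exceptions can be listed and set directly (a sketch, assuming a Luminous-or-later cluster; the PG and OSD IDs are just the ones quoted above):

    # Show the upmap exceptions currently in the osdmap
    ceph osd dump | grep pg_upmap_items

    # Map PG 6.0 so that osd.40 is swapped for osd.20 in its set
    ceph osd pg-upmap-items 6.0 40 20

    # Drop the exception again and fall back to plain CRUSH placement
    ceph osd rm-pg-upmap-items 6.0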

Re: [ceph-users] ceph health JSON format has changed sync?

2019-01-02 Thread Thomas Byrne - UKRI STFC
I recently spent some time looking at this; I believe the 'summary' and 'overall_status' sections are now deprecated. The 'status' and 'checks' fields are the ones to use now. The 'status' field gives you the OK/WARN/ERR, but returning the most severe error condition from the 'checks' section
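As an illustration, the newer fields are easy to pull out of the JSON (a sketch, assuming jq is available; which check names appear under 'checks' depends on the cluster's state):

    # Overall state: HEALTH_OK / HEALTH_WARN / HEALTH_ERR
    ceph health detail --format json | jq -r '.status'

    # Name and severity of each active health check
    ceph health detail --format json | jq -r '.checks | to_entries[] | "\(.key) \(.value.severity)"'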

Re: [ceph-users] ceph health JSON format has changed

2019-01-02 Thread Thomas Byrne - UKRI STFC
> In previous versions of Ceph, I was able to determine which PGs had scrub errors, and then a cron.hourly script ran "ceph pg repair" for them, provided that they were not already being scrubbed. In Luminous, the bad PG is not visible in "ceph --status" anywhere. Should I use
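For what it's worth, the inconsistent PGs are still reported in Luminous, just via the health checks and the rados tooling rather than the status summary. A sketch of one way to find and repair them (the pool name is a placeholder, and the grep assumes the usual 'pg X.Y is active+clean+inconsistent' wording in 'ceph health detail'):

    # List inconsistent PGs in one pool
    rados list-inconsistent-pg <poolname>

    # Or pull them out of the health detail and kick off repairs
    ceph health detail | grep 'active+clean+inconsistent' | awk '{print $2}' | \
      while read pg; do ceph pg repair "$pg"; done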

Re: [ceph-users] Is it possible to increase Ceph Mon store?

2019-01-08 Thread Thomas Byrne - UKRI STFC
For what it's worth, I think the behaviour Pardhiv and Bryan are describing is not quite normal, and sounds similar to something we see on our large Luminous cluster with elderly (created as Jewel?) monitors. After large operations which result in the mon stores growing to 20GB+, leaving the
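For anyone seeing the same thing, the store can usually be shrunk back down by compacting it once the cluster has settled (a sketch; the mon ID is deployment-specific):

    # Ask a running monitor to compact its store
    ceph tell mon.<id> compact

    # Or have it compact on every (re)start via ceph.conf, under [mon]:
    #   mon_compact_on_start = true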

Re: [ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-25 Thread Thomas Byrne - UKRI STFC
On Mon, Jun 24, 2019 at 9:06 AM Thomas Byrne - UKRI STFC wrote: > Hi all, some bluestore OSDs in

[ceph-users] OSDs taking a long time to boot due to 'clear_temp_objects', even with fresh PGs

2019-06-24 Thread Thomas Byrne - UKRI STFC
Hi all, some bluestore OSDs in our Luminous test cluster have started becoming unresponsive and booting very slowly. These OSDs have been used for stress testing of hardware destined for our production cluster, so have had a number of pools on them with many, many objects in the past.

Re: [ceph-users] Scrub start-time and end-time

2019-08-14 Thread Thomas Byrne - UKRI STFC
Hi Torben, > Is it allowed to have the scrub period cross midnight? e.g. have start time at 22:00 and end time 07:00 next morning. Yes, I think that's the way it is mostly used, primarily to reduce the scrub impact during waking/working hours. > I assume that if you only configure the
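For reference, the window is just the begin/end hour pair, and crossing midnight is simply begin > end. A sketch using the centralised config store (Mimic and later; on older releases the same options go in ceph.conf under [osd]):

    # Schedule regular scrubs only between 22:00 and 07:00 local time
    # (overdue scrubs can still run outside the window)
    ceph config set osd osd_scrub_begin_hour 22
    ceph config set osd osd_scrub_end_hour 7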

[ceph-users] Help understanding EC object reads

2019-08-29 Thread Thomas Byrne - UKRI STFC
Hi all, I'm investigating an issue with the (non-Ceph) caching layers of our large EC cluster. It seems to be turning users' requests for whole objects into lots of small byte-range requests reaching the OSDs, but I'm not sure how inefficient this behaviour is in reality. My limited
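To put rough numbers on it, here is a back-of-the-envelope sketch of how many chunk reads a small byte-range request could turn into, assuming each request gets rounded up to whole stripes. The 8+3 profile and 4 KiB stripe unit are made-up example values, not a statement about our pool:

    # Hypothetical EC profile: k=8 data chunks, 4 KiB stripe unit
    k=8; stripe_unit=$((4 * 1024))
    stripe_width=$((k * stripe_unit))      # user data per stripe (32 KiB here)

    off=1000000; len=65536                 # one example client byte-range read
    first=$((off / stripe_width))
    last=$(((off + len - 1) / stripe_width))
    stripes=$((last - first + 1))
    echo "covers $stripes stripe(s) -> roughly $((stripes * k)) chunk reads of $stripe_unit bytes each"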

Re: [ceph-users] Help understanding EC object reads

2019-09-16 Thread Thomas Byrne - UKRI STFC
On Thu, Aug 29, 2019 at 4:57 AM Thomas Byrne - UKRI STFC wrote: > Hi all, I’m investigating an is

Re: [ceph-users] How to add 100 new OSDs...

2019-07-25 Thread Thomas Byrne - UKRI STFC
As a counterpoint, adding large amounts of new hardware gradually (or more specifically, in a few steps) has a few benefits IMO. - Being able to pause the operation and confirm the new hardware (and cluster) is operating as expected. You can identify problems with hardware with OSDs at 10%
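If it helps, the stepped approach is easy to script by bringing the new OSDs in at a fraction of their final CRUSH weight and raising it in stages. A sketch only; the OSD IDs and the 9.1 final weight for a ~10TB drive are illustrative:

    # Stop new OSDs taking data the moment they are created (ceph.conf, [osd]):
    #   osd_crush_initial_weight = 0

    # Bring a batch in at roughly 10% of their final weight
    for id in 100 101 102; do ceph osd crush reweight osd.$id 0.9; done

    # ...wait for backfill to settle and check the hardware, then step up
    for id in 100 101 102; do ceph osd crush reweight osd.$id 9.1; done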