Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Troy Ablan
On 07/18/2018 06:37 PM, Brad Hubbard wrote: > On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan wrote: >> On 07/17/2018 11:14 PM, Brad Hubbard wrote: >>> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan wrote:

[ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-17 Thread Troy Ablan
I was on 12.2.5 for a couple weeks and started randomly seeing corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke loose. I panicked and moved to Mimic, and when that didn't solve the problem, only then did I start to root around in mailing lists archives. It appears I can't

Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Troy Ablan
On 07/17/2018 11:14 PM, Brad Hubbard wrote: On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan wrote: I was on 12.2.5 for a couple weeks and started randomly seeing corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke loose. I panicked and moved to Mimic, and when that didn't

Re: [ceph-users] RAID question for Ceph

2018-07-18 Thread Troy Ablan
On 07/18/2018 07:44 PM, Satish Patel wrote: > If I have 8 OSD drives in a server on a P410i RAID controller (HP), and I > want to make this server an OSD node, how should I configure the RAID? > > 1. Put all drives in RAID-0? > 2. Put individual HDDs in RAID-0 and create 8 individual
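The preview cuts off before the answer, but the usual guidance for a P410i (which has no true HBA/IT mode) is one single-disk RAID-0 logical drive per physical disk, then one OSD per logical drive. A minimal sketch, assuming ssacli is available; the controller slot and drive addressing (slot=0, 1I:1:1) and the /dev/sdb device name are illustrative only:

    # create a single-drive RAID-0 logical drive for one physical disk (repeat per disk)
    ssacli ctrl slot=0 create type=ld drives=1I:1:1 raid=0
    # hand the resulting block device to Ceph as its own OSD
    ceph-volume lvm create --data /dev/sdb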

Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-19 Thread Troy Ablan
>> I'm on IRC (as MooingLemur) if more real-time communication would help :) > Sure, I'll try to contact you there. In the meantime could you open up a tracker showing the crash stack trace above and a brief description of the current situation and the events leading up to it? Could

[ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-13 Thread Troy Ablan
I've opened a tracker issue at https://tracker.ceph.com/issues/41240 Background: Cluster of 13 hosts, 5 of which contain 14 SSD OSDs between them, plus 409 HDDs. The SSDs contain the RGW index and log pools, and some smaller pools. The HDDs contain all other pools, including the RGW

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Troy Ablan
On 8/18/19 6:43 PM, Brad Hubbard wrote: That's this code.
3114   switch (alg) {
3115   case CRUSH_BUCKET_UNIFORM:
3116     size = sizeof(crush_bucket_uniform);
3117     break;
3118   case CRUSH_BUCKET_LIST:
3119     size = sizeof(crush_bucket_list);
3120     break;
3121   case
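Since the abort happens while decoding CRUSH buckets out of an osdmap, one way to check whether the monitors' current map is itself sane is to pull it and decode it offline. A sketch using standard tools (file paths are placeholders):

    # grab the current crush map from the monitors and decompile it
    ceph osd getcrushmap -o /tmp/crushmap.bin
    crushtool -d /tmp/crushmap.bin -o /tmp/crushmap.txt
    # the full osdmap can be inspected the same way
    ceph osd getmap -o /tmp/osdmap.bin
    osdmaptool /tmp/osdmap.bin --print

If these decode cleanly, the bad copy is more likely the one stored locally on the affected OSDs rather than the map the monitors are serving.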

Re: [ceph-users] RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-19 Thread Troy Ablan
While I'm still unsure how this happened, this is what was done to resolve it. Started the OSD in the foreground with debug 10 and watched for the most recent osdmap epoch mentioned before the abort(). For example, if it mentioned that it had just tried to load 80896 and then crashed: # ceph osd getmap -o
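A sketch of the loop described here, using the osd.45 path and the 80896 epoch that appear in the surrounding messages; the exact debug flags and option spelling are assumptions:

    # run the OSD in the foreground and note the last osdmap epoch logged before the abort
    ceph-osd -f -i 45 --debug_osd 10
    # fetch that epoch's map from the monitors
    ceph osd getmap 80896 -o /tmp/osdmap.80896
    # with the OSD stopped, write the known-good copy into the OSD's local store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-45 --op set-osdmap --file /tmp/osdmap.80896
    # repeat for any further epochs it complains about, then start the OSD normally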

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-19 Thread Troy Ablan
Yes, it's possible that they do, but since all of the affected OSDs are still down and the monitors have been restarted since, all of those pools have pgs that are in unknown state and don't return anything in ceph pg ls. There weren't that many placement groups for the SSDs, but also I don't
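For reference, a couple of ways to see which PGs have gone to 'unknown' while the OSDs are down; the pool name is only an example of an SSD-backed RGW index pool:

    # count PGs currently reported as unknown
    ceph pg dump pgs_brief 2>/dev/null | grep -c unknown
    # list the PGs of one suspect pool
    ceph pg ls-by-pool default.rgw.buckets.index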

Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-14 Thread Troy Ablan
Paul, Thanks for the reply. All of these seemed to fail except for pulling the osdmap from the live cluster. -Troy
-[~:#]- ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/ceph-45/ --file osdmap45
terminate called after throwing an instance of

[ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Troy Ablan
Hi folks, Mimic cluster here, RGW pool with only default zone. I have a persistent error here:
LARGE_OMAP_OBJECTS 1 large omap objects
    1 large objects found in pool 'default.rgw.log'
    Search the cluster log for 'Large omap object found' for more details.
I think I've narrowed it
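The usual way to narrow this down is to find the object the warning refers to and count its omap keys. A sketch, assuming the cluster log is readable on a monitor; the idea that the offender is a usage-log shard (objects typically named usage.*) is an assumption here:

    # the cluster log names the exact object
    grep 'Large omap object found' /var/log/ceph/ceph.log
    # or rank objects in the pool by omap key count
    for o in $(rados -p default.rgw.log ls); do
        printf '%s %s\n' "$(rados -p default.rgw.log listomapkeys "$o" | wc -l)" "$o"
    done | sort -n | tail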

Re: [ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Troy Ablan
Yep, that's on me. I did enable it in the config originally, and I think at the time I thought it might be useful, but I wasn't aware of a sharding caveat given that most of our traffic happens through one rgw user. I think I know what I need to do to fix it now though. Thanks
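For context, the caveat referred to is that the RGW usage log is sharded per user, and the per-user shard count is small by default, so a single busy user concentrates everything into one omap object. A minimal ceph.conf sketch; the option name is a real RGW setting, the value is illustrative, and it only affects entries written after an RGW restart:

    [client.rgw]
    # spread one user's usage-log entries across more omap shards
    rgw usage max user shards = 8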

Re: [ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Troy Ablan
Paul, Apparently never. Appears to (potentially) have every request from the beginning of time (late last year, in my case). In our use case, we don't really need this data (not multi-tenant), so I might simply clear it. But in the case where this were an extremely high transaction cluster
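If clearing it is acceptable, the trim path looks roughly like this; the dates are placeholders, and radosgw-admin usage show/trim are the relevant subcommands:

    # see what has accumulated without printing every entry
    radosgw-admin usage show --show-log-entries=false
    # trim the accumulated usage log up to a cutoff date
    radosgw-admin usage trim --start-date=2018-01-01 --end-date=2019-10-01

Disabling rgw_enable_usage_log afterwards would keep it from growing again if the data is truly not needed.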