Hi Eric,
sadly it took too long from the customer's complaint until it reached my
desk, so there are no RGW client logs left.

We are currently improving our logging situation by moving the logs to
Graylog.

Currently it looks like the GC removed rados objects it should not have
removed, due to this bug (https://tracker.ceph.com/issues/53585), which we
are hitting multiple times a week.
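
For reference, the check mentioned further down in the thread ("compare
bucket index to radoslist" over all buckets) is roughly the following
loop; this is only a minimal sketch, and the jq field paths and the temp
file location are assumptions that may need adapting:

    # for every bucket: take the bucket marker, build the expected head
    # object name for every index entry, and check that radoslist still
    # sees it
    for b in $(radosgw-admin bucket list | jq -r '.[]'); do
        marker=$(radosgw-admin metadata get bucket:"$b" | jq -r '.data.bucket.marker')
        radosgw-admin bucket radoslist --bucket "$b" | sort > /tmp/radoslist.txt
        radosgw-admin bi list --bucket "$b" | jq -r '.[].entry.name' |
        while read -r obj; do
            grep -qxF "${marker}_${obj}" /tmp/radoslist.txt \
                || echo "possible missing head object: $b/$obj"
        done
    done

This only covers plain (non-versioned) index entries and head objects;
missing tail/multipart rados objects would need a deeper comparison.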

On Dec 1, 2022, at 5:26 PM, J. Eric Ivancich <ivanc...@redhat.com> wrote:

> So it seems like a bucket still has objects listed in the bucket index but
> the underlying data objects are no longer there. Since you made reference
> to a customer, I’m guessing the customer does not have direct access to the
> cluster via `rados` commands, so there’s no chance that they could have
> removed the objects directly.
>
> I would look for references to the head objects in the logs….
>
> So if you had bucket “bkt1” and object “obj1”, you could do the following:
>
> 1. Find the marker for the bucket:
>     radosgw-admin metadata get bucket:bkt1
>
> 2. Construct the rados object name of the head object:
>     <marker>_obj1
>
> You’ll end up with something like
> "c44a7aab-e086-43df-befe-ed8151b3a209.4147.1_obj1”.
>
> 3. grep through the logs for the head object and see if you find anything.
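>
> Roughly, as a shell sketch of those three steps (bkt1/obj1, the jq field
> path, and the log location are placeholders):
>
>     bkt=bkt1; obj=obj1
>     # 1. pull the marker out of the bucket metadata
>     marker=$(radosgw-admin metadata get bucket:$bkt | jq -r '.data.bucket.marker')
>     # 2. build the rados name of the head object
>     head_obj="${marker}_${obj}"
>     # 3. search the logs on the RGW hosts for references to it
>     grep -rF -- "$head_obj" /var/log/ceph/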
>
> Eric
> (he/him)
>
> On Nov 22, 2022, at 10:36 AM, Boris Behrens <b...@kervyn.de> wrote:
>
> Does someone have an idea what I can check, or maybe which logs I can turn
> on, to find the cause of the problem? Or at least how I can set up
> monitoring that tells me when this happens?
>
> Currently I go through ALL of the buckets and basically do a "compare
> bucket index to radoslist" for all objects in the bucket index. But I doubt
> this will give me new insights.
>
> On Nov 21, 2022, at 11:55 AM, Boris Behrens <b...@kervyn.de> wrote:
>
> Good day people,
>
> we have a very strange problem with one of our buckets.
> The customer informed us that they had issues with objects: the objects are
> listed, but on a GET they receive a "NoSuchKey" error.
> They did not delete anything from the bucket.
>
> We checked, and `radosgw-admin bucket radoslist --bucket $BUCKET` was
> empty, but all the objects were still listed by `radosgw-admin bi list
> --bucket $BUCKET`.
>
> On the date they noticed it, the cluster was as healthy as it gets in our
> case. There were also no other tasks being performed: no orphan object
> search, no resharding of buckets, no adding or removing of OSDs, no
> rebalancing, and so on.
>
> Some data about the cluster:
>
>   - 275 OSDs (38 SSD OSDs, 6 SSD OSDs reserved for GC, rest 8-16TB
>   spinning HDD) over 13 hosts
>   - one SSD for block.db per 5 HDD OSDs
>   - The SSD OSDs are 100GB LVs on our block.db SSDs and contain all the
>   pools that are not rgw.buckets.data and rgw.buckets.non-ec
>   - The garbage collector is on separate SSD OSDs, which are also 100GB
>   LVs on our block.db SSDs
>   - We had to split the GC off from all other pools, because this bug
>   (https://tracker.ceph.com/issues/53585) led to problems where we
>   received 500 errors from RGW
>   - We have three HAProxy frontends, each pointing to one of our RGW
>   instances (with the other two RGW daemons as fallback)
>   - We have 12 RGW daemons running in total, but only three of them are
>   connected to the outside world (3x only for GC, 3x for some zonegroup
>   restructuring, 3x for a dedicated customer with own pools)
>   - We have multiple zonegroups with one zone each. We only replicate
>   the metadata, so bucket names are unique and users get synced.
>
>
>
> Our ceph.conf:
>
>   - I replaced IP addresses, FSID, and domains
>   - the -old RGW instances are meant to be replaced, because we have a
>   naming conflict (all zonegroups are in one TLD and are separated by
>   subdomain, but the initial RGW is still reachable via the TLD and not
>   via subdomain.tld)
>
>
> [global]
> fsid                  = $FSID
> ms_bind_ipv6          = true
> ms_bind_ipv4          = false
> mon_initial_members   = s3db1, s3db2, s3db3
> mon_host              = [$s3b1-IPv6-public_network],[$s3b2-IPv6-public_network],[$s3b3-IPv6-public_network]
> auth_cluster_required = none
> auth_service_required = none
> auth_client_required  = none
> public_network        = $public_network/64
> #cluster_network       = $cluster_network/64
>
> [mon.s3db1]
> host = s3db1
> mon addr = [$s3b1-IPv6-public_network]:6789
>
> [mon.s3db2]
> host = s3db2
> mon addr = [$s3b2-IPv6-public_network]:6789
>
> [mon.s3db3]
> host = s3db3
> mon addr = [$s3b3-IPv6-public_network]:6789
>
> [client]
> rbd_cache = true
> rbd_cache_size = 64M
> rbd_cache_max_dirty = 48M
> rgw_print_continue = true
> rgw_enable_usage_log = true
> rgw_resolve_cname = true
> rgw_enable_apis = s3,admin,s3website
> rgw_enable_static_website = true
> rgw_trust_forwarded_https = true
>
> [client.gc-s3db1]
> rgw_frontends = "beast endpoint=[::1]:7489"
> #rgw_gc_processor_max_time = 1800
> #rgw_gc_max_concurrent_io = 20
>
> [client.eu-central-1-s3db1]
> rgw_frontends = beast endpoint=[::]:7482
> rgw_region = eu
> rgw_zone = eu-central-1
> rgw_dns_name = name.example.com
> rgw_dns_s3website_name = s3-website-name.example.com
> rgw_thread_pool_size = 512
>
> [client.eu-central-1-s3db1-old]
> rgw_frontends = beast endpoint=[::]:7480
> rgw_region = eu
> rgw_zone = eu-central-1
> rgw_dns_name = example.com
> rgw_dns_s3website_name = eu-central-1.example.com
> rgw_thread_pool_size = 512
>
>
>

-- 
Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
groüen Saal.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
