[ceph-users] Cache pool latency impact
This is regarding cache pools and the impact of flush/evict on client IO latencies. I am seeing a direct impact on client IO latencies (making them worse) when flush/evict is triggered on the cache pool. Under a constant ingress of IOs to the cache pool, write performance is no better than without the cache pool, because it is limited to the speed at which objects can be flushed/evicted to the backend pool. The questions I have are:

1) When a flush/evict is in progress, are writes on the cache pool blocked, either at PG or at object granularity? I do see a blocking flag honored per object context here https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L6841, but most of its callers seem to set the flag to false.

2) Is there any mechanism (that I might have overlooked) to avoid this situation by throttling the flush/evict operations on the fly? If not, shouldn't there be one?

Thanks,
-Pavan.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
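For what it's worth, the only throttling available today appears to be indirect, via the cache-tier thresholds set on the pool. A hedged sketch of the relevant knobs (pool name and values are illustrative, not recommendations):

```shell
# Flush/evict activity on a cache tier is driven by these pool thresholds.
# Lowering the dirty ratio makes flushing start earlier and spread out,
# instead of kicking in all at once when the cache fills up.
ceph osd pool set hot-pool target_max_bytes 100000000000   # cap the tier at ~100 GB
ceph osd pool set hot-pool cache_target_dirty_ratio 0.4    # start flushing at 40% dirty
ceph osd pool set hot-pool cache_target_full_ratio 0.8     # start evicting at 80% full
ceph osd pool set hot-pool cache_min_flush_age 600         # don't flush objects younger than 10 min
ceph osd pool set hot-pool cache_min_evict_age 1800        # don't evict objects younger than 30 min
```

These shape when flushing starts rather than rate-limiting it once it is running, which is exactly the gap question 2 above is about.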
Re: [ceph-users] RGW hammer/master woes
Is anyone else hitting this? Any help on this is much appreciated.

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Pavan Rallabhandi
Sent: Saturday, February 28, 2015 11:42 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] RGW hammer/master woes

I am struggling to get a basic PUT through via the swift client with RGW and Ceph binaries built out of the Hammer/master codebase, whereas the same command on the same setup goes through with RGW and Ceph binaries built out of Giant. Find below an RGW log snippet and the command that was run. Am I missing anything obvious here?

The user info looks like this:

{ "user_id": "johndoe",
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [ { "id": "johndoe:swift", "permissions": "full-control"}],
  "keys": [
    { "user": "johndoe", "access_key": "7B39L2TUQ448LZW4RI3M", "secret_key": "lshKCoacSlbyVc7mBLLr4cJ26fEEM22Tcmp29hT3"},
    { "user": "johndoe:swift", "access_key": "SHZ64EF7CIB4V42I14AH", "secret_key": ""}],
  "swift_keys": [ { "user": "johndoe:swift", "secret_key": "asdf"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1},
  "user_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1},
  "temp_url_keys": []}

The command that was run and the logs:

<snip>
swift -A http://localhost:8989/auth -U johndoe:swift -K asdf upload mycontainer ceph

2015-02-28 23:28:39.272897 7fb610ff9700 1 == starting new request req=0x7fb5f0009990 =
2015-02-28 23:28:39.272913 7fb610ff9700 2 req 0:0.16::PUT /swift/v1/mycontainer/ceph::initializing
2015-02-28 23:28:39.272918 7fb610ff9700 10 host=localhost:8989
2015-02-28 23:28:39.272921 7fb610ff9700 20 subdomain= domain= in_hosted_domain=0
2015-02-28 23:28:39.272938 7fb610ff9700 10 meta HTTP_X_OBJECT_META_MTIME
2015-02-28 23:28:39.272945 7fb610ff9700 10 x x-amz-meta-mtime:1425140933.648506
2015-02-28 23:28:39.272964 7fb610ff9700 10 ver=v1 first=mycontainer req=ceph
2015-02-28 23:28:39.272971 7fb610ff9700 10 s->object=ceph s->bucket=mycontainer
2015-02-28 23:28:39.272976 7fb610ff9700 2 req 0:0.79:swift:PUT /swift/v1/mycontainer/ceph::getting op
2015-02-28 23:28:39.272982 7fb610ff9700 2 req 0:0.85:swift:PUT /swift/v1/mycontainer/ceph:put_obj:authorizing
2015-02-28 23:28:39.273008 7fb610ff9700 10 swift_user=johndoe:swift
2015-02-28 23:28:39.273026 7fb610ff9700 20 build_token token=0d006a6f686e646f653a73776966744436beb90402b13c4f53f35472c2cf0f
2015-02-28 23:28:39.273057 7fb610ff9700 2 req 0:0.000160:swift:PUT /swift/v1/mycontainer/ceph:put_obj:reading permissions
2015-02-28 23:28:39.273100 7fb610ff9700 15 Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>johndoe</ID><DisplayName>John Doe</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>johndoe</ID><DisplayName>John Doe</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
2015-02-28 23:28:39.273114 7fb610ff9700 2 req 0:0.000216:swift:PUT /swift/v1/mycontainer/ceph:put_obj:init op
2015-02-28 23:28:39.273120 7fb610ff9700 2 req 0:0.000223:swift:PUT /swift/v1/mycontainer/ceph:put_obj:verifying op mask
2015-02-28 23:28:39.273123 7fb610ff9700 20 required_mask= 2 user.op_mask=7
2015-02-28 23:28:39.273125 7fb610ff9700 2 req 0:0.000228:swift:PUT /swift/v1/mycontainer/ceph:put_obj:verifying op permissions
2015-02-28 23:28:39.273129 7fb610ff9700 5 Searching permissions for uid=johndoe mask=50
2015-02-28 23:28:39.273131 7fb610ff9700 5 Found permission: 15
2015-02-28 23:28:39.273133 7fb610ff9700 5 Searching permissions for group=1 mask=50
2015-02-28 23:28:39.273135 7fb610ff9700 5 Permissions for group not found
2015-02-28 23:28:39.273136 7fb610ff9700 5 Searching permissions for group=2 mask=50
2015-02-28 23:28:39.273137 7fb610ff9700 5 Permissions for group not found
2015-02-28 23:28:39.273138 7fb610ff9700 5 Getting permissions id=johndoe owner=johndoe perm=2
2015-02-28 23:28:39.273140 7fb610ff9700 10 uid=johndoe requested perm (type)=2, policy perm=2, user_perm_mask=2, acl perm=2
2015-02-28 23:28:39.273143 7fb610ff9700 2 req 0:0.000246:swift:PUT /swift/v1/mycontainer/ceph:put_obj:verifying op params
2015-02-28 23:28:39.273146 7fb610ff9700 2 req 0:0.000249:swift:PUT /swift/v1/mycontainer/ceph:put_obj:executing
2015-02-28 23:28:39.273279 7fb610ff9700 10 x x-amz-meta-mtime:1425140933.648506
2015-02-28 23:28:39.273313 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff41f0 obj=mycontainer:ceph state=0x7fb5f0016940 s->prefetch_data=0
2015-02-28 23:28:39.274354 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff41f0 obj=mycontainer:ceph state=0x7fb5f0016940 s->prefetch_data=0
2015-02-28 23:28:39.274394 7fb610ff9700 10 setting object write_tag=default.14199.0
2015-02-28 23:28:39.274554 7fb610ff9700 20 reading from .rgw:.bucket.meta.mycontainer:default.14199.3
2015-02-28 23:28:39.274574 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff2ef0 obj=.rgw:.bucket.meta.mycontainer:default.14199.3 state=0x7fb5f001db30
Re: [ceph-users] FW: RGW performance issue
If you are on >=hammer builds, you might want to consider the option 'rgw_num_rados_handles', which opens up more handles to the cluster from RGW. This helps in scenarios where you have enough OSDs to drive the cluster bandwidth, which I guess is the case with you.

Thanks,
-Pavan.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ??
Sent: Thursday, November 12, 2015 1:51 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] FW: RGW performance issue

Hello,

We are building a cluster for archive storage. We plan to use Object Storage (RGW) only, no Block Devices or File System. We don't require high speed, so we are using old, weak servers (4 cores, 3 GB RAM) with new, huge but slow HDDs (8 TB, 5900 rpm). We now have 3 storage nodes with 24 OSDs in total and 3 RGW nodes based on the default Civetweb engine.

We get about 50 MB/s "raw" write speed with librados-level benches (measured by rados bench, rados put), and that is quite enough for us. However, RGW performance is dramatically lower: no more than 5 MB/s for file uploads via s3cmd and the swift client. That is too slow for our tasks and abnormally slow compared with the librados write speed, IMHO.

Write speed is the most important for us now; our first goal is to download about 50 TB of archive data from a public cloud to our on-premise storage. We need no less than 20 MB/s of write speed.

Can anybody help me with RGW performance? For those who use RGW: what performance penalty does it give? And where should I look for the cause of the problem? I have checked all the performance counters I know of and haven't found any critical values.

Thanks.
Re: [ceph-users] FW: RGW performance issue
No documentation that I am aware of. The idea is to avoid having multiple RGW instances if there is enough cluster bandwidth that a single RGW instance can drive.

-----Original Message-----
From: Jens Rosenboom [mailto:j.rosenb...@x-ion.de]
Sent: Friday, November 13, 2015 8:58 PM
To: Pavan Rallabhandi
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] FW: RGW performance issue

2015-11-13 5:47 GMT+01:00 Pavan Rallabhandi <pavan.rallabha...@sandisk.com>:
> If you are on >=hammer builds, you might want to consider the option
> of using 'rgw_num_rados_handles', which opens up more handles to the
> cluster from RGW. This would help in scenarios, where you have enough
> number of OSDs to drive the cluster bandwidth, which I guess is the case with you.

Is there any documentation on this option other than the source itself? My google foo failed to come up with anything except pull requests. In particular, it would be interesting to know what a useful target value would be and how to define "enough" OSDs.
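For reference, the option goes in ceph.conf under the RGW client section. A minimal sketch (the section name and value are illustrative; a useful value presumably depends on how many OSDs a single RGW can keep busy, which is exactly the open question above):

```ini
[client.rgw.gateway1]
; each handle is a separate RADOS client connection to the cluster
rgw_num_rados_handles = 8
```

A restart of the RGW daemon is needed for the setting to take effect.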
Re: [ceph-users] rgw bucket deletion woes
To update this thread, this is now fixed via https://github.com/ceph/ceph/pull/8679

Thanks!

From: Ben Hines <bhi...@gmail.com>
Date: Thursday, March 17, 2016 at 4:47 AM
To: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
Cc: Pavan Rallabhandi <prallabha...@walmartlabs.com>, "ceph-us...@ceph.com" <ceph-us...@ceph.com>
Subject: Re: [ceph-users] rgw bucket deletion woes

We would be a big user of this. We delete large buckets often and it takes forever. Though didn't I read that 'object expiration' support is on the near-term RGW roadmap? That may do what we want: we're creating thousands of objects a day, and thousands of objects a day will be expiring, so RGW will need to handle that.

-Ben

On Wed, Mar 16, 2016 at 9:40 AM, Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:
On Tue, Mar 15, 2016 at 11:36 PM, Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote:
> Hi,
>
> I find this to have been discussed here before, but couldn't find any
> solution, hence the mail. In RGW, for a bucket holding objects in the range
> of ~millions, one can find it taking forever to delete the bucket (via
> radosgw-admin). I understand the gc (and its parameters) would reclaim the
> space eventually, but I am looking more at the bucket deletion options that
> could possibly speed up the operation.
>
> I realize that currently rgw_remove_bucket() does it 1000 objects at a time,
> serially. I wanted to know if there is a reason (that I am possibly missing
> and was discussed) for it to be left that way; otherwise I was considering a
> patch to make it work better.
>
There is no real reason. You might want to have a version of that command that doesn't schedule the removal to gc, but rather removes all the object parts by itself. Otherwise, you're just going to flood the gc.
You'll need to iterate through all the objects, and for each object you'll need to remove all of its rados objects (starting with the tail, then the head). Removal of each rados object can be done asynchronously, but you'll need to throttle the operations rather than send everything to the osds at once (which would be impossible anyway, as the objecter will throttle the requests, leading to high memory consumption).

Thanks,
Yehuda
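The scheme Yehuda outlines (per-object removal, tail parts before the head, with a bound on in-flight deletes) can be sketched in Python. This is an illustrative model, not RGW's actual C++ code; `delete_fn` and the part-naming are hypothetical stand-ins for the asynchronous rados delete:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def remove_bucket_objects(objects, delete_fn, max_in_flight=8):
    """Delete every object's rados parts with bounded concurrency.

    `objects` maps an object name to its rados part names, tail parts
    first and the head part last; the head is only removed once all
    tail parts are gone, mirroring the ordering described above.
    """
    throttle = threading.Semaphore(max_in_flight)
    deleted = []
    lock = threading.Lock()

    def _delete(part):
        try:
            delete_fn(part)
            with lock:
                deleted.append(part)
        finally:
            throttle.release()

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for name, parts in objects.items():
            tail, head = parts[:-1], parts[-1]
            futures = []
            for part in tail:            # tail parts go out in parallel,
                throttle.acquire()       # but never more than max_in_flight
                futures.append(pool.submit(_delete, part))
            for f in futures:
                f.result()               # wait for the tail before the head
            throttle.acquire()
            pool.submit(_delete, head).result()
    return deleted
```

The semaphore is what keeps the osds (and the objecter's memory) from being flooded, which is the failure mode described above when everything is sent at once.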
Re: [ceph-users] CBT results parsing/plotting
Thanks Mark, we did look at the tools from Ben England here https://github.com/bengland2/cbt/blob/fio-thru-analysis/tools/parse-fio-result-log.sh but not with much luck; that's partly because we didn't bother to look into the gory details when things didn't work. Thanks for the scripts you have attached, will give them a shot.

Thanks,
-Pavan.

On 7/6/16, 11:05 PM, "ceph-users on behalf of Mark Nelson" <ceph-users-boun...@lists.ceph.com on behalf of mnel...@redhat.com> wrote:

Hi Pavan,

A couple of us have some pretty ugly home-grown scripts for doing this. Basically just bash/awk that loop through the directories and grab the fio bw/latency lines. Eventually the whole way that cbt stores data should be revamped, since the way data gets laid out in a nested directory structure doesn't really scale.

Lately we've been more focused on parsing the fio log output. We've had issues with strange clock skew and multiple concurrent clients not all running at exactly the same time, leading to bias in the results. To get around this, we wrote a tool to parse the output of multiple fio bw/iops/latency log files. It was upstreamed into fio itself recently: https://github.com/axboe/fio/blob/master/tools/fiologparser.py

A potentially better (but still experimental and maybe buggy) version of this is here: https://github.com/markhpc/fio/blob/wip-interval/tools/fiologparser.py

The idea is to fit samples into intervals based on how much they overlap, but if you have multiple files (say from multiple simultaneous clients), only calculate averages based on the time the clients ran at the same time (in case some clients started late or ended early). Ben England is working on speeding this up when fio records every IO rather than interval averages, and we have an intern now looking at this and other methods for getting better logging data out of fio.
Anyway, I've included two scripts that will run through a cbt output directory and either read the fio output file or run this script to parse the output data. Like I said, these are basically terrible (even embarrassing!) and will almost certainly need some slight hacking to work based on the iodepth/readahead/etc. I think Ben England has something a bit nicer, but he's on vacation at the moment, so you'll have to suffer with my horrible hacked-up bash for now. ;)

Mark

On 07/06/2016 03:22 AM, Pavan Rallabhandi wrote:
> Wanted to check if there are any readily available tools that the community
> is aware of/using for parsing/plotting CBT run results. Am particularly
> interested in tools for the CBT librbdfio runs, where the aggregated
> BW/IOPS/Latency reports are generated either in CSV or graphical form.
>
> Thanks!
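The interval-fitting idea Mark describes can be illustrated with a tiny sketch (this is not fiologparser itself, just the core weighting step): each client sample covers a time span, and a window's average weights every sample by how much of it overlaps the window, so clients that started late or ended early don't bias the result.

```python
def windowed_avg(samples, start, end):
    """Average a set of (t_begin, t_end, value) samples over [start, end),
    weighting each sample by the length of its overlap with the window."""
    total = weight = 0.0
    for t0, t1, value in samples:
        overlap = min(end, t1) - max(start, t0)
        if overlap > 0:
            total += value * overlap
            weight += overlap
    return total / weight if weight else 0.0
```

For example, two clients reporting 100 and 200 MB/s over fully overlapping intervals average to 150 MB/s, while a sample entirely outside the window contributes nothing.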
[ceph-users] rgw meta pool
Trying it one more time on the users list.

In our clusters running Jewel 10.2.2, I see the default.rgw.meta pool accumulating a large number of objects, potentially in the same range as the number of objects contained in the data pool.

I understand that the immutable metadata entries are now stored in this heap pool, but I couldn't reason out why the metadata objects are left in this pool even after the actual bucket/object/user deletions. put_entry() promptly stores the same in the heap pool https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880, but I do not see them ever being reaped. Are they left there for some reason?

Thanks,
-Pavan.
Re: [ceph-users] rgw meta pool
Any help on this is much appreciated. I am considering to fix this, given it is confirmed to be an issue, unless I am missing something obvious.

Thanks,
-Pavan.

On 9/8/16, 5:04 PM, "ceph-users on behalf of Pavan Rallabhandi" <ceph-users-boun...@lists.ceph.com on behalf of prallabha...@walmartlabs.com> wrote:

Trying it one more time on the users list.

In our clusters running Jewel 10.2.2, I see the default.rgw.meta pool accumulating a large number of objects, potentially in the same range as the number of objects contained in the data pool.

I understand that the immutable metadata entries are now stored in this heap pool, but I couldn't reason out why the metadata objects are left in this pool even after the actual bucket/object/user deletions. put_entry() promptly stores the same in the heap pool https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880, but I do not see them ever being reaped. Are they left there for some reason?

Thanks,
-Pavan.
Re: [ceph-users] rgw meta pool
Thanks Casey for the reply; more on the tracker.

Thanks!

On 9/9/16, 11:32 PM, "ceph-users on behalf of Casey Bodley" <ceph-users-boun...@lists.ceph.com on behalf of cbod...@redhat.com> wrote:

Hi,

My (limited) understanding of this metadata heap pool is that it's an archive of metadata entries and their versions. According to Yehuda, this was intended to support recovery operations by reverting specific metadata objects to a previous version. But nothing has been implemented so far, and I'm not aware of any plans to do so.

So these objects are being created, but never read or deleted. This was discussed in the rgw standup this morning, and we agreed that this archival should be made optional (and default to off), most likely by assigning an empty pool name to the zone's 'metadata_heap' field. I've created a ticket at http://tracker.ceph.com/issues/17256 to track this issue.

Casey

On 09/09/2016 11:01 AM, Warren Wang - ISD wrote:
> A little extra context here. Currently the metadata pool looks like it is
> on track to exceed the number of objects in the data pool, over time. In a
> brand new cluster, we're already up to almost 2 million in each pool.
>
> NAME                      ID  USED   %USED  MAX AVAIL  OBJECTS
> default.rgw.buckets.data  17  3092G  0.86   345T       2013585
> default.rgw.meta          25  743M   0      172T       1975937
>
> We're concerned this will be unmanageable over time.
>
> Warren Wang
>
> On 9/9/16, 10:54 AM, "ceph-users on behalf of Pavan Rallabhandi"
> <ceph-users-boun...@lists.ceph.com on behalf of
> prallabha...@walmartlabs.com> wrote:
>
>> Any help on this is much appreciated. I am considering to fix this, given
>> it is confirmed to be an issue, unless I am missing something obvious.
>>
>> Thanks,
>> -Pavan.
>>
>> On 9/8/16, 5:04 PM, "ceph-users on behalf of Pavan Rallabhandi"
>> <ceph-users-boun...@lists.ceph.com on behalf of
>> prallabha...@walmartlabs.com> wrote:
>>
>> Trying it one more time on the users list.
>>
>> In our clusters running Jewel 10.2.2, I see the default.rgw.meta pool
>> accumulating a large number of objects, potentially in the same range of
>> objects contained in the data pool.
>>
>> I understand that the immutable metadata entries are now stored in
>> this heap pool, but I couldn't reason out why the metadata objects are
>> left in this pool even after the actual bucket/object/user deletions.
>>
>> put_entry() promptly seems to be storing the same in the heap pool
>> https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880,
>> but I do not see them ever being reaped. Are they left there for some
>> reason?
>>
>> Thanks,
>> -Pavan.
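Once the change from that ticket lands, the archival should be switchable off by blanking the zone's metadata_heap field. A hedged sketch of the procedure (zone name is illustrative; verify the exact field name and flags against your release):

```shell
# Export the zone, empty the metadata heap pool name, and re-import it.
radosgw-admin zone get --rgw-zone=default > zone.json
# In zone.json, change the "metadata_heap" value to "" (an empty string).
radosgw-admin zone set --rgw-zone=default --infile zone.json
# Restart the RGW daemons for the change to take effect.
```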
Re: [ceph-users] Same pg scrubbed over and over (Jewel)
We find this as well in our fresh built Jewel clusters, and seems to happen only with a handful of PGs from couple of pools. Thanks! On 9/21/16, 3:14 PM, "ceph-users on behalf of Tobias Böhm"wrote: Hi, there is an open bug in the tracker: http://tracker.ceph.com/issues/16474 It also suggests restarting OSDs as a workaround. We faced the same issue after increasing the number of PGs in our cluster and restarting OSDs solved it as well. Tobias > Am 21.09.2016 um 11:26 schrieb Dan van der Ster : > > There was a thread about this a few days ago: > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012857.html > And the OP found a workaround. > Looks like a bug though... (by default PGs scrub at most once per day). > > -- dan > > > > On Tue, Sep 20, 2016 at 10:43 PM, Martin Bureau wrote: >> Hello, >> >> >> I noticed that the same pg gets scrubbed repeatedly on our new Jewel >> cluster: >> >> >> Here's an excerpt from log: >> >> >> 2016-09-20 20:36:31.236123 osd.12 10.1.82.82:6820/14316 150514 : cluster >> [INF] 25.3f scrub ok >> 2016-09-20 20:36:32.232918 osd.12 10.1.82.82:6820/14316 150515 : cluster >> [INF] 25.3f scrub starts >> 2016-09-20 20:36:32.236876 osd.12 10.1.82.82:6820/14316 150516 : cluster >> [INF] 25.3f scrub ok >> 2016-09-20 20:36:33.233268 osd.12 10.1.82.82:6820/14316 150517 : cluster >> [INF] 25.3f deep-scrub starts >> 2016-09-20 20:36:33.242258 osd.12 10.1.82.82:6820/14316 150518 : cluster >> [INF] 25.3f deep-scrub ok >> 2016-09-20 20:36:36.233604 osd.12 10.1.82.82:6820/14316 150519 : cluster >> [INF] 25.3f scrub starts >> 2016-09-20 20:36:36.237221 osd.12 10.1.82.82:6820/14316 150520 : cluster >> [INF] 25.3f scrub ok >> 2016-09-20 20:36:41.234490 osd.12 10.1.82.82:6820/14316 150521 : cluster >> [INF] 25.3f deep-scrub starts >> 2016-09-20 20:36:41.243720 osd.12 10.1.82.82:6820/14316 150522 : cluster >> [INF] 25.3f deep-scrub ok >> 2016-09-20 20:36:45.235128 osd.12 10.1.82.82:6820/14316 150523 : cluster >> [INF] 25.3f deep-scrub starts 
>> 2016-09-20 20:36:45.352589 osd.12 10.1.82.82:6820/14316 150524 : cluster >> [INF] 25.3f deep-scrub ok >> 2016-09-20 20:36:47.235310 osd.12 10.1.82.82:6820/14316 150525 : cluster >> [INF] 25.3f scrub starts >> 2016-09-20 20:36:47.239348 osd.12 10.1.82.82:6820/14316 150526 : cluster >> [INF] 25.3f scrub ok >> 2016-09-20 20:36:49.235538 osd.12 10.1.82.82:6820/14316 150527 : cluster >> [INF] 25.3f deep-scrub starts >> 2016-09-20 20:36:49.243121 osd.12 10.1.82.82:6820/14316 150528 : cluster >> [INF] 25.3f deep-scrub ok >> 2016-09-20 20:36:51.235956 osd.12 10.1.82.82:6820/14316 150529 : cluster >> [INF] 25.3f deep-scrub starts >> 2016-09-20 20:36:51.244201 osd.12 10.1.82.82:6820/14316 150530 : cluster >> [INF] 25.3f deep-scrub ok >> 2016-09-20 20:36:52.236076 osd.12 10.1.82.82:6820/14316 150531 : cluster >> [INF] 25.3f scrub starts >> 2016-09-20 20:36:52.239376 osd.12 10.1.82.82:6820/14316 150532 : cluster >> [INF] 25.3f scrub ok >> 2016-09-20 20:36:56.236740 osd.12 10.1.82.82:6820/14316 150533 : cluster >> [INF] 25.3f scrub starts >> >> >> How can I troubleshoot / resolve this ? >> >> >> Regards, >> >> Martin >> >> >> >> ___ >> ceph-users mailing list >> ceph-users@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
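A quick way to confirm the scrub-loop symptom from a cluster log like the one above is to count scrub starts per PG. This is a hypothetical sketch (the log format follows the excerpt in this thread); since by default PGs scrub at most once per day, any PG starting scrubs many times a minute points at the bug:

```python
# Sketch: count scrub/deep-scrub starts per PG in cluster log lines.
# The regex matches the "[INF] <pgid> [deep-]scrub starts" lines shown above.
import re
from collections import defaultdict

SCRUB_RE = re.compile(r"cluster \[INF\] (\S+) (?:deep-)?scrub starts")

def scrub_starts(log_lines):
    """Return {pgid: number of scrub/deep-scrub starts seen}."""
    counts = defaultdict(int)
    for line in log_lines:
        m = SCRUB_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return dict(counts)

log = [
    "2016-09-20 20:36:32.232918 osd.12 10.1.82.82:6820/14316 150515 : cluster [INF] 25.3f scrub starts",
    "2016-09-20 20:36:33.233268 osd.12 10.1.82.82:6820/14316 150517 : cluster [INF] 25.3f deep-scrub starts",
    "2016-09-20 20:36:36.233604 osd.12 10.1.82.82:6820/14316 150519 : cluster [INF] 25.3f scrub starts",
]
print(scrub_starts(log))  # {'25.3f': 3} -- three starts in four seconds is clearly abnormal
```

A PG showing up here more than once or twice per day is a candidate for the restart-OSD workaround mentioned in the tracker issue.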
Re: [ceph-users] OSDs are flapping and marked down wrongly
Regarding the mon_osd_min_down_reports I was looking at it recently, this could provide some insight https://github.com/ceph/ceph/commit/0269a0c17723fd3e22738f7495fe017225b924a4 Thanks! On 10/17/16, 1:36 PM, "ceph-users on behalf of Somnath Roy"wrote: Thanks Piotr, Wido for quick response. @Wido , yes, I thought of trying with those values but I am seeing in the log messages at least 7 osds are reporting failure , so, didn't try. BTW, I found default mon_osd_min_down_reporters is 2 , not 1 and latest master is not having mon_osd_min_down_reports anymore. Not sure what it is replaced with.. @Piotr , yes, your PR really helps , thanks ! Regarding each messenger needs to respond to HB is confusing, I know each thread has a HB timeout value and beyond which it will crash with suicide timeout , are you talking about that ? Regards Somnath -Original Message- From: Piotr Dałek [mailto:bra...@predictor.org.pl] Sent: Monday, October 17, 2016 12:52 AM To: ceph-users@lists.ceph.com; Somnath Roy; ceph-de...@vger.kernel.org Subject: Re: OSDs are flapping and marked down wrongly On Mon, Oct 17, 2016 at 07:16:44AM +, Somnath Roy wrote: > Hi Sage et. al, > > I know this issue is reported number of times in community and attributed to either network issue or unresponsive OSDs. > Recently, we are seeing this issue when our all SSD cluster (Jewel based) is stressed with large block size and very high QD. Lowering QD it is working just fine. > We are seeing the lossy connection message like below and followed by the osd marked down by monitor. > > 2016-10-15 14:30:13.957534 7f6297bff700 0 -- 10.10.10.94:6810/2461767 > submit_message osd_op_reply(1463 > rbd_data.55246b8b4567.d633 [set-alloc-hint object_size > 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890 > ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message > > In the monitor log, I am seeing the osd is reported down by peers and subsequently monitor is marking it down. 
> OSDs is rejoining the cluster after detecting it is marked down wrongly and rebalancing started. This is hurting performance very badly. > > My question is the following. > > 1. I have 40Gb network and I am seeing network is not utilized beyond 10-12Gb/s , no network error is reported. So, why this lossy connection message is coming ? what could go wrong here ? Is it network prioritization issue of smaller ping packets ? I tried to gaze ping round time during this and nothing seems abnormal. > > 2. Nothing is saturated on the OSD side , plenty of network/memory/cpu/disk is left. So, I doubt my osds are unresponsive but yes it is really busy on IO path. Heartbeat is going through separate messenger and threads as well, so, busy op threads should not be making heartbeat delayed. Increasing osd heartbeat grace is only delaying this phenomenon , but, eventually happens after several hours. Anything else we can tune here ? There's a bunch of messengers in OSD code, if ANY of them doesn't respond to heartbeat messages in reasonable time, it is marked as down. Since packets are processed in FIFO/synchronous manner, overloading OSD with large I/O will cause it to time-out on at least one messenger. There was an idea to have heartbeat messages go in the OOB TCP/IP stream and process them asynchronously, but I don't know if that went beyond the idea stage. > 3. What could be the side effect of big grace period ? I understand that detecting a faulty osd will be delayed, anything else ? Yes - stalled ops. Assume that primary OSD goes down and replicas are still alive. Having big grace period will cause all ops going to that OSD to stall until that particular OSD is marked down or resumes normal operation. > 4. I saw if an OSD is crashed, monitor will detect the down osd almost instantaneously and it is not waiting till this grace period. How it is distinguishing between unresponsive and crashed osds ? In which scenario this heartbeat grace is coming into picture ? 
This is the effect of my PR #8558 (https://github.com/ceph/ceph/pull/8558), which causes any OSD that crashes to be immediately marked as down, preventing stalled I/Os in most common cases. The grace period is applied only to unresponsive OSDs (i.e. temporary packet loss, bad cases of network lag, routing issues; in other words, everything that is known to be at least possibly resolvable by itself in a finite amount of time). OSDs that crash and burn won't respond; instead, the OS will respond with ECONNREFUSED, indicating that the OSD is not listening, and in that case the OSD will be immediately marked down. -- Piotr Dałek bra...@predictor.org.pl http://blog.predictor.org.pl
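The distinction Piotr describes can be sketched as a toy model (illustrative only, not Ceph's actual monitor code): an unresponsive OSD is marked down only after enough peers report it within the heartbeat grace window, while a crashed OSD refuses connections and is marked down immediately:

```python
# Toy model of the mailing-list explanation above (NOT the real mon logic):
# - ECONNREFUSED (process gone) -> mark down immediately
# - otherwise, require reports from >= min_reporters distinct peers
#   within the heartbeat grace window
def should_mark_down(reports, now, grace, min_reporters, econnrefused=False):
    """reports: list of (reporter_osd_id, report_time) tuples."""
    if econnrefused:  # crashed OSD: OS refuses the connection outright
        return True
    recent_reporters = {osd for osd, t in reports if now - t <= grace}
    return len(recent_reporters) >= min_reporters

print(should_mark_down([], now=0, grace=20, min_reporters=2, econnrefused=True))  # True
print(should_mark_down([(1, 5), (2, 8)], now=10, grace=20, min_reporters=2))      # True
print(should_mark_down([(1, 5)], now=10, grace=20, min_reporters=2))              # False
```

This also illustrates the trade-off raised earlier in the thread: a bigger grace only delays the down-marking of a genuinely unresponsive OSD (stalling ops on it), while crashes are unaffected by grace entirely.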
Re: [ceph-users] rbd cache writethrough until flush
Thanks for verifying at your end, Jason. It’s pretty weird that the difference is >~10X: with "rbd_cache_writethrough_until_flush = true" I see ~400 IOPS, vs ~6000 IOPS with "rbd_cache_writethrough_until_flush = false". The QEMU cache setting is 'none' for all of the rbd drives. On that note, would older librbd versions (like Hammer) have any caching issues while dealing with Jewel clusters? Thanks, -Pavan. On 10/21/16, 8:17 PM, "Jason Dillaman" wrote: QEMU cache setting for the rbd drive?
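As a back-of-the-envelope sanity check (not a measurement), the >~10X gap above is what you'd expect if the cache really is stuck in writethrough: at iodepth=1 there is one write in flight, so IOPS is roughly the reciprocal of per-op latency:

```python
# At queue depth 1, IOPS ~= 1 / per-op latency. The ~400 vs ~6000 IOPS
# numbers reported above then correspond to per-write latencies of about
# 2.5 ms (writethrough: each write waits on the OSDs) vs ~0.17 ms
# (writeback: acknowledged from the librbd in-memory cache).
def latency_ms(iops):
    return 1000.0 / iops

print(round(latency_ms(400), 2))   # 2.5
print(round(latency_ms(6000), 2))  # 0.17
```

So the two IOPS figures are consistent with disk-backed vs RAM-acknowledged writes, rather than pointing at a throughput problem.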
Re: [ceph-users] rbd cache writethrough until flush
The VM I am testing against was created after the librbd upgrade. I've always had this confusion around this bit in the docs here http://docs.ceph.com/docs/jewel/rbd/qemu-rbd/#qemu-cache-options that: “QEMU’s cache settings override Ceph’s default settings (i.e., settings that are not explicitly set in the Ceph configuration file). If you explicitly set RBD Cache settings in your Ceph configuration file, your Ceph settings override the QEMU cache settings. If you set cache settings on the QEMU command line, the QEMU command line settings override the Ceph configuration file settings.” Thanks, -Pavan. On 10/21/16, 11:31 PM, "Jason Dillaman" <jdill...@redhat.com> wrote: On Fri, Oct 21, 2016 at 1:15 PM, Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: > The QEMU cache is none for all of the rbd drives Hmm -- if you have QEMU cache disabled, I would expect it to disable the librbd cache. I have to ask, but did you (re)start/live-migrate these VMs you are testing against after you upgraded to librbd v10.2.3? -- Jason
Re: [ceph-users] rbd cache writethrough until flush
From my VMs that have Cinder-provisioned volumes, I tried dd / fio (like below) and found the IOPS to be low; even a sync before the runs didn’t help. The same runs with the option set to false yield better results. Both the clients and the cluster are running 10.2.3; perhaps the only difference is that the clients are on Trusty and the cluster is on Xenial. dd if=/dev/zero of=/dev/vdd bs=4K count=1000 oflag=direct fio -name iops -rw=write -bs=4k -direct=1 -runtime=60 -iodepth 1 -filename /dev/vde -ioengine=libaio Thanks, -Pavan. On 10/21/16, 6:15 PM, "Jason Dillaman" <jdill...@redhat.com> wrote: It's in the build and has tests to verify that it is properly being triggered [1]. $ git tag --contains 5498377205523052476ed81aebb2c2e6973f67ef v10.2.3 What are your tests that say otherwise? [1] https://github.com/ceph/ceph/pull/10797/commits/5498377205523052476ed81aebb2c2e6973f67ef On Fri, Oct 21, 2016 at 7:42 AM, Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: > I see the fix for write back cache not getting turned on after flush has made it into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ) but our testing says otherwise. > > The cache is still behaving as if it's writethrough, though the setting is set to true. Wanted to check if it’s still broken in Jewel 10.2.3 or am I missing anything here? > > Thanks, > -Pavan. -- Jason
Re: [ceph-users] rbd cache writethrough until flush
And to add: the host running the Cinder services is on Hammer 0.94.9, but the rest of them, like Nova, are on Jewel 10.2.3. FWIW, the rbd info for one such image looks like this: rbd image 'volume-f6ec45e2-b644-4b58-b6b5-b3a418c3c5b2': size 2048 MB in 512 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.5ebf12d1934e format: 2 features: layering, striping flags: stripe unit: 4096 kB stripe count: 1 Thanks! On 10/21/16, 7:26 PM, "ceph-users on behalf of Pavan Rallabhandi" <ceph-users-boun...@lists.ceph.com on behalf of prallabha...@walmartlabs.com> wrote: Both the clients and the cluster are running 10.2.3, perhaps the only difference is that the clients are on Trusty and the cluster is Xenial.
[ceph-users] rbd cache writethrough until flush
I see the fix for write back cache not getting turned on after flush has made it into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ), but our testing says otherwise. The cache is still behaving as if it's writethrough, though the setting is set to true. Wanted to check if it’s still broken in Jewel 10.2.3, or am I missing anything here? Thanks, -Pavan.
Re: [ceph-users] RadosGW and Openstack Keystone revoked tokens
You may want to look here: http://tracker.ceph.com/issues/19499 and http://tracker.ceph.com/issues/9493 Thanks, From: ceph-users on behalf of "magicb...@gmail.com" Date: Friday, 21 April 2017 1:11 pm To: ceph-users Subject: EXT: Re: [ceph-users] RadosGW and Openstack Keystone revoked tokens Hi any ideas? thanks, J. On 17/04/17 12:50, magicb...@gmail.com wrote: Hi is it possible to configure radosGW (10.2.6-0ubuntu0.16.04.1) to work with Openstack Keystone UUID based tokens? RadosGW is expecting a list of revoked tokens, but that option only works in keystone deployments based on PKI tokens (not uuid/fernet tokens) error log: 2017-04-17 10:40:43.753674 7f38b4fe9700 0 revoked tokens response is missing signed section 2017-04-17 10:40:43.753694 7f38b4fe9700 0 ERROR: keystone revocation processing returned error r=-22 Thanks. J.
Re: [ceph-users] Speeding up garbage collection in RGW
If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in radosgw-admin, which would remove the tail objects as well, without marking them to be GCed. Thanks, On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" wrote: I'm in the process of cleaning up a test that an internal customer did on our production cluster that produced over a billion objects spread across 6000 buckets. So far I've been removing the buckets like this: printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm --bucket={} --purge-objects However, the disk usage doesn't seem to be getting reduced at the same rate the objects are being removed. From what I can tell a large number of the objects are waiting for garbage collection. When I first read the docs it sounded like the garbage collector would only remove 32 objects every hour, but after looking through the logs I'm seeing about 55,000 objects removed every hour. That's about 1.3 million a day, so at this rate it'll take a couple years to clean up the rest! For comparison, the purge-objects command above is removing (but not GC'ing) about 30 million objects a day, so a much more manageable 33 days to finish. I've done some digging and it appears like I should be changing these configuration options: rgw gc max objs (default: 32) rgw gc obj min wait (default: 7200) rgw gc processor max time (default: 3600) rgw gc processor period (default: 3600) A few questions I have though are: Should 'rgw gc processor max time' and 'rgw gc processor period' always be set to the same value? Which would be better, increasing 'rgw gc max objs' to something like 1024, or reducing the 'rgw gc processor' times to something like 60 seconds? Any other guidance on the best way to adjust these values?
Thanks, Bryan
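Bryan's back-of-the-envelope numbers above check out; a quick sketch of the arithmetic makes the GC-vs-purge gap explicit:

```python
# Verify the drain-time estimates from this thread: ~55,000 objects/hour
# through GC vs ~30 million objects/day through `bucket rm --purge-objects`,
# against a backlog of roughly one billion objects.
def days_to_drain(total_objects, objects_per_hour):
    return total_objects / (objects_per_hour * 24.0)

gc_days = days_to_drain(1_000_000_000, 55_000)
print(round(gc_days))  # 758 -- i.e. roughly two years via GC alone

purge_days = days_to_drain(1_000_000_000, 30_000_000 / 24.0)
print(round(purge_days))  # 33 -- matching the "33 days to finish" estimate
```

This is why `--bypass-gc` (or much more aggressive `rgw gc *` settings) is attractive here: the default GC throughput is two orders of magnitude below the deletion rate.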
Re: [ceph-users] Speeding up garbage collection in RGW
I’ve just realized that the option is present in Hammer (0.94.10) as well, you should try that. From: Bryan Stillwell <bstillw...@godaddy.com> Date: Tuesday, 25 July 2017 at 9:45 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW Unfortunately, we're on hammer still (0.94.10). That option looks like it would work better, so maybe it's time to move the upgrade up in the schedule. I've been playing with the various gc options and I haven't seen any speedups like we would need to remove them in a reasonable amount of time. Thanks, Bryan From: Pavan Rallabhandi <prallabha...@walmartlabs.com> Date: Tuesday, July 25, 2017 at 3:00 AM To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> Subject: Re: [ceph-users] Speeding up garbage collection in RGW If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in radosgw-admin, which would remove the tails objects as well without marking them to be GCed. Thanks, On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" <ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com> on behalf of bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>> wrote: I'm in the process of cleaning up a test that an internal customer did on our production cluster that produced over a billion objects spread across 6000 buckets. So far I've been removing the buckets like this: printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket rm --bucket={} --purge-objects However, the disk usage doesn't seem to be getting reduced at the same rate the objects are being removed. From what I can tell a large number of the objects are waiting for garbage collection. 
When I first read the docs it sounded like the garbage collector would only remove 32 objects every hour, but after looking through the logs I'm seeing about 55,000 objects removed every hour. That's about 1.3 million a day, so at this rate it'll take a couple years to clean up the rest! For comparison, the purge-objects command above is removing (but not GC'ing) about 30 million objects a day, so a much more manageable 33 days to finish. I've done some digging and it appears like I should be changing these configuration options: rgw gc max objs (default: 32) rgw gc obj min wait (default: 7200) rgw gc processor max time (default: 3600) rgw gc processor period (default: 3600) A few questions I have though are: Should 'rgw gc processor max time' and 'rgw gc processor period' always be set to the same value? Which would be better, increasing 'rgw gc max objs' to something like 1024, or reducing the 'rgw gc processor' times to something like 60 seconds? Any other guidance on the best way to adjust these values? Thanks, Bryan ___ ceph-users mailing list ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries
Hi Orit, No, we do not use multi-site. Thanks, -Pavan. From: Orit Wasserman <owass...@redhat.com> Date: Tuesday, 20 June 2017 at 12:49 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries Hi Pavan, On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: Trying one more time with ceph-users On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote: On many of our clusters running Jewel (10.2.5+), am running into a strange problem of having stale bucket index entries left over for (some of the) objects deleted. Though it is not reproducible at will, it has been pretty consistent of late and am clueless at this point for the possible reasons to happen so. The symptoms are that the actual delete operation of an object is reported successful in the RGW logs, but a bucket list on the container would still show the deleted object. An attempt to download/stat of the object appropriately results in a failure. No failures are seen in the respective OSDs where the bucket index object is located. And rebuilding the bucket index by running ‘radosgw-admin bucket check –fix’ would fix the issue. Though I could simulate the problem by instrumenting the code, to not to have invoked `complete_del` on the bucket index op https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793, but that call is always seem to be made unless there is a cascading error from the actual delete operation of the object, which doesn’t seem to be the case here. I wanted to know the possible reasons where the bucket index would be left in such limbo, any pointers would be much appreciated. FWIW, we are not sharding the buckets and very recently I’ve seen this happen with buckets having as low as < 10 objects, and we are using swift for all the operations. Do you use multisite? Regards, Orit Thanks, -Pavan. 
[ceph-users] FW: radosgw: stale/leaked bucket index entries
Trying one more time with ceph-users On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote: On many of our clusters running Jewel (10.2.5+), I am running into a strange problem of having stale bucket index entries left over for (some of the) objects deleted. Though it is not reproducible at will, it has been pretty consistent of late, and I am clueless at this point about the possible reasons for it. The symptoms are that the actual delete operation of an object is reported successful in the RGW logs, but a bucket list on the container would still show the deleted object. An attempt to download/stat the object appropriately results in a failure. No failures are seen in the respective OSDs where the bucket index object is located. And rebuilding the bucket index by running ‘radosgw-admin bucket check –fix’ would fix the issue. Though I could simulate the problem by instrumenting the code not to invoke `complete_del` on the bucket index op https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793, that call always seems to be made unless there is a cascading error from the actual delete operation of the object, which doesn’t seem to be the case here. I wanted to know the possible reasons why the bucket index would be left in such limbo; any pointers would be much appreciated. FWIW, we are not sharding the buckets, and very recently I’ve seen this happen with buckets having as low as < 10 objects, and we are using swift for all the operations. Thanks, -Pavan.
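The symptom described above (index lists an object whose head is gone) boils down to a set difference. A hypothetical sketch; real tooling would compare `radosgw-admin bi list` output against a stat of each object, but here both sides are plain sets with illustrative names:

```python
# Sketch of stale-bucket-index detection: entries present in the bucket
# index but with no backing RADOS object. Names are illustrative only.
def stale_index_entries(index_entries, existing_objects):
    """Return index entries that have no corresponding object."""
    return sorted(set(index_entries) - set(existing_objects))

index = {"obj1", "obj2", "deleted-obj"}     # what `bi list` reports
present = {"obj1", "obj2"}                  # what actually stats OK
print(stale_index_entries(index, present))  # ['deleted-obj']
```

`radosgw-admin bucket check --fix` effectively performs this reconciliation and rebuilds the index, which is why it clears the symptom.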
Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries
Looks like I’ve now got a consistent repro scenario, please find the gory details here http://tracker.ceph.com/issues/20380 Thanks! On 20/06/17, 2:04 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote: Hi Orit, No, we do not use multi-site. Thanks, -Pavan. From: Orit Wasserman <owass...@redhat.com> Date: Tuesday, 20 June 2017 at 12:49 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries Hi Pavan, On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: Trying one more time with ceph-users On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote: On many of our clusters running Jewel (10.2.5+), am running into a strange problem of having stale bucket index entries left over for (some of the) objects deleted. Though it is not reproducible at will, it has been pretty consistent of late and am clueless at this point for the possible reasons to happen so. The symptoms are that the actual delete operation of an object is reported successful in the RGW logs, but a bucket list on the container would still show the deleted object. An attempt to download/stat of the object appropriately results in a failure. No failures are seen in the respective OSDs where the bucket index object is located. And rebuilding the bucket index by running ‘radosgw-admin bucket check –fix’ would fix the issue. Though I could simulate the problem by instrumenting the code, to not to have invoked `complete_del` on the bucket index op https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793, but that call is always seem to be made unless there is a cascading error from the actual delete operation of the object, which doesn’t seem to be the case here. I wanted to know the possible reasons where the bucket index would be left in such limbo, any pointers would be much appreciated. 
FWIW, we are not sharding the buckets and very recently I’ve seen this happen with buckets having as low as < 10 objects, and we are using swift for all the operations. Do you use multisite? Regards, Orit Thanks, -Pavan. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bucket reporting content inconsistently
This can possibly be due to these: http://tracker.ceph.com/issues/20380, http://tracker.ceph.com/issues/22555 Thanks, From: ceph-users on behalf of Tom W Date: Saturday, May 12, 2018 at 10:57 AM To: ceph-users Subject: EXT: Re: [ceph-users] Bucket reporting content inconsistently Thanks for posting this for me Sean. Just to update, it seems that despite the bucket checks completing and reporting no issues, the objects continued to show in any tools to list the contents of the bucket. I put together a simple loop to upload a new file to overwrite the existing one and then trigger a delete request through the API, and this seems to be working in lieu of a cleaner solution. We will be upgrading to Luminous in the coming week; I’ll report back if we see any significant change in this issue when we do. Kind Regards, Tom From: ceph-users On Behalf Of Sean Redmond Sent: 11 May 2018 17:15 To: ceph-users Subject: [ceph-users] Bucket reporting content inconsistently Hi all, We have recently upgraded to 10.2.10 in preparation for our upcoming upgrade to Luminous and I have been attempting to remove a bucket.
When using tools such as s3cmd I can see files are listed, verified by the checking with bi list too as shown below: root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bi list --bucket='bucketnamehere' | grep -i "\"idx\":" | wc -l 3278 However, on attempting to delete the bucket and purge the objects , it appears not to be recognised: root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bucket rm --bucket= bucketnamehere --purge-objects 2018-05-10 14:11:05.393851 7f0ab07b6a00 -1 ERROR: unable to remove bucket(2) No such file or directory Checking the bucket stats, it does appear that the bucket is reporting no content, and repeat the above content test there has been no change to the 3278 figure: root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bucket stats --bucket="bucketnamehere" { "bucket": "bucketnamehere", "pool": ".rgw.buckets", "index_pool": ".rgw.buckets.index", "id": "default.28142894.1", "marker": "default.28142894.1", "owner": "16355", "ver": "0#5463545,1#5483686,2#5483484,3#5474696,4#5479052,5#5480339,6#5469460,7#5463976", "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0", "mtime": "2015-12-08 12:42:26.286153", "max_marker": "0#,1#,2#,3#,4#,5#,6#,7#", "usage": { "rgw.main": { "size_kb": 0, "size_kb_actual": 0, "num_objects": 0 }, "rgw.multimeta": { "size_kb": 0, "size_kb_actual": 0, "num_objects": 0 } }, "bucket_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1 } } I have attempted a bucket index check and fix on this, however, it does not appear to have made a difference and no fixes or errors reported from it. Does anyone have any advice on how to proceed with removing this content? At this stage I am not too concerned if the method needed to remove this generates orphans, as we will shortly be running a large orphan scan after our upgrade to Luminous. Cluster health otherwise reports normal. Thanks Sean Redmond NOTICE AND DISCLAIMER This e-mail (including any attachments) is intended for the above-named person(s). 
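The inconsistency Sean shows above (3278 entries from `bi list`, `num_objects: 0` in `bucket stats`) can be expressed as a simple cross-check. The field names below follow the JSON in this thread; the check itself is generic:

```python
# Sketch: flag a bucket whose index entry count disagrees with the object
# count in its stats. Mirrors the `bi list | wc -l` vs `bucket stats`
# comparison done above.
def index_vs_stats_mismatch(bi_list_count, bucket_stats):
    stats_total = sum(cat.get("num_objects", 0)
                      for cat in bucket_stats.get("usage", {}).values())
    return bi_list_count != stats_total

stats = {"usage": {"rgw.main": {"num_objects": 0},
                   "rgw.multimeta": {"num_objects": 0}}}
print(index_vs_stats_mismatch(3278, stats))  # True -> index and stats disagree
```

When the two disagree in this direction (index > stats), `bucket rm --purge-objects` can fail with ENOENT as shown, because the bucket's accounting believes there is nothing left to purge.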
Re: [ceph-users] Jewel PG stuck inconsistent with 3 0-size objects
Yes, that suggestion worked for us, although we hit this when we upgraded to 10.2.10 from 10.2.7. I guess this was fixed via http://tracker.ceph.com/issues/21440 and http://tracker.ceph.com/issues/19404 Thanks, -Pavan. On 7/16/18, 5:07 AM, "ceph-users on behalf of Matthew Vernon" wrote: Hi, Our cluster is running 10.2.9 (from Ubuntu; on 16.04 LTS), and we have a pg that's stuck inconsistent; if I repair it, it logs "failed to pick suitable auth object" (repair log attached, to try and stop my MUA mangling it). We then deep-scrubbed that pg, at which point rados list-inconsistent-obj 67.2e --format=json-pretty produces a bit of output (also attached), which includes that all 3 osds have a zero-sized object e.g. "osd": 1937, "errors": [ "omap_digest_mismatch_oi" ], "size": 0, "omap_digest": "0x45773901", "data_digest": "0x" All 3 osds have different omap_digest, but all have 0 size. Indeed, looking on the OSD disks directly, each object is 0 size (i.e. they are identical). This looks similar to one of the failure modes in http://tracker.ceph.com/issues/21388 where there is a suggestion (comment 19 from David Zafman) to do: rados -p default.rgw.buckets.index setomapval .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key anything [deep-scrub] rados -p default.rgw.buckets.index rmomapkey .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key Is this likely to be the correct approach here, too? And is there an underlying bug in ceph that still needs fixing? :) Thanks, Matthew -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
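The "failed to pick suitable auth object" message can be illustrated with a toy majority-vote model (this is not Ceph's actual auth-selection code, and the second and third OSD ids/digests below are made up; only 1937 and 0x45773901 come from the thread): with three zero-sized replicas that all carry different omap digests, no copy can outvote the others.

```python
# Toy model: pick an authoritative replica only if its omap digest has a
# strict majority among the copies. All-different digests -> no auth object.
from collections import Counter

def pick_auth(replicas):
    """replicas: list of (osd_id, omap_digest). Return an OSD id whose
    digest holds a strict majority, else None."""
    counts = Counter(digest for _, digest in replicas)
    digest, n = counts.most_common(1)[0]
    if n > len(replicas) // 2:
        return next(osd for osd, d in replicas if d == digest)
    return None

# 1937/0x45773901 are from the list-inconsistent-obj output above; the
# other two replicas are hypothetical stand-ins.
replicas = [(1937, "0x45773901"), (2040, "0x11112222"), (2101, "0x9abcdef0")]
print(pick_auth(replicas))  # None -- which is why repair cannot proceed
```

The setomapval/rmomapkey workaround sidesteps this by rewriting the omap on all replicas, giving them matching digests so a subsequent deep-scrub/repair can converge.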
Re: [ceph-users] Backfilling on Luminous
Somehow I missed replying to this; the random split would be enabled for all new PGs or the PGs that get mapped to new OSDs. For existing OSDs, one has to use ceph-objectstore-tool’s apply-layout command, run on each OSD while the OSD is offline. If you want to pre-split PGs using ‘expected_num_objects’ at the time of pool creation, be aware of this fix http://tracker.ceph.com/issues/22530. Thanks, -Pavan. From: David Turner <drakonst...@gmail.com> Date: Tuesday, March 20, 2018 at 1:50 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Backfilling on Luminous @Pavan, I did not know about 'filestore split rand factor'. That looks like it was added in Jewel and I must have missed it. To disable it, would I just set it to 0 and restart all of the OSDs? That isn't an option at the moment, but restarting the OSDs after this backfilling is done is definitely doable. On Mon, Mar 19, 2018 at 5:28 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: David, Pretty sure you must be aware of the filestore random split on existing OSD PGs, `filestore split rand factor`, maybe you could try that too. Thanks, -Pavan. From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David Turner <drakonst...@gmail.com> Date: Monday, March 19, 2018 at 1:36 PM To: Caspar Smit <caspars...@supernas.eu> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Backfilling on Luminous Sorry for being away. I set all of my backfilling to VERY slow settings over the weekend and things have been stable, but incredibly slow (1% recovery from 3% misplaced to 2% all weekend). I'm back on it now and well rested. @Caspar, SWAP isn't being used on these nodes and all of the affected OSDs have been filestore.
@Dan, I think you hit the nail on the head. I didn't know that logging was added for subfolder splitting in Luminous!!! That's AMAZING We are seeing consistent subfolder splitting all across the cluster. The majority of the crashed OSDs have a split started before the crash and then commenting about it in the crash dump. Looks like I just need to write a daemon to watch for splitting to start and throttle recovery until it's done. I had injected the following timeout settings, but it didn't seem to affect anything. I may need to have placed them in ceph.conf and let them pick up the new settings as the OSDs crashed, but I didn't really want different settings on some OSDs in the cluster. osd_op_thread_suicide_timeout=1200 (from 180) osd-recovery-thread-timeout=300 (from 30) My game plan for now is to watch for splitting in the log, increase recovery sleep, decrease osd_recovery_max_active, and watch for splitting to finish before setting them back to more aggressive settings. After this cluster is done backfilling I'm going to do my best to reproduce this scenario in a test environment and open a ticket to hopefully fix why this is happening so detrimentally. On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit <caspars...@supernas.eu<mailto:caspars...@supernas.eu>> wrote: Hi David, What about memory usage? 1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM. If you upgrade to bluestore, memory usage will likely increase. 15x10TB ~~ 150GB RAM needed especially in recovery/backfilling scenario's like these. Kind regards, Caspar 2018-03-15 21:53 GMT+01:00 Dan van der Ster <d...@vanderster.com<mailto:d...@vanderster.com>>: Did you use perf top or iotop to try to identify where the osd is stuck? Did you try increasing the op thread suicide timeout from 180s? Splitting should log at the beginning and end of an op, so it should be clear if it's taking longer than the timeout. .. 
Dan On Mar 15, 2018 9:23 PM, "David Turner" <drakonst...@gmail.com> wrote: I am aware of the filestore splitting happening. I manually split all of the subfolders a couple weeks ago on this cluster, but every time we have backfilling the newly moved PGs have a chance to split before the backfilling is done. When that has happened in the past it causes some blocked requests and will flap OSDs if we don't increase the osd_heartbeat_grace, but it has never consistently killed the OSDs during the task. Maybe that's new in Luminous due to some of the priority and timeout settings. This problem in general seems unrelated to the subfolder splitting, though, since it started to happen very quickly into the backfilling process. Definitely before many of the recently moved PGs would have reached that point.
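Pavan's offline pre-split step from the top of the thread can be sketched roughly as below. Treat the op name `apply-layout-settings` (a guess at the full spelling of the "apply-layout" he mentions), the OSD id, and the pool name as assumptions to check against `ceph-objectstore-tool --help` on your build; the command is only echoed here, not executed:

```shell
# Sketch only: build the offline pre-split invocation for one OSD.
# "apply-layout-settings" is the op name as recalled for Luminous's
# ceph-objectstore-tool; osd id 12 and pool "rbd" are hypothetical.
presplit_cmd() {
  local osd_id=$1 pool=$2
  echo "ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osd_id}" \
       "--op apply-layout-settings --pool ${pool}"
}

# The OSD must be stopped first, e.g.:
#   systemctl stop ceph-osd@12
presplit_cmd 12 rbd
#   systemctl start ceph-osd@12
```

Running the tool per OSD while that OSD is down, then restarting it, matches the "run on each OSD while the OSD is offline" note above.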
Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?
No, it is supported in the next version of Jewel: http://tracker.ceph.com/issues/22658 From: ceph-users on behalf of shadow_lin Date: Sunday, April 1, 2018 at 3:53 AM To: ceph-users Subject: EXT: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor? Hi list, The documentation page for jewel has the filestore_split_rand_factor config, but I can't find the config by using 'ceph daemon osd.x config'. ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
ceph daemon osd.0 config show|grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",
2018-04-01 shadow_lin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
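As context for the split-related settings in that dump: the commonly cited rule for when filestore splits a subfolder is 16 * filestore_split_multiple * abs(filestore_merge_threshold) objects. Both the formula and the merge-threshold default of 10 below are assumptions to verify for your version:

```shell
# Approximate objects-per-subfolder before filestore triggers a split:
#   16 * filestore_split_multiple * abs(filestore_merge_threshold)
split_multiple=4    # "filestore_split_multiple" from the config dump above
merge_threshold=10  # assumed default; check filestore_merge_threshold on your OSDs
echo $(( 16 * split_multiple * merge_threshold ))
```

With the values above this works out to 640 objects per subfolder, which is why clusters tend to hit splits in waves as PGs fill at similar rates.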
Re: [ceph-users] Backfilling on Luminous
David, Pretty sure you must be aware of the filestore random split on existing OSD PGs, `filestore split rand factor`, may be you could try that too. Thanks, -Pavan. From: ceph-userson behalf of David Turner Date: Monday, March 19, 2018 at 1:36 PM To: Caspar Smit Cc: ceph-users Subject: EXT: Re: [ceph-users] Backfilling on Luminous Sorry for being away. I set all of my backfilling to VERY slow settings over the weekend and things have been stable, but incredibly slow (1% recovery from 3% misplaced to 2% all weekend). I'm back on it now and well rested. @Caspar, SWAP isn't being used on these nodes and all of the affected OSDs have been filestore. @Dan, I think you hit the nail on the head. I didn't know that logging was added for subfolder splitting in Luminous!!! That's AMAZING We are seeing consistent subfolder splitting all across the cluster. The majority of the crashed OSDs have a split started before the crash and then commenting about it in the crash dump. Looks like I just need to write a daemon to watch for splitting to start and throttle recovery until it's done. I had injected the following timeout settings, but it didn't seem to affect anything. I may need to have placed them in ceph.conf and let them pick up the new settings as the OSDs crashed, but I didn't really want different settings on some OSDs in the cluster. osd_op_thread_suicide_timeout=1200 (from 180) osd-recovery-thread-timeout=300 (from 30) My game plan for now is to watch for splitting in the log, increase recovery sleep, decrease osd_recovery_max_active, and watch for splitting to finish before setting them back to more aggressive settings. After this cluster is done backfilling I'm going to do my best to reproduce this scenario in a test environment and open a ticket to hopefully fix why this is happening so detrimentally. On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit > wrote: Hi David, What about memory usage? 
1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM. If you upgrade to bluestore, memory usage will likely increase. 15x10TB ~~ 150GB RAM needed especially in recovery/backfilling scenario's like these. Kind regards, Caspar 2018-03-15 21:53 GMT+01:00 Dan van der Ster >: Did you use perf top or iotop to try to identify where the osd is stuck? Did you try increasing the op thread suicide timeout from 180s? Splitting should log at the beginning and end of an op, so it should be clear if it's taking longer than the timeout. .. Dan On Mar 15, 2018 9:23 PM, "David Turner" > wrote: I am aware of the filestore splitting happening. I manually split all of the subfolders a couple weeks ago on this cluster, but every time we have backfilling the newly moved PGs have a chance to split before the backfilling is done. When that has happened in the past it causes some blocked requests and will flap OSDs if we don't increase the osd_heartbeat_grace, but it has never consistently killed the OSDs during the task. Maybe that's new in Luminous due to some of the priority and timeout settings. This problem in general seems unrelated to the subfolder splitting, though, since it started to happen very quickly into the backfilling process. Definitely before many of the recently moved PGs would have reached that point. I've also confirmed that the OSDs that are dying are not just stuck on a process (like it looks like with filestore splitting), but actually segfaulting and restarting. On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster > wrote: Hi, Do you see any split or merge messages in the osd logs? I recall some surprise filestore splitting on a few osds after the luminous upgrade. .. Dan On Mar 15, 2018 6:04 PM, "David Turner" > wrote: I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last week I added 2 nodes to the cluster. The backfilling has been ATROCIOUS. 
I have OSDs consistently [2] segfaulting during recovery. There's no pattern of which OSDs are segfaulting, which hosts have segfaulting OSDs, etc... It's all over the cluster. I have been trying variants on all of these following settings with different levels of success, but I cannot eliminate the blocked requests and segfaulting OSDs. osd_heartbeat_grace, osd_max_backfills, osd_op_thread_suicide_timeout, osd_recovery_max_active, osd_recovery_sleep_hdd, osd_recovery_sleep_hybrid, osd_recovery_thread_timeout, and osd_scrub_during_recovery. Except for setting nobackfilling on the cluster I can't stop OSDs from
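The "daemon to watch for splitting and throttle recovery" idea above can be sketched as a small loop body. The log text matched here ("starting split") is an assumption about the Luminous filestore messages, the sample line is hypothetical, and the knobs are just ones named in this thread; the throttle command is echoed rather than run:

```shell
# Hypothetical log line shape; adjust the pattern to whatever your OSD
# logs actually print when a subfolder split begins.
is_split_start() { case "$1" in *"starting split"*) return 0 ;; *) return 1 ;; esac; }

# Echo (rather than run) a throttle command built from the thread's game
# plan: raise recovery sleep, lower concurrent recovery ops.
throttle_recovery() {
  echo "ceph tell osd.* injectargs" \
       "--osd_recovery_sleep_hdd 0.5 --osd_recovery_max_active 1"
}

sample='2018-03-19 13:01:02.000000 ... filestore ... objects, starting split.'
if is_split_start "$sample"; then
  throttle_recovery
fi
```

A matching "finished split" check would restore the more aggressive settings afterwards.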
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
I see Filestore symbols on the stack, so the bluestore config doesn’t come into play. And the top frame of the stack hints at a RocksDB issue, and there are a whole lot of these too: “2018-09-17 19:23:06.480258 7f1f3d2a7700 2 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636] Cannot find Properties block from file.” It really seems to be something with RocksDB on CentOS. I still think you can try removing “compression=kNoCompression” from the filestore_rocksdb_options, and/or check if rocksdb is expecting snappy to be enabled. Thanks, -Pavan. From: David Turner Date: Thursday, September 27, 2018 at 1:18 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I got pulled away from this for a while. The error in the log is "abort: Corruption: Snappy not supported or corrupted Snappy compressed block contents" and the OSD has 2 settings set to snappy by default, async_compressor_type and bluestore_compression_algorithm. Do either of these settings affect the omap store? On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: Looks like you are running on CentOS, fwiw. We’ve successfully run the conversion commands on Jewel, Ubuntu 16.04. Have a feeling it’s expecting the compression to be enabled; can you try removing “compression=kNoCompression” from the filestore_rocksdb_options? And/or you might want to check if rocksdb is expecting snappy to be enabled. 
From: David Turner <mailto:drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 6:01 PM To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com> Cc: ceph-users <mailto:ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Here's the [1] full log from the time the OSD was started to the end of the crash dump. These logs are so hard to parse. Is there anything useful in them? I did confirm that all perms were set correctly and that the superblock was changed to rocksdb before the first time I attempted to start the OSD with it's new DB. This is on a fully Luminous cluster with [2] the defaults you mentioned. [1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed [2] "filestore_omap_backend": "rocksdb", "filestore_rocksdb_options": "max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression", On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi <mailto:mailto:prallabha...@walmartlabs.com> wrote: I meant the stack trace hints that the superblock still has leveldb in it, have you verified that already? On 9/18/18, 5:27 PM, "Pavan Rallabhandi" <mailto:mailto:prallabha...@walmartlabs.com> wrote: You should be able to set them under the global section and that reminds me, since you are on Luminous already, I guess those values are already the default, you can verify from the admin socket of any OSD. But the stack trace didn’t hint as if the superblock on the OSD is still considering the omap backend to be leveldb and to do with the compression. Thanks, -Pavan. 
From: David Turner <drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 5:07 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Are those settings fine to be global even if not all OSDs on a node have rocksdb as the backend? Or will I need to convert all OSDs on a node at the same time? On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: The steps that were outlined for conversion are correct; have you tried setting some of the relevant ceph conf values too: filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression" filestore_omap_backend = rocksdb Thanks, -Pavan. From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David Turner <drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 4:09 PM To: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I've finally learned enough about the OSD backend to track down this issue to what I believe is the root cause. LevelDB compac
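The conf values Pavan lists would sit in ceph.conf roughly as below. This is a sketch only (written to a temp file for illustration, not to /etc/ceph/ceph.conf); note the thread shows both ";"- and ","-separated variants of filestore_rocksdb_options, and the comma form is what the Luminous defaults quoted in this thread use:

```shell
# Sketch: the omap-conversion settings from this thread as a ceph.conf
# fragment, written to a temp file for illustration only.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[osd]
filestore_omap_backend = rocksdb
filestore_rocksdb_options = max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression
EOF
cat "$conf"
```

Per the follow-up in the thread, these can live under the global (or [osd]) section even when not every OSD on the node has been converted yet.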
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
Not exactly, this feature was supported in Jewel starting 10.2.11, ref https://github.com/ceph/ceph/pull/18010 I thought you mentioned you were using Luminous 12.2.4. From: David Turner Date: Friday, November 2, 2018 at 5:21 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever That makes so much more sense. It seems like RHCS has had this ability since Jewel while it was only put into the community version as of Mimic. So my version of Ceph isn't actually capable of changing the backend db. While digging into the code I did find a bug with the creation of the rocksdb backend created with ceph-kvstore-tool. It doesn't use the ceph defaults or any settings in your config file for the db settings. I'm working on testing a modified version that should take those settings into account. If the fix does work, it will be able to apply to a few other tools as well that can be used to set up the omap backend db. On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: It was Redhat versioned Jewel. But maybe more relevantly, we are on Ubuntu unlike your case. From: David Turner <drakonst...@gmail.com> Date: Friday, November 2, 2018 at 10:24 AM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Pavan, which version of Ceph were you using when you changed your backend to rocksdb? On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: Yeah, I think this is something to do with the CentOS binaries, sorry that I couldn’t be of much help here. Thanks, -Pavan. 
From: David Turner mailto:drakonst...@gmail.com>> Date: Monday, October 1, 2018 at 1:37 PM To: Pavan Rallabhandi mailto:prallabha...@walmartlabs.com>> Cc: ceph-users mailto:ceph-users@lists.ceph.com>> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I tried modifying filestore_rocksdb_options by removing compression=kNoCompression as well as setting it to compression=kSnappyCompression. Leaving it with kNoCompression or removing it results in the same segfault in the previous log. Setting it to kSnappyCompression resulted in [1] this being logged and the OSD just failing to start instead of segfaulting. Is there anything else you would suggest trying before I purge this OSD from the cluster? I'm afraid it might be something with the CentOS binaries. [1] 2018-10-01 17:10:37.134930 7f1415dfcd80 0 set rocksdb option compression = kSnappyCompression 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: Compression type Snappy is not linked with the binary. 2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb : 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount object store 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: (1) Operation not permittedESC[0m On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote: I looked at one of my test clusters running Jewel on Ubuntu 16.04, and interestingly I found this(below) in one of the OSD logs, which is different from your OSD boot log, where none of the compression algorithms seem to be supported. This hints more at how rocksdb was built on CentOS for Ceph. 
2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Compression algorithms supported:
2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00 4 rocksdb: Fast CRC32 supported: 0
On 9/27/18, 2:56 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote: I see Filestore symbols on the stack, so the bluestore config doesn’t come into play. And the top frame of the stack hints at a RocksDB issue, and there are a whole lot of these too: “2018-09-17 19:23:06.480258 7f1f3d2a7700 2 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636] Cannot find Properties block from file.” It really seems to be something with RocksDB on CentOS.
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
It was Redhat versioned Jewel. But may be more relevantly, we are on Ubuntu unlike your case. From: David Turner Date: Friday, November 2, 2018 at 10:24 AM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Pavan, which version of Ceph were you using when you changed your backend to rocksdb? On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi mailto:prallabha...@walmartlabs.com>> wrote: Yeah, I think this is something to do with the CentOS binaries, sorry that I couldn’t be of much help here. Thanks, -Pavan. From: David Turner mailto:drakonst...@gmail.com>> Date: Monday, October 1, 2018 at 1:37 PM To: Pavan Rallabhandi mailto:prallabha...@walmartlabs.com>> Cc: ceph-users mailto:ceph-users@lists.ceph.com>> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I tried modifying filestore_rocksdb_options by removing compression=kNoCompression as well as setting it to compression=kSnappyCompression. Leaving it with kNoCompression or removing it results in the same segfault in the previous log. Setting it to kSnappyCompression resulted in [1] this being logged and the OSD just failing to start instead of segfaulting. Is there anything else you would suggest trying before I purge this OSD from the cluster? I'm afraid it might be something with the CentOS binaries. [1] 2018-10-01 17:10:37.134930 7f1415dfcd80 0 set rocksdb option compression = kSnappyCompression 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: Compression type Snappy is not linked with the binary. 
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb : 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount object store 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: (1) Operation not permittedESC[0m On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote: I looked at one of my test clusters running Jewel on Ubuntu 16.04, and interestingly I found this(below) in one of the OSD logs, which is different from your OSD boot log, where none of the compression algorithms seem to be supported. This hints more at how rocksdb was built on CentOS for Ceph. 2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Compression algorithms supported: 2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Snappy supported: 1 2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Zlib supported: 1 2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Bzip supported: 0 2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: LZ4 supported: 0 2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: ZSTD supported: 0 2018-09-29 17:38:38.629115 7fbd318d4b00 4 rocksdb: Fast CRC32 supported: 0 On 9/27/18, 2:56 PM, "Pavan Rallabhandi" <mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote: I see Filestore symbols on the stack, so the bluestore config doesn’t affect. And the top frame of the stack hints at a RocksDB issue, and there are a whole lot of these too: “2018-09-17 19:23:06.480258 7f1f3d2a7700 2 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636] Cannot find Properties block from file.” It really seems to be something with RocksDB on centOS. 
I still think you can try removing “compression=kNoCompression” from the filestore_rocksdb_options, and/or check if rocksdb is expecting snappy to be enabled. Thanks, -Pavan. From: David Turner <drakonst...@gmail.com> Date: Thursday, September 27, 2018 at 1:18 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I got pulled away from this for a while. The error in the log is "abort: Corruption: Snappy not supported or corrupted Snappy compressed block contents" and the OSD has 2 settings set to snappy by default, async_compressor_type and bluestore_compression_algorithm. Do either of these settings affect the omap store? On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: Looks like you are running on CentOS, fwiw. We’ve successfully run the conversion commands on Jewel, Ubuntu 16.04.
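Given the "Compression type Snappy is not linked with the binary" error above, one crude probe is to check whether the binary dynamically links libsnappy at all. This is a hedged sketch: ceph's bundled rocksdb is often statically linked, so a miss here is inconclusive, and /bin/true stands in for ceph-osd purely so the example runs anywhere:

```shell
# Report whether a binary dynamically links libsnappy. A statically
# linked rocksdb won't show up here, so "not found" is inconclusive.
check_snappy() {
  if ldd "$1" 2>/dev/null | grep -qi snappy; then
    echo "snappy linked"
  else
    echo "snappy not found (or statically linked)"
  fi
}

check_snappy /bin/true   # stand-in; on an OSD node: check_snappy "$(command -v ceph-osd)"
```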
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
Not sure I understand that, but starting Luminous, the filestore omap backend is rocksdb by default. From: David Turner Date: Monday, November 5, 2018 at 3:25 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Digging into the code a little more, that functionality was added in 10.2.11 and 13.0.1, but it still isn't anywhere in the 12.x.x Luminous version. That's so bizarre. On Sat, Nov 3, 2018 at 11:56 AM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: Not exactly, this feature was supported in Jewel starting 10.2.11, ref https://github.com/ceph/ceph/pull/18010 I thought you mentioned you were using Luminous 12.2.4. From: David Turner <drakonst...@gmail.com> Date: Friday, November 2, 2018 at 5:21 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever That makes so much more sense. It seems like RHCS has had this ability since Jewel while it was only put into the community version as of Mimic. So my version of Ceph isn't actually capable of changing the backend db. While digging into the code I did find a bug with the creation of the rocksdb backend created with ceph-kvstore-tool. It doesn't use the ceph defaults or any settings in your config file for the db settings. I'm working on testing a modified version that should take those settings into account. If the fix does work, it will be able to apply to a few other tools as well that can be used to set up the omap backend db. On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: It was Redhat versioned Jewel. But maybe more relevantly, we are on Ubuntu unlike your case. 
From: David Turner mailto:drakonst...@gmail.com>> Date: Friday, November 2, 2018 at 10:24 AM To: Pavan Rallabhandi mailto:prallabha...@walmartlabs.com>> Cc: ceph-users mailto:ceph-users@lists.ceph.com>> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Pavan, which version of Ceph were you using when you changed your backend to rocksdb? On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi mailto:prallabha...@walmartlabs.com>> wrote: Yeah, I think this is something to do with the CentOS binaries, sorry that I couldn’t be of much help here. Thanks, -Pavan. From: David Turner mailto:drakonst...@gmail.com>> Date: Monday, October 1, 2018 at 1:37 PM To: Pavan Rallabhandi mailto:prallabha...@walmartlabs.com>> Cc: ceph-users mailto:ceph-users@lists.ceph.com>> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I tried modifying filestore_rocksdb_options by removing compression=kNoCompression as well as setting it to compression=kSnappyCompression. Leaving it with kNoCompression or removing it results in the same segfault in the previous log. Setting it to kSnappyCompression resulted in [1] this being logged and the OSD just failing to start instead of segfaulting. Is there anything else you would suggest trying before I purge this OSD from the cluster? I'm afraid it might be something with the CentOS binaries. [1] 2018-10-01 17:10:37.134930 7f1415dfcd80 0 set rocksdb option compression = kSnappyCompression 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: Compression type Snappy is not linked with the binary. 
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb : 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount object store 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: (1) Operation not permittedESC[0m On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote: I looked at one of my test clusters running Jewel on Ubuntu 16.04, and interestingly I found this(below) in one of the OSD logs, which is different from your OSD boot log, where none of the compression algorithms seem to be supported. This hints more at how rocksdb was built on CentOS for Ceph. 2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Compression algorithms supported: 2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Snappy supported: 1 2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Zlib supported: 1 2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Bzip supported: 0 2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: LZ4 supported: 0 2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: ZSTD supported: 0 2018-09-29 17:38:38.629115 7fbd318d4b00 4 rocksdb: Fast CRC32 supported: 0 On 9/27/18, 2:56 PM, "Pav
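Since the thread turns on which omap backend an OSD is actually using, the admin-socket check Pavan mentions can be sketched as below. The query command is only echoed, and the JSON shape of its output is an assumption, parsed here from a canned sample:

```shell
# Sketch: confirm an OSD's filestore omap backend via its admin socket.
# The command is echoed (osd.0 is a placeholder); the sample output's
# JSON shape is an assumption about `config get`.
echo "ceph daemon osd.0 config get filestore_omap_backend"

sample='{ "filestore_omap_backend": "rocksdb" }'
backend=$(printf '%s' "$sample" | sed -n 's/.*"filestore_omap_backend": "\([^"]*\)".*/\1/p')
echo "backend=$backend"
```

If this reports leveldb on an OSD whose superblock was supposedly converted, that mismatch is exactly the situation debated earlier in the thread.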
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and interestingly I found this (below) in one of the OSD logs, which is different from your OSD boot log, where none of the compression algorithms seem to be supported. This hints more at how rocksdb was built on CentOS for Ceph.
2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Compression algorithms supported:
2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00 4 rocksdb: Fast CRC32 supported: 0
On 9/27/18, 2:56 PM, "Pavan Rallabhandi" wrote: I see Filestore symbols on the stack, so the bluestore config doesn’t come into play. And the top frame of the stack hints at a RocksDB issue, and there are a whole lot of these too: “2018-09-17 19:23:06.480258 7f1f3d2a7700 2 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636] Cannot find Properties block from file.” It really seems to be something with RocksDB on CentOS. I still think you can try removing “compression=kNoCompression” from the filestore_rocksdb_options, and/or check if rocksdb is expecting snappy to be enabled. Thanks, -Pavan. From: David Turner Date: Thursday, September 27, 2018 at 1:18 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I got pulled away from this for a while. 
The error in the log is "abort: Corruption: Snappy not supported or corrupted Snappy compressed block contents" and the OSD has 2 settings set to snappy by default, async_compressor_type and bluestore_compression_algorithm. Do either of these settings affect the omap store? On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com> wrote: Looks like you are running on CentOS, fwiw. We’ve successfully ran the conversion commands on Jewel, Ubuntu 16.04. Have a feel it’s expecting the compression to be enabled, can you try removing “compression=kNoCompression” from the filestore_rocksdb_options? And/or you might want to check if rocksdb is expecting snappy to be enabled. From: David Turner <mailto:drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 6:01 PM To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com> Cc: ceph-users <mailto:ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Here's the [1] full log from the time the OSD was started to the end of the crash dump. These logs are so hard to parse. Is there anything useful in them? I did confirm that all perms were set correctly and that the superblock was changed to rocksdb before the first time I attempted to start the OSD with it's new DB. This is on a fully Luminous cluster with [2] the defaults you mentioned. [1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed [2] "filestore_omap_backend": "rocksdb", "filestore_rocksdb_options": "max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression", On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi <mailto:mailto:prallabha...@walmartlabs.com> wrote: I meant the stack trace hints that the superblock still has leveldb in it, have you verified that already? 
On 9/18/18, 5:27 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote: You should be able to set them under the global section, and that reminds me, since you are on Luminous already, I guess those values are already the default; you can verify from the admin socket of any OSD. But the stack trace didn’t hint as if the superblock on the OSD is still considering the omap backend to be leveldb and to do with the compression. Thanks, -Pavan. From: David Turner <drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 5:07 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Are those settings fine to be global even if not all OSDs on a node have rocksdb as the backend? Or will I need to convert all OSDs on a node at the same time?
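The "Compression algorithms supported" boot-log lines quoted above can be summarized mechanically. A small sketch, where the sample lines mimic the format quoted in this thread (on a real node you would point the awk at the OSD's actual log file):

```shell
# Summarize rocksdb compression support from OSD boot log lines.
# The sample data below mimics the log format quoted in this thread.
log=$(mktemp)
cat > "$log" <<'EOF'
2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: LZ4 supported: 0
EOF
# Print "<algorithm> yes|no" for each "X supported: N" line.
awk '/supported:/ { print $(NF-2), ($NF == 1 ? "yes" : "no") }' "$log"
```

On the Ubuntu log quoted here this would show Snappy as supported, while the CentOS-built binary under discussion apparently reported everything as 0.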
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
Yeah, I think this is something to do with the CentOS binaries, sorry that I couldn’t be of much help here. Thanks, -Pavan. From: David Turner Date: Monday, October 1, 2018 at 1:37 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I tried modifying filestore_rocksdb_options by removing compression=kNoCompression as well as setting it to compression=kSnappyCompression. Leaving it with kNoCompression or removing it results in the same segfault in the previous log. Setting it to kSnappyCompression resulted in [1] this being logged and the OSD just failing to start instead of segfaulting. Is there anything else you would suggest trying before I purge this OSD from the cluster? I'm afraid it might be something with the CentOS binaries. [1] 2018-10-01 17:10:37.134930 7f1415dfcd80 0 set rocksdb option compression = kSnappyCompression 2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: Compression type Snappy is not linked with the binary. 2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) mount(1723): Error initializing rocksdb : 2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount object store 2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: (1) Operation not permittedESC[0m On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com> wrote: I looked at one of my test clusters running Jewel on Ubuntu 16.04, and interestingly I found this(below) in one of the OSD logs, which is different from your OSD boot log, where none of the compression algorithms seem to be supported. This hints more at how rocksdb was built on CentOS for Ceph. 
2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Compression algorithms supported: 2018-09-29 17:38:38.629112 7fbd318d4b00 4 rocksdb: Snappy supported: 1 2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Zlib supported: 1 2018-09-29 17:38:38.629113 7fbd318d4b00 4 rocksdb: Bzip supported: 0 2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: LZ4 supported: 0 2018-09-29 17:38:38.629114 7fbd318d4b00 4 rocksdb: ZSTD supported: 0 2018-09-29 17:38:38.629115 7fbd318d4b00 4 rocksdb: Fast CRC32 supported: 0 On 9/27/18, 2:56 PM, "Pavan Rallabhandi" <mailto:prallabha...@walmartlabs.com> wrote: I see Filestore symbols on the stack, so the bluestore config doesn’t affect. And the top frame of the stack hints at a RocksDB issue, and there are a whole lot of these too: “2018-09-17 19:23:06.480258 7f1f3d2a7700 2 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636] Cannot find Properties block from file.” It really seems to be something with RocksDB on centOS. I still think you can try removing “compression=kNoCompression” from the filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be enabled. Thanks, -Pavan. From: David Turner <mailto:drakonst...@gmail.com> Date: Thursday, September 27, 2018 at 1:18 PM To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com> Cc: ceph-users <mailto:ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I got pulled away from this for a while. The error in the log is "abort: Corruption: Snappy not supported or corrupted Snappy compressed block contents" and the OSD has 2 settings set to snappy by default, async_compressor_type and bluestore_compression_algorithm. Do either of these settings affect the omap store? 
On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: Looks like you are running on CentOS, fwiw. We’ve successfully run the conversion commands on Jewel, Ubuntu 16.04. Have a feeling it’s expecting the compression to be enabled, can you try removing “compression=kNoCompression” from the filestore_rocksdb_options? And/or you might want to check if rocksdb is expecting snappy to be enabled. From: David Turner <drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 6:01 PM To: Pavan Rallabhandi <prallabha...@walmartlabs.com> Cc: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Here's the [1] full log from the time the OSD was started to the end of the crash dump. These logs are so hard to parse. Is there anything useful in them? I did confirm that all perms were set correctly and that the superblock was changed to ro
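The "Compression algorithms supported" lines quoted above can be pulled out of any OSD's boot log with a small helper (a sketch: the function name and the default log path are mine; adjust the path to your deployment):

```shell
#!/usr/bin/env bash
# Sketch: list which compression algorithms the OSD's embedded rocksdb
# reported at boot, by matching the "rocksdb: <Algo> supported: <0|1>"
# lines shown in the thread.
rocksdb_compression_support() {
    local log="${1:-/var/log/ceph/ceph-osd.0.log}"   # stock path; adjust as needed
    grep -o 'rocksdb: [A-Za-z0-9 ]* supported: [01]' "$log" | sort -u
}
```

A mismatch between clusters here (e.g. `Snappy supported: 1` on Ubuntu vs `0` on CentOS) would confirm the build-difference theory.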
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
Looks like you are running on CentOS, fwiw. We’ve successfully run the conversion commands on Jewel, Ubuntu 16.04. Have a feeling it’s expecting the compression to be enabled, can you try removing “compression=kNoCompression” from the filestore_rocksdb_options? And/or you might want to check if rocksdb is expecting snappy to be enabled. From: David Turner Date: Tuesday, September 18, 2018 at 6:01 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Here's the [1] full log from the time the OSD was started to the end of the crash dump. These logs are so hard to parse. Is there anything useful in them? I did confirm that all perms were set correctly and that the superblock was changed to rocksdb before the first time I attempted to start the OSD with its new DB. This is on a fully Luminous cluster with [2] the defaults you mentioned. [1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed [2] "filestore_omap_backend": "rocksdb", "filestore_rocksdb_options": "max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression", On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: I meant the stack trace hints that the superblock still has leveldb in it, have you verified that already? On 9/18/18, 5:27 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote: You should be able to set them under the global section and that reminds me, since you are on Luminous already, I guess those values are already the default, you can verify from the admin socket of any OSD. But the stack trace didn’t hint as if the superblock on the OSD is still considering the omap backend to be leveldb and to do with the compression. Thanks, -Pavan. 
From: David Turner <mailto:drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 5:07 PM To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com> Cc: ceph-users <mailto:ceph-users@lists.ceph.com> Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Are those settings fine to have be global even if not all OSDs on a node have rocksdb as the backend? Or will I need to convert all OSDs on a node at the same time? On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi <mailto:mailto:prallabha...@walmartlabs.com> wrote: The steps that were outlined for conversion are correct, have you tried setting some the relevant ceph conf values too: filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression" filestore_omap_backend = rocksdb Thanks, -Pavan. From: ceph-users <mailto:mailto:ceph-users-boun...@lists.ceph.com> on behalf of David Turner <mailto:mailto:drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 4:09 PM To: ceph-users <mailto:mailto:ceph-users@lists.ceph.com> Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I've finally learned enough about the OSD backend track down this issue to what I believe is the root cause. LevelDB compaction is the common thread every time we move data around our cluster. I've ruled out PG subfolder splitting, EC doesn't seem to be the root cause of this, and it is cluster wide as opposed to specific hardware. One of the first things I found after digging into leveldb omap compaction was [1] this article with a heading "RocksDB instead of LevelDB" which mentions that leveldb was replaced with rocksdb as the default db backend for filestore OSDs and was even backported to Jewel because of the performance improvements. I figured there must be a way to be able to upgrade an OSD to use rocksdb from leveldb without needing to fully backfill the entire OSD. 
There is [2] this article, but you need to have an active service account with RedHat to access it. I eventually came across [3] this article about optimizing Ceph Object Storage which mentions a resolution to OSDs flapping due to omap compaction to migrate to using rocksdb. It links to the RedHat article, but also has [4] these steps outlined in it. I tried to follow the steps, but the OSD I tested this on was unable to start with [5] this segfault. And then trying to move the OSD back to the original LevelDB omap folder resulted in [6] this in the log. I apologize that all of my logging is with log level 1. If needed I can get some higher log levels. My Ceph version is 12.2.4. Does anyone have any suggestions for how I can update my filestore backend from leveldb to rocksdb? Or if that's the wrong direction and I should be looking elsewhere? Thank you. [1] ht
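For reference, the conf values Pavan suggests would sit in ceph.conf roughly like this (a sketch: the `[osd]` section placement is my assumption — per the follow-ups in the thread they can equally go under `[global]`; the option values are the ones quoted above):

```ini
[osd]
filestore_omap_backend = rocksdb
filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"
```

Note the suggested string uses semicolons between rocksdb options, while the defaults David later reports back use commas; whichever separator your build expects, keep it consistent.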
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
The steps that were outlined for conversion are correct, have you tried setting some of the relevant ceph conf values too: filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression" filestore_omap_backend = rocksdb Thanks, -Pavan. From: ceph-users on behalf of David Turner Date: Tuesday, September 18, 2018 at 4:09 PM To: ceph-users Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I've finally learned enough about the OSD backend to track down this issue to what I believe is the root cause. LevelDB compaction is the common thread every time we move data around our cluster. I've ruled out PG subfolder splitting, EC doesn't seem to be the root cause of this, and it is cluster wide as opposed to specific hardware. One of the first things I found after digging into leveldb omap compaction was [1] this article with a heading "RocksDB instead of LevelDB" which mentions that leveldb was replaced with rocksdb as the default db backend for filestore OSDs and was even backported to Jewel because of the performance improvements. I figured there must be a way to upgrade an OSD to use rocksdb from leveldb without needing to fully backfill the entire OSD. There is [2] this article, but you need to have an active service account with RedHat to access it. I eventually came across [3] this article about optimizing Ceph Object Storage which mentions a resolution to OSDs flapping due to omap compaction to migrate to using rocksdb. It links to the RedHat article, but also has [4] these steps outlined in it. I tried to follow the steps, but the OSD I tested this on was unable to start with [5] this segfault. And then trying to move the OSD back to the original LevelDB omap folder resulted in [6] this in the log. I apologize that all of my logging is with log level 1. If needed I can get some higher log levels. My Ceph version is 12.2.4. 
Does anyone have any suggestions for how I can update my filestore backend from leveldb to rocksdb? Or if that's the wrong direction and I should be looking elsewhere? Thank you. [1] https://ceph.com/community/new-luminous-rados-improvements/ [2] https://access.redhat.com/solutions/3210951 [3] https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize Ceph object storage for production in multisite clouds.pdf [4] ■ Stop the OSD ■ mv /var/lib/ceph/osd/ceph-/current/omap /var/lib/ceph/osd/ceph-/omap.orig ■ ulimit -n 65535 ■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-/omap.orig store-copy /var/lib/ceph/osd/ceph-/current/omap 1 rocksdb ■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-/current/omap --command check ■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-/superblock ■ chown ceph.ceph /var/lib/ceph/osd/ceph-/current/omap -R ■ cd /var/lib/ceph/osd/ceph-; rm -rf omap.orig ■ Start the OSD [5] 2018-09-17 19:23:10.826227 7f1f3f2ab700 -1 abort: Corruption: Snappy not supported or corrupted Snappy compressed block contents 2018-09-17 19:23:10.830525 7f1f3f2ab700 -1 *** Caught signal (Aborted) ** [6] 2018-09-17 19:27:34.010125 7fcdee97cd80 -1 osd.0 0 OSD:init: unable to mount object store 2018-09-17 19:27:34.010131 7fcdee97cd80 -1 ESC[0;31m ** ERROR: osd init failed: (1) Operation not permittedESC[0m 2018-09-17 19:27:54.225941 7f7f03308d80 0 set uid:gid to 167:167 (ceph:ceph) 2018-09-17 19:27:54.225975 7f7f03308d80 0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process (unknown), pid 361535 2018-09-17 19:27:54.231275 7f7f03308d80 0 pidfile_write: ignore empty --pid-file 2018-09-17 19:27:54.260207 7f7f03308d80 0 load: jerasure load: lrc load: isa 2018-09-17 19:27:54.260520 7f7f03308d80 0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342) 2018-09-17 19:27:54.261135 7f7f03308d80 0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342) 2018-09-17 
19:27:54.261750 7f7f03308d80 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2018-09-17 19:27:54.261757 7f7f03308d80 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option 2018-09-17 19:27:54.261758 7f7f03308d80 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice() is disabled via 'filestore splice' config option 2018-09-17 19:27:54.286454 7f7f03308d80 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2018-09-17 19:27:54.286572 7f7f03308d80 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf 2018-09-17 19:27:54.287119 7f7f03308d80 0 filestore(/var/lib/ceph/osd/ceph-0) start omap initiation 2018-09-17 19:27:54.287527 7f7f03308d80 -1
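The [4] conversion steps above read as a single shell function like this (a sketch: the OSD id is elided in the original steps, so the data directory is a parameter of my choosing; error handling is minimal, and it must only run with the OSD daemon stopped):

```shell
#!/usr/bin/env bash
# Sketch of the quoted leveldb -> rocksdb omap conversion steps.
# Assumes ceph-kvstore-tool and ceph-osdomap-tool are on PATH.
# $1 is the OSD data dir, e.g. /var/lib/ceph/osd/ceph-0.
convert_omap_to_rocksdb() {
    local osd_dir="$1"
    # keep the old leveldb store until the copy checks out
    mv "${osd_dir}/current/omap" "${osd_dir}/omap.orig" || return 1
    ulimit -n 65535 || true    # the copy opens many files
    ceph-kvstore-tool leveldb "${osd_dir}/omap.orig" \
        store-copy "${osd_dir}/current/omap" 1 rocksdb || return 1
    ceph-osdomap-tool --omap-path "${osd_dir}/current/omap" --command check || return 1
    # record the new backend in the superblock
    sed -i 's/leveldb/rocksdb/g' "${osd_dir}/superblock"
    chown -R ceph:ceph "${osd_dir}/current/omap"
    rm -rf "${osd_dir}/omap.orig"
}
```

Given the segfault reported above, it may be wiser to stop before the final `rm -rf` and keep omap.orig until the OSD has actually started cleanly.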
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
You should be able to set them under the global section and that reminds me, since you are on Luminous already, I guess those values are already the default, you can verify from the admin socket of any OSD. But the stack trace didn’t hint as if the superblock on the OSD is still considering the omap backend to be leveldb and to do with the compression. Thanks, -Pavan. From: David Turner Date: Tuesday, September 18, 2018 at 5:07 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Are those settings fine to be global even if not all OSDs on a node have rocksdb as the backend? Or will I need to convert all OSDs on a node at the same time? On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: The steps that were outlined for conversion are correct, have you tried setting some of the relevant ceph conf values too: filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression" filestore_omap_backend = rocksdb Thanks, -Pavan. From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David Turner <drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 4:09 PM To: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I've finally learned enough about the OSD backend to track down this issue to what I believe is the root cause. LevelDB compaction is the common thread every time we move data around our cluster. I've ruled out PG subfolder splitting, EC doesn't seem to be the root cause of this, and it is cluster wide as opposed to specific hardware. 
One of the first things I found after digging into leveldb omap compaction was [1] this article with a heading "RocksDB instead of LevelDB" which mentions that leveldb was replaced with rocksdb as the default db backend for filestore OSDs and was even backported to Jewel because of the performance improvements. I figured there must be a way to be able to upgrade an OSD to use rocksdb from leveldb without needing to fully backfill the entire OSD. There is [2] this article, but you need to have an active service account with RedHat to access it. I eventually came across [3] this article about optimizing Ceph Object Storage which mentions a resolution to OSDs flapping due to omap compaction to migrate to using rocksdb. It links to the RedHat article, but also has [4] these steps outlined in it. I tried to follow the steps, but the OSD I tested this on was unable to start with [5] this segfault. And then trying to move the OSD back to the original LevelDB omap folder resulted in [6] this in the log. I apologize that all of my logging is with log level 1. If needed I can get some higher log levels. My Ceph version is 12.2.4. Does anyone have any suggestions for how I can update my filestore backend from leveldb to rocksdb? Or if that's the wrong direction and I should be looking elsewhere? Thank you. 
[1] https://ceph.com/community/new-luminous-rados-improvements/ [2] https://access.redhat.com/solutions/3210951 [3] https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize Ceph object storage for production in multisite clouds.pdf [4] ■ Stop the OSD ■ mv /var/lib/ceph/osd/ceph-/current/omap /var/lib/ceph/osd/ceph-/omap.orig ■ ulimit -n 65535 ■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-/omap.orig store-copy /var/lib/ceph/osd/ceph-/current/omap 1 rocksdb ■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-/current/omap --command check ■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-/superblock ■ chown ceph.ceph /var/lib/ceph/osd/ceph-/current/omap -R ■ cd /var/lib/ceph/osd/ceph-; rm -rf omap.orig ■ Start the OSD [5] 2018-09-17 19:23:10.826227 7f1f3f2ab700 -1 abort: Corruption: Snappy not supported or corrupted Snappy compressed block contents 2018-09-17 19:23:10.830525 7f1f3f2ab700 -1 *** Caught signal (Aborted) ** [6] 2018-09-17 19:27:34.010125 7fcdee97cd80 -1 osd.0 0 OSD:init: unable to mount object store 2018-09-17 19:27:34.010131 7fcdee97cd80 -1 ESC[0;31m ** ERROR: osd init failed: (1) Operation not permittedESC[0m 2018-09-17 19:27:54.225941 7f7f03308d80 0 set uid:gid to 167:167 (ceph:ceph) 2018-09-17 19:27:54.225975 7f7f03308d80 0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process (unknown), pid 361535 2018-09-17 19:27:54.231275 7f7f03308d80 0 pidfile_write: ignore empty --pid-file 2018-09-17 19:27:54.260207 7f7f03308d80 0 load: jerasure load: lrc load: isa 2018-09-17 19:27:54.260520 7f7f03308d80 0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342) 2018-09-17 19:27:54.261135 7f7f03308d80 0 filestore(/var/lib/ceph/osd/ceph-0) bac
Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever
I meant the stack trace hints that the superblock still has leveldb in it, have you verified that already? On 9/18/18, 5:27 PM, "Pavan Rallabhandi" wrote: You should be able to set them under the global section and that reminds me, since you are on Luminous already, I guess those values are already the default, you can verify from the admin socket of any OSD. But the stack trace didn’t hint as if the superblock on the OSD is still considering the omap backend to be leveldb and to do with the compression. Thanks, -Pavan. From: David Turner Date: Tuesday, September 18, 2018 at 5:07 PM To: Pavan Rallabhandi Cc: ceph-users Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever Are those settings fine to be global even if not all OSDs on a node have rocksdb as the backend? Or will I need to convert all OSDs on a node at the same time? On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi <prallabha...@walmartlabs.com> wrote: The steps that were outlined for conversion are correct, have you tried setting some of the relevant ceph conf values too: filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression" filestore_omap_backend = rocksdb Thanks, -Pavan. From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of David Turner <drakonst...@gmail.com> Date: Tuesday, September 18, 2018 at 4:09 PM To: ceph-users <ceph-users@lists.ceph.com> Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever I've finally learned enough about the OSD backend to track down this issue to what I believe is the root cause. LevelDB compaction is the common thread every time we move data around our cluster. I've ruled out PG subfolder splitting, EC doesn't seem to be the root cause of this, and it is cluster wide as opposed to specific hardware. 
One of the first things I found after digging into leveldb omap compaction was [1] this article with a heading "RocksDB instead of LevelDB" which mentions that leveldb was replaced with rocksdb as the default db backend for filestore OSDs and was even backported to Jewel because of the performance improvements. I figured there must be a way to be able to upgrade an OSD to use rocksdb from leveldb without needing to fully backfill the entire OSD. There is [2] this article, but you need to have an active service account with RedHat to access it. I eventually came across [3] this article about optimizing Ceph Object Storage which mentions a resolution to OSDs flapping due to omap compaction to migrate to using rocksdb. It links to the RedHat article, but also has [4] these steps outlined in it. I tried to follow the steps, but the OSD I tested this on was unable to start with [5] this segfault. And then trying to move the OSD back to the original LevelDB omap folder resulted in [6] this in the log. I apologize that all of my logging is with log level 1. If needed I can get some higher log levels. My Ceph version is 12.2.4. Does anyone have any suggestions for how I can update my filestore backend from leveldb to rocksdb? Or if that's the wrong direction and I should be looking elsewhere? Thank you. 
[1] https://ceph.com/community/new-luminous-rados-improvements/ [2] https://access.redhat.com/solutions/3210951 [3] https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize Ceph object storage for production in multisite clouds.pdf [4] ■ Stop the OSD ■ mv /var/lib/ceph/osd/ceph-/current/omap /var/lib/ceph/osd/ceph-/omap.orig ■ ulimit -n 65535 ■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-/omap.orig store-copy /var/lib/ceph/osd/ceph-/current/omap 1 rocksdb ■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-/current/omap --command check ■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-/superblock ■ chown ceph.ceph /var/lib/ceph/osd/ceph-/current/omap -R ■ cd /var/lib/ceph/osd/ceph-; rm -rf omap.orig ■ Start the OSD [5] 2018-09-17 19:23:10.826227 7f1f3f2ab700 -1 abort: Corruption: Snappy not supported or corrupted Snappy compressed block contents 2018-09-17 19:23:10.830525 7f1f3f2ab700 -1 *** Caught signal (Aborted) ** [6] 2018-09-17 19:27:34.010125 7fcdee97cd80 -1 osd.0 0 OSD:init: unable to mount object store 2018-09-17 19:27:34.010131 7fcdee97cd80 -1 ESC[0;31m ** ERROR: osd init failed: (1) Operation not permittedESC[0m 2018-09-17 19:27:54.225941 7f7f03308d80 0 set uid:gid to 167:167 (ceph:ceph) 2018-09-17 19:27:54.225975 7f7f03308d80 0 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e
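Pavan's question — whether the superblock still has leveldb in it — can be answered with a byte-level search (a sketch: the function name is mine, and the path in the comment is the stock filestore layout):

```shell
#!/usr/bin/env bash
# Sketch: the filestore superblock is a small encoded file, but the omap
# backend name is stored in it as a plain string, so a binary-safe grep
# (-a) can pull it out.
omap_backend_in_superblock() {
    grep -ao 'rocksdb\|leveldb' "$1" | head -n 1
}

# Example (stock path; substitute your OSD id):
#   omap_backend_in_superblock /var/lib/ceph/osd/ceph-0/superblock
```

If this still prints `leveldb` after the conversion, the `sed` step in the quoted procedure did not take effect.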
Re: [ceph-users] Large OMAP Objects in default.rgw.log pool
That can happen if you have a lot of objects with swift object expiry (TTL) enabled. You can 'listomapkeys' on these log pool objects and check for the objects that have registered for TTL as omap entries. I know this is the case at least with the Jewel version. Thanks, -Pavan. On 3/7/19, 10:09 PM, "ceph-users on behalf of Brad Hubbard" wrote: On Fri, Mar 8, 2019 at 4:46 AM Samuel Taylor Liston wrote: > > Hello All, > I have recently had 32 large omap objects appear in my default.rgw.log pool. Running luminous 12.2.8. > > Not sure what to think about these. I’ve done a lot of reading about how these normally occur when a bucket needs resharding, but it doesn’t look like my default.rgw.log pool has anything in it, let alone buckets. Here’s some info on the system: > > [root@elm-rgw01 ~]# ceph versions > { > "mon": { > "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 5 > }, > "mgr": { > "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 1 > }, > "osd": { > "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 192 > }, > "mds": {}, > "rgw": { > "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 1 > }, > "overall": { > "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 199 > } > } > [root@elm-rgw01 ~]# ceph osd pool ls > .rgw.root > default.rgw.control > default.rgw.meta > default.rgw.log > default.rgw.buckets.index > default.rgw.buckets.non-ec > default.rgw.buckets.data > [root@elm-rgw01 ~]# ceph health detail > HEALTH_WARN 32 large omap objects > LARGE_OMAP_OBJECTS 32 large omap objects > 32 large objects found in pool 'default.rgw.log' > Search the cluster log for 'Large omap object found' for more details. > > Looking closer at these objects they are all of size 0. Also that pool shows a capacity usage of 0: The size here relates to data size. 
Object map (omap) data is metadata so an object of size 0 can have considerable omap data associated with it (the omap data is stored separately from the object in a key/value database). The large omap warning in health detail output should tell you " "Search the cluster log for 'Large omap object found' for more details." If you do that you should get the names of the specific objects involved. You can then use the rados commands listomapkeys and listomapvals to see the specifics of the omap data. Someone more familiar with rgw can then probably help you out on what purpose they serve. HTH. > > (just a sampling of the 236 objects at size 0) > > [root@elm-mon01 ceph]# for i in `rados ls -p default.rgw.log`; do echo ${i}; rados stat -p default.rgw.log ${i};done > obj_delete_at_hint.78 > default.rgw.log/obj_delete_at_hint.78 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.70 > default.rgw.log/obj_delete_at_hint.70 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.000104 > default.rgw.log/obj_delete_at_hint.000104 mtime 2019-03-07 11:39:20.00, size 0 > obj_delete_at_hint.26 > default.rgw.log/obj_delete_at_hint.26 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.28 > default.rgw.log/obj_delete_at_hint.28 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.40 > default.rgw.log/obj_delete_at_hint.40 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.15 > default.rgw.log/obj_delete_at_hint.15 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.69 > default.rgw.log/obj_delete_at_hint.69 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.95 > default.rgw.log/obj_delete_at_hint.95 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.03 > default.rgw.log/obj_delete_at_hint.03 mtime 2019-03-07 11:39:19.00, size 0 > obj_delete_at_hint.47 > default.rgw.log/obj_delete_at_hint.47 mtime 2019-03-07 11:39:19.00, size 0 > > > [root@elm-mon01 ceph]# rados df > POOL_NAME USEDOBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND 
DEGRADED RD_OPS RD WR_OPS WR > .rgw.root 1.09KiB 4 0 12 0 0 0 14853 9.67MiB 0 0B > default.rgw.buckets.data 444TiB 166829939 0 1000979634 0 0 0 362357590 859TiB 909188749 703TiB
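The listomapkeys check suggested above can be scripted into a per-object key count (a sketch: the helper name is mine, `rados` must be able to reach the cluster, and the loop is the naive form that issues one call per object):

```shell
#!/usr/bin/env bash
# Sketch: print "<object> <omap key count>" for every object in a pool.
# Against the cluster above this would be run with pool default.rgw.log;
# if the swift TTL theory holds, the obj_delete_at_hint.* objects should
# carry the bulk of the keys despite their size of 0.
count_omap_keys() {
    local pool="$1"
    rados -p "${pool}" ls | while read -r obj; do
        printf '%s %s\n' "${obj}" "$(rados -p "${pool}" listomapkeys "${obj}" | wc -l)"
    done
}

# Usage: count_omap_keys default.rgw.log | sort -k2 -rn | head
```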
Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy
Refer to "rgw log http headers" under http://docs.ceph.com/docs/nautilus/radosgw/config-ref/ Or even better in the code https://github.com/ceph/ceph/pull/7639 Thanks, -Pavan. On 4/8/19, 8:32 PM, "ceph-users on behalf of Francois Lafont" wrote: Hi @all, I'm using Ceph rados gateway installed via ceph-ansible with the Nautilus version. The radosgw instances are behind a haproxy which adds these headers (checked via tcpdump): X-Forwarded-Proto: http X-Forwarded-For: 10.111.222.55 where 10.111.222.55 is the IP address of the client. The radosgw use the civetweb http frontend. Currently, it is the IP address of the haproxy itself which is mentioned in the logs. I would like to log the IP address from the X-Forwarded-For HTTP header instead. How to do that? I have tried this option in ceph.conf: rgw_remote_addr_param = X-Forwarded-For It doesn't work but maybe I have read the doc wrongly. Thx in advance for your help. PS: I have also tried the http frontend "beast" but, in this case, no HTTP request seems to be logged. -- François ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
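For what it's worth, rgw_remote_addr_param is usually given the CGI-style variable name rather than the raw header name, so `HTTP_X_FORWARDED_FOR` instead of `X-Forwarded-For`. A sketch of the two settings together (the `[client.rgw]` section name is a placeholder for your gateway's actual section, and both values should be double-checked against the config reference and PR linked above):

```ini
[client.rgw]
# take the client address from the proxy-supplied header, not REMOTE_ADDR
rgw_remote_addr_param = HTTP_X_FORWARDED_FOR
# additionally record the raw header in the rgw ops log (comma-separated list)
rgw_log_http_headers = http_x_forwarded_for
```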