[ceph-users] Cache pool latency impact

2015-01-14 Thread Pavan Rallabhandi
This is regarding cache pools and the impact of the flush/evict on the client 
IO latencies.

Am seeing a direct impact on the client IO latencies (making them worse) when 
flush/evict is triggered on the cache pool. With a constant ingress of IOs on the 
cache pool, the write performance is no better than without the cache pool, because 
it is limited by the speed at which objects can be flushed/evicted to the 
backend pool.

The questions I have are:

1) When the flush/evict is in progress, are the writes on the cache pool 
blocked, either at the PG or at the object granularity? I do see a blocking 
flag honored per object context here 
https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc#L6841, but 
most of the callers seem to set the flag to false.

2) Is there any mechanism (that I might have overlooked) to avoid this 
situation, by throttling the flush/evict operations on the fly? If not, 
shouldn't there be one?
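
For context, the knobs am currently experimenting with are the per-pool 
cache-tiering targets and the tiering agent throttle; a minimal sketch (the pool 
name 'hotpool' and the values are only placeholders, not recommendations):

ceph osd pool set hotpool cache_target_dirty_ratio 0.4
ceph osd pool set hotpool cache_target_full_ratio 0.8
ceph osd pool set hotpool cache_min_flush_age 600
ceph osd pool set hotpool cache_min_evict_age 1800
# limit concurrent flush/evict ops per OSD tiering agent
ceph tell osd.* injectargs '--osd_agent_max_ops 2'

These only spread the flush/evict work out; they are not a feedback throttle tied 
to client latency, which is what question 2) is really about.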

Thanks,
-Pavan.








Re: [ceph-users] RGW hammer/master woes

2015-03-05 Thread Pavan Rallabhandi
Is there anyone else hitting this? Any help on this is much appreciated.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Pavan 
Rallabhandi
Sent: Saturday, February 28, 2015 11:42 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] RGW hammer/master woes


[ceph-users] RGW hammer/master woes

2015-02-28 Thread Pavan Rallabhandi
Am struggling to get through a basic PUT via swift client with RGW and CEPH 
binaries built out of Hammer/Master codebase, whereas the same (command on the 
same setup) is going through with RGW and CEPH binaries built out of Giant.

Find below RGW log snippet and the command that was run. Am I missing anything 
obvious here?

The user info looks like this:

{ "user_id": "johndoe",
  "display_name": "John Doe",
  "email": "j...@example.com",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [
        { "id": "johndoe:swift",
          "permissions": "full-control"}],
  "keys": [
        { "user": "johndoe",
          "access_key": "7B39L2TUQ448LZW4RI3M",
          "secret_key": "lshKCoacSlbyVc7mBLLr4cJ26fEEM22Tcmp29hT3"},
        { "user": "johndoe:swift",
          "access_key": "SHZ64EF7CIB4V42I14AH",
          "secret_key": ""}],
  "swift_keys": [
        { "user": "johndoe:swift",
          "secret_key": "asdf"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "user_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "temp_url_keys": []}


The command that was run and the logs:

snip

swift -A http://localhost:8989/auth -U johndoe:swift -K asdf upload mycontainer 
ceph

2015-02-28 23:28:39.272897 7fb610ff9700  1 == starting new request 
req=0x7fb5f0009990 =
2015-02-28 23:28:39.272913 7fb610ff9700  2 req 0:0.16::PUT 
/swift/v1/mycontainer/ceph::initializing
2015-02-28 23:28:39.272918 7fb610ff9700 10 host=localhost:8989
2015-02-28 23:28:39.272921 7fb610ff9700 20 subdomain= domain= in_hosted_domain=0
2015-02-28 23:28:39.272938 7fb610ff9700 10 meta HTTP_X_OBJECT_META_MTIME
2015-02-28 23:28:39.272945 7fb610ff9700 10 x 
x-amz-meta-mtime:1425140933.648506
2015-02-28 23:28:39.272964 7fb610ff9700 10 ver=v1 first=mycontainer req=ceph
2015-02-28 23:28:39.272971 7fb610ff9700 10 s-object=ceph s-bucket=mycontainer
2015-02-28 23:28:39.272976 7fb610ff9700  2 req 0:0.79:swift:PUT 
/swift/v1/mycontainer/ceph::getting op
2015-02-28 23:28:39.272982 7fb610ff9700  2 req 0:0.85:swift:PUT 
/swift/v1/mycontainer/ceph:put_obj:authorizing
2015-02-28 23:28:39.273008 7fb610ff9700 10 swift_user=johndoe:swift
2015-02-28 23:28:39.273026 7fb610ff9700 20 build_token 
token=0d006a6f686e646f653a73776966744436beb90402b13c4f53f35472c2cf0f
2015-02-28 23:28:39.273057 7fb610ff9700  2 req 0:0.000160:swift:PUT 
/swift/v1/mycontainer/ceph:put_obj:reading permissions
2015-02-28 23:28:39.273100 7fb610ff9700 15 Read AccessControlPolicy<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>johndoe</ID><DisplayName>John Doe</DisplayName></Owner><AccessControlList><Grant><Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="CanonicalUser"><ID>johndoe</ID><DisplayName>John Doe</DisplayName></Grantee><Permission>FULL_CONTROL</Permission></Grant></AccessControlList></AccessControlPolicy>
2015-02-28 23:28:39.273114 7fb610ff9700  2 req 0:0.000216:swift:PUT 
/swift/v1/mycontainer/ceph:put_obj:init op
2015-02-28 23:28:39.273120 7fb610ff9700  2 req 0:0.000223:swift:PUT 
/swift/v1/mycontainer/ceph:put_obj:verifying op mask
2015-02-28 23:28:39.273123 7fb610ff9700 20 required_mask= 2 user.op_mask=7
2015-02-28 23:28:39.273125 7fb610ff9700  2 req 0:0.000228:swift:PUT 
/swift/v1/mycontainer/ceph:put_obj:verifying op permissions
2015-02-28 23:28:39.273129 7fb610ff9700  5 Searching permissions for 
uid=johndoe mask=50
2015-02-28 23:28:39.273131 7fb610ff9700  5 Found permission: 15
2015-02-28 23:28:39.273133 7fb610ff9700  5 Searching permissions for group=1 
mask=50
2015-02-28 23:28:39.273135 7fb610ff9700  5 Permissions for group not found
2015-02-28 23:28:39.273136 7fb610ff9700  5 Searching permissions for group=2 
mask=50
2015-02-28 23:28:39.273137 7fb610ff9700  5 Permissions for group not found
2015-02-28 23:28:39.273138 7fb610ff9700  5 Getting permissions id=johndoe 
owner=johndoe perm=2
2015-02-28 23:28:39.273140 7fb610ff9700 10  uid=johndoe requested perm 
(type)=2, policy perm=2, user_perm_mask=2, acl perm=2
2015-02-28 23:28:39.273143 7fb610ff9700  2 req 0:0.000246:swift:PUT 
/swift/v1/mycontainer/ceph:put_obj:verifying op params
2015-02-28 23:28:39.273146 7fb610ff9700  2 req 0:0.000249:swift:PUT 
/swift/v1/mycontainer/ceph:put_obj:executing
2015-02-28 23:28:39.273279 7fb610ff9700 10 x 
x-amz-meta-mtime:1425140933.648506
2015-02-28 23:28:39.273313 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff41f0 
obj=mycontainer:ceph state=0x7fb5f0016940 s-prefetch_data=0
2015-02-28 23:28:39.274354 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff41f0 
obj=mycontainer:ceph state=0x7fb5f0016940 s-prefetch_data=0
2015-02-28 23:28:39.274394 7fb610ff9700 10 setting object 
write_tag=default.14199.0
2015-02-28 23:28:39.274554 7fb610ff9700 20 reading from 
.rgw:.bucket.meta.mycontainer:default.14199.3
2015-02-28 23:28:39.274574 7fb610ff9700 20 get_obj_state: rctx=0x7fb610ff2ef0 
obj=.rgw:.bucket.meta.mycontainer:default.14199.3 state=0x7fb5f001db30 

Re: [ceph-users] FW: RGW performance issue

2015-11-12 Thread Pavan Rallabhandi
If you are on >=hammer builds, you might want to consider the option of using 
'rgw_num_rados_handles', which opens up more handles to the cluster from RGW. 
This would help in scenarios where you have enough OSDs to drive the cluster 
bandwidth, which I guess is the case with you.
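
For reference, this is a ceph.conf setting on the RGW host and needs the radosgw 
process to be restarted to take effect; a minimal sketch, assuming a gateway 
instance named client.rgw.gateway1 and a purely illustrative value:

[client.rgw.gateway1]
rgw_num_rados_handles = 8

There is no one right number; each extra handle means more concurrent RADOS 
connections (and more memory/threads) per gateway, so increase it gradually and 
measure.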

Thanks,
-Pavan.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of ?? 

Sent: Thursday, November 12, 2015 1:51 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] FW: RGW performance issue

Hello,

We are building a cluster for archive storage. We plan to use Object Storage 
(RGW) only, no block devices or file system. We don't require high speed, so we 
are using old, weak servers (4 cores, 3 GB RAM) with new, huge but slow HDDs 
(8 TB, 5900 rpm). We currently have 3 storage nodes with 24 OSDs in total and 
3 RGW nodes based on the default Civetweb engine.

We get about 50 MB/s of "raw" write speed with librados-level benchmarks 
(measured with rados bench and rados put), which is quite enough for us. However, 
RGW performance is dramatically lower: no more than 5 MB/s for file uploads 
via s3cmd and the swift client. It's too slow for our tasks and it's abnormally 
slow compared with the librados write speed, imho.

Write speed is the most important thing for us now; our first goal is to download 
about 50 TB of archive data from a public cloud to our on-premises storage. We 
need no less than 20 MB/s of write speed.

Can anybody help me with RGW performance? For those who use RGW, what performance 
penalty does it impose? And where should I look for the cause of the problem? I 
have checked all the performance counters I know of and haven't found any 
critical values.

Thanks.


Re: [ceph-users] FW: RGW performance issue

2015-11-13 Thread Pavan Rallabhandi
No documentation that am aware of. The idea is to avoid having multiple RGW 
instances, if there is enough cluster bandwidth that a single RGW instance can 
drive through.

-Original Message-
From: Jens Rosenboom [mailto:j.rosenb...@x-ion.de] 
Sent: Friday, November 13, 2015 8:58 PM
To: Pavan Rallabhandi
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] FW: RGW performance issue

2015-11-13 5:47 GMT+01:00 Pavan Rallabhandi <pavan.rallabha...@sandisk.com>:
> If you are on >=hammer builds, you might want to consider the option 
> of using 'rgw_num_rados_handles', which opens up more handles to the 
> cluster from RGW. This would help in scenarios, where you have enough 
> number of OSDs to drive the cluster bandwidth, which I guess is the case with 
> you.

Is there any documentation on this option other than the source itself? My 
google foo failed to come up with anything except pull requests.

In particular it would be interesting to know what a useful target value would 
be and how to define "enough" OSDs.


Re: [ceph-users] rgw bucket deletion woes

2016-06-15 Thread Pavan Rallabhandi
To update this thread, this is now fixed via 
https://github.com/ceph/ceph/pull/8679

Thanks!

From: Ben Hines <bhi...@gmail.com<mailto:bhi...@gmail.com>>
Date: Thursday, March 17, 2016 at 4:47 AM
To: Yehuda Sadeh-Weinraub <yeh...@redhat.com<mailto:yeh...@redhat.com>>
Cc: Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>, 
"ceph-us...@ceph.com<mailto:ceph-us...@ceph.com>" 
<ceph-us...@ceph.com<mailto:ceph-us...@ceph.com>>
Subject: Re: [ceph-users] rgw bucket deletion woes

We would be a big user of this. We delete large buckets often and it takes 
forever.

Though didn't I read that 'object expiration' support is on the near-term RGW 
roadmap? That may do what we want... we're creating thousands of objects a day, 
and thousands of objects a day will be expiring, so RGW will need to handle it.


-Ben

On Wed, Mar 16, 2016 at 9:40 AM, Yehuda Sadeh-Weinraub 
<yeh...@redhat.com<mailto:yeh...@redhat.com>> wrote:
On Tue, Mar 15, 2016 at 11:36 PM, Pavan Rallabhandi
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote:
> Hi,
>
> I find this to be discussed here before, but couldn't find any solution
> hence the mail. In RGW, for a bucket holding objects in the range of ~
> millions, one can find it to take for ever to delete the bucket(via
> radosgw-admin). I understand the gc(and its parameters) that would reclaim
> the space eventually, but am looking more at the bucket deletion options
> that can possibly speed up the operation.
>
> I realize, currently rgw_remove_bucket(), does it 1000 objects at a time,
> serially. Wanted to know if there is a reason(that am possibly missing and
> discussed) for this to be left that way, otherwise I was considering a
> patch to make it happen better.
>

There is no real reason. You might want to have a version of that
command that doesn't schedule the removal to gc, but rather removes
all the object parts by itself. Otherwise, you're just going to flood
the gc. You'll need to iterate through all the objects, and for each
object you'll need to remove all of its rados objects (starting with
the tail, then the head). Removal of each rados object can be done
asynchronously, but you'll need to throttle the operations, not send
everything to the osds at once (which will be impossible, as the
objecter will throttle the requests anyway, which will lead to a high
memory consumption).

Thanks,
Yehuda



[ceph-users] rgw bucket deletion woes

2016-03-16 Thread Pavan Rallabhandi
Hi,

I find this to have been discussed here before, but couldn't find any solution,
hence the mail. In RGW, for a bucket holding objects in the range of ~millions,
one can find it taking forever to delete the bucket (via radosgw-admin). I
understand the gc (and its parameters) would reclaim the space eventually, but
am looking more at the bucket deletion options that can possibly speed up the
operation.

I realize that currently rgw_remove_bucket() does it 1000 objects at a time,
serially. Wanted to know if there is a reason (that am possibly missing and was
discussed) for it being left that way; otherwise I was considering a patch to
make it better.
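
In the meantime, a purely client-side stopgap (not the in-RGW change am referring 
to) is to issue the per-object deletes with some bounded parallelism; a rough 
sketch, where the bucket name and the concurrency are placeholders and the 'name' 
field is assumed from the radosgw-admin JSON output of this version:

radosgw-admin bucket list --bucket=mybucket --max-entries=1000000 \
  | jq -r '.[].name' \
  | xargs -P 16 -I{} radosgw-admin object rm --bucket=mybucket --object='{}'

This only improves wall-clock time; every delete still goes through the same 
per-object path and the gc.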

Thanks,
-Pavan.



Re: [ceph-users] CBT results parsing/plotting

2016-07-06 Thread Pavan Rallabhandi
Thanks Mark, we did look at the tools from Ben England here 
https://github.com/bengland2/cbt/blob/fio-thru-analysis/tools/parse-fio-result-log.sh
but not with much luck; that's partly because we didn't bother to look into 
the gory details when things didn't work.

Thanks for the scripts you have attached, will give them a shot.

Thanks,
-Pavan.

On 7/6/16, 11:05 PM, "ceph-users on behalf of Mark Nelson" 
<ceph-users-boun...@lists.ceph.com on behalf of mnel...@redhat.com> wrote:

Hi Pavan,

A couple of us have some pretty ugly home-grown scripts for doing this. 
Basically just bash/awk that loop through the directories and grab the 
fio bw/latency lines.  Eventually the whole way that cbt stores data 
should be revamped since the way data gets laid out in a nested 
directory structure doesn't really scale.

Lately we've been more focused on parsing the fio log output.  We've had 
issues with strange clock skew and multiple concurrent clients not all 
running at exactly the same time, leading to bias in the results.  To 
get around this, we wrote a tool to parse the output of multiple fio 
bw/iops/latency log files.

It was upstreamed into fio itself recently here:

https://github.com/axboe/fio/blob/master/tools/fiologparser.py

a potentially better (but still experimental and maybe buggy) version of 
this is here:

https://github.com/markhpc/fio/blob/wip-interval/tools/fiologparser.py

The idea is to fit samples into intervals based on how much they 
overlap, but if you have multiple files (say from multiple simultaneous 
clients) only calculate averages based on the time the clients ran at 
the same time (in case some clients started late or ended early).  Ben 
England is working on speeding this up when fio records every IO rather 
than interval averages and we have an intern now that is looking at this 
and other methods for getting better logging data out of fio.
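
For what it's worth, the per-client logs that fiologparser consumes come from 
fio's own logging options; a minimal sketch of one client invocation (the 
device, log prefix and 1-second averaging interval are just examples):

fio --name=iops --rw=write --bs=4k --direct=1 --iodepth=1 \
    --ioengine=libaio --runtime=60 --filename=/dev/vde \
    --write_bw_log=client1 --write_iops_log=client1 \
    --write_lat_log=client1 --log_avg_msec=1000

Each client then emits client1_bw*.log / client1_iops*.log / client1_lat*.log 
files (exact naming varies by fio version) that can be fed to the parser together.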

Anyway, I've included two scripts that will run through a cbt output 
directory and either read the fio output file or run this script to 
parse the output data.  Like I said, these are basically terrible (even 
embarrassing!) and will almost certainly need some slight hacking to 
work based on the iodepth/readahead/etc.  I think Ben England has 
something a bit nicer but he's on vacation at the moment so you'll have 
to suffer with my horrible hacked up bash for now. ;)

Mark

On 07/06/2016 03:22 AM, Pavan Rallabhandi wrote:
> Wanted to check if there are any readily available tools that the community 
> is aware of/using for parsing/plotting CBT run results. Am particularly 
> interested in tools for the CBT librbdfio runs, where in the aggregated 
> BW/IOPS/Latency reports are generated either in CSV/graphical.
>
> Thanks!
>
>




[ceph-users] CBT results parsing/plotting

2016-07-06 Thread Pavan Rallabhandi
Wanted to check if there are any readily available tools that the community is 
aware of/using for parsing/plotting CBT run results. Am particularly interested 
in tools for the CBT librbdfio runs, wherein the aggregated BW/IOPS/latency 
reports are generated in either CSV or graphical form.

Thanks!



[ceph-users] rgw meta pool

2016-09-08 Thread Pavan Rallabhandi
Trying it one more time on the users list.

In our clusters running Jewel 10.2.2, I see the default.rgw.meta pool running into 
a large number of objects, potentially in the same range as the number of objects 
contained in the data pool. 

I understand that the immutable metadata entries are now stored in this heap 
pool, but I couldn’t reason out why the metadata objects are left in this pool 
even after the actual bucket/object/user deletions.

The put_entry() promptly seems to be storing the same in the heap pool 
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880, but I do 
not see them to be reaped ever. Are they left there for some reason?

Thanks,
-Pavan.




Re: [ceph-users] rgw meta pool

2016-09-09 Thread Pavan Rallabhandi
Any help on this is much appreciated; am considering fixing this, given it's 
a confirmed issue, unless am missing something obvious. 

Thanks,
-Pavan.

On 9/8/16, 5:04 PM, "ceph-users on behalf of Pavan Rallabhandi" 
<ceph-users-boun...@lists.ceph.com on behalf of prallabha...@walmartlabs.com> 
wrote:

Trying it one more time on the users list.

In our clusters running Jewel 10.2.2, I see default.rgw.meta pool running 
into large number of objects, potentially to the same range of objects 
contained in the data pool. 

I understand that the immutable metadata entries are now stored in this 
heap pool, but I couldn’t reason out why the metadata objects are left in this 
pool even after the actual bucket/object/user deletions.

The put_entry() promptly seems to be storing the same in the heap pool 
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880, but I do 
not see them to be reaped ever. Are they left there for some reason?

Thanks,
-Pavan.







Re: [ceph-users] rgw meta pool

2016-09-10 Thread Pavan Rallabhandi
Thanks Casey for the reply, more on the tracker.

Thanks!

On 9/9/16, 11:32 PM, "ceph-users on behalf of Casey Bodley" 
<ceph-users-boun...@lists.ceph.com on behalf of cbod...@redhat.com> wrote:

Hi,

My (limited) understanding of this metadata heap pool is that it's an 
archive of metadata entries and their versions. According to Yehuda, 
this was intended to support recovery operations by reverting specific 
metadata objects to a previous version. But nothing has been implemented 
so far, and I'm not aware of any plans to do so. So these objects are 
being created, but never read or deleted.

This was discussed in the rgw standup this morning, and we agreed that 
this archival should be made optional (and default to off), most likely 
by assigning an empty pool name to the zone's 'metadata_heap' field. 
I've created a ticket at http://tracker.ceph.com/issues/17256 to track 
this issue.
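
Until that lands, the manual route would presumably be editing the zone so that 
'metadata_heap' no longer points at a pool; a rough sketch, assuming the default 
zone and that the field can simply be blanked out:

radosgw-admin zone get --rgw-zone=default > zone.json
# edit zone.json and set:  "metadata_heap": ""
radosgw-admin zone set --rgw-zone=default --infile zone.json

The objects already accumulated in the heap pool would still have to be cleaned 
up separately.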

Casey


On 09/09/2016 11:01 AM, Warren Wang - ISD wrote:
> A little extra context here. Currently the metadata pool looks like it is
> on track to exceed the number of objects in the data pool, over time. In a
> brand new cluster, we're already up to almost 2 million in each pool.
>
>  NAME                      ID  USED   %USED  MAX AVAIL  OBJECTS
>  default.rgw.buckets.data  17  3092G  0.86   345T       2013585
>  default.rgw.meta          25  743M   0      172T       1975937
>
> We're concerned this will be unmanageable over time.
>
> Warren Wang
>
    >
> On 9/9/16, 10:54 AM, "ceph-users on behalf of Pavan Rallabhandi"
> <ceph-users-boun...@lists.ceph.com on behalf of
> prallabha...@walmartlabs.com> wrote:
>
>> Any help on this is much appreciated, am considering to fix this, given
>> it's confirmed an issue unless am missing something obvious.
>>
>> Thanks,
>> -Pavan.
>>
>> On 9/8/16, 5:04 PM, "ceph-users on behalf of Pavan Rallabhandi"
>> <ceph-users-boun...@lists.ceph.com on behalf of
>> prallabha...@walmartlabs.com> wrote:
>>
>> Trying it one more time on the users list.
>> 
>> In our clusters running Jewel 10.2.2, I see default.rgw.meta pool
>> running into large number of objects, potentially to the same range of
>> objects contained in the data pool.
>> 
>> I understand that the immutable metadata entries are now stored in
>> this heap pool, but I couldn't reason out why the metadata objects are
>> left in this pool even after the actual bucket/object/user deletions.
>> 
>> The put_entry() promptly seems to be storing the same in the heap
>> pool
>> https://github.com/ceph/ceph/blob/master/src/rgw/rgw_metadata.cc#L880,
>> but I do not see them to be reaped ever. Are they left there for some
>> reason?
>> 
>> Thanks,
>> -Pavan.
>> 
>> 
>> 
>>
>>






Re: [ceph-users] Same pg scrubbed over and over (Jewel)

2016-09-21 Thread Pavan Rallabhandi
We see this as well in our freshly built Jewel clusters, and it seems to happen 
only with a handful of PGs from a couple of pools.

Thanks!

On 9/21/16, 3:14 PM, "ceph-users on behalf of Tobias Böhm" 
 wrote:

Hi,

there is an open bug in the tracker: http://tracker.ceph.com/issues/16474

It also suggests restarting OSDs as a workaround. We faced the same issue 
after increasing the number of PGs in our cluster and restarting OSDs solved it 
as well.
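
For anyone else hitting this, the workaround boils down to finding the acting 
OSDs for the affected PG and bouncing them; a minimal sketch for the 25.3f 
example further down this thread (systemd-based install assumed):

ceph pg map 25.3f                 # shows the up/acting OSD set, e.g. [12,3,7]
systemctl restart ceph-osd@12     # restart the primary, repeat for others if needed

After the restart the PG should drop back to the normal scrub schedule.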

Tobias

> Am 21.09.2016 um 11:26 schrieb Dan van der Ster :
> 
> There was a thread about this a few days ago:
> 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012857.html
> And the OP found a workaround.
> Looks like a bug though... (by default PGs scrub at most once per day).
> 
> -- dan
> 
> 
> 
> On Tue, Sep 20, 2016 at 10:43 PM, Martin Bureau  
wrote:
>> Hello,
>> 
>> 
>> I noticed that the same pg gets scrubbed repeatedly on our new Jewel
>> cluster:
>> 
>> 
>> Here's an excerpt from log:
>> 
>> 
>> 2016-09-20 20:36:31.236123 osd.12 10.1.82.82:6820/14316 150514 : cluster
>> [INF] 25.3f scrub ok
>> 2016-09-20 20:36:32.232918 osd.12 10.1.82.82:6820/14316 150515 : cluster
>> [INF] 25.3f scrub starts
>> 2016-09-20 20:36:32.236876 osd.12 10.1.82.82:6820/14316 150516 : cluster
>> [INF] 25.3f scrub ok
>> 2016-09-20 20:36:33.233268 osd.12 10.1.82.82:6820/14316 150517 : cluster
>> [INF] 25.3f deep-scrub starts
>> 2016-09-20 20:36:33.242258 osd.12 10.1.82.82:6820/14316 150518 : cluster
>> [INF] 25.3f deep-scrub ok
>> 2016-09-20 20:36:36.233604 osd.12 10.1.82.82:6820/14316 150519 : cluster
>> [INF] 25.3f scrub starts
>> 2016-09-20 20:36:36.237221 osd.12 10.1.82.82:6820/14316 150520 : cluster
>> [INF] 25.3f scrub ok
>> 2016-09-20 20:36:41.234490 osd.12 10.1.82.82:6820/14316 150521 : cluster
>> [INF] 25.3f deep-scrub starts
>> 2016-09-20 20:36:41.243720 osd.12 10.1.82.82:6820/14316 150522 : cluster
>> [INF] 25.3f deep-scrub ok
>> 2016-09-20 20:36:45.235128 osd.12 10.1.82.82:6820/14316 150523 : cluster
>> [INF] 25.3f deep-scrub starts
>> 2016-09-20 20:36:45.352589 osd.12 10.1.82.82:6820/14316 150524 : cluster
>> [INF] 25.3f deep-scrub ok
>> 2016-09-20 20:36:47.235310 osd.12 10.1.82.82:6820/14316 150525 : cluster
>> [INF] 25.3f scrub starts
>> 2016-09-20 20:36:47.239348 osd.12 10.1.82.82:6820/14316 150526 : cluster
>> [INF] 25.3f scrub ok
>> 2016-09-20 20:36:49.235538 osd.12 10.1.82.82:6820/14316 150527 : cluster
>> [INF] 25.3f deep-scrub starts
>> 2016-09-20 20:36:49.243121 osd.12 10.1.82.82:6820/14316 150528 : cluster
>> [INF] 25.3f deep-scrub ok
>> 2016-09-20 20:36:51.235956 osd.12 10.1.82.82:6820/14316 150529 : cluster
>> [INF] 25.3f deep-scrub starts
>> 2016-09-20 20:36:51.244201 osd.12 10.1.82.82:6820/14316 150530 : cluster
>> [INF] 25.3f deep-scrub ok
>> 2016-09-20 20:36:52.236076 osd.12 10.1.82.82:6820/14316 150531 : cluster
>> [INF] 25.3f scrub starts
>> 2016-09-20 20:36:52.239376 osd.12 10.1.82.82:6820/14316 150532 : cluster
>> [INF] 25.3f scrub ok
>> 2016-09-20 20:36:56.236740 osd.12 10.1.82.82:6820/14316 150533 : cluster
>> [INF] 25.3f scrub starts
>> 
>> 
>> How can I troubleshoot / resolve this ?
>> 
>> 
>> Regards,
>> 
>> Martin
>> 
>> 
>> 
>> 






Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Pavan Rallabhandi
Regarding the mon_osd_min_down_reports I was looking at it recently, this could 
provide some insight 
https://github.com/ceph/ceph/commit/0269a0c17723fd3e22738f7495fe017225b924a4
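
To see what a given cluster is actually running with, the relevant values can be 
pulled off the daemons' admin sockets; a small sketch (the daemon names are 
placeholders):

ceph daemon mon.a config show | egrep 'mon_osd_min_down|heartbeat_grace'
ceph daemon osd.0 config show | egrep 'osd_heartbeat_grace|osd_heartbeat_interval'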

Thanks!

On 10/17/16, 1:36 PM, "ceph-users on behalf of Somnath Roy" 
 wrote:

Thanks Piotr, Wido for quick response.

@Wido, yes, I thought of trying those values, but I am seeing in the 
log messages that at least 7 OSDs are reporting the failure, so I didn't try. BTW, I 
found the default mon_osd_min_down_reporters is 2, not 1, and latest master no 
longer has mon_osd_min_down_reports. Not sure what it was replaced with.

@Piotr, yes, your PR really helps, thanks! Regarding each messenger needing to 
respond to heartbeats, that part is confusing; I know each thread has a HB timeout 
value beyond which it will crash with a suicide timeout. Are you talking about 
that?

Regards
Somnath

-Original Message-
From: Piotr Dałek [mailto:bra...@predictor.org.pl]
Sent: Monday, October 17, 2016 12:52 AM
To: ceph-users@lists.ceph.com; Somnath Roy; ceph-de...@vger.kernel.org
Subject: Re: OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 07:16:44AM +, Somnath Roy wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed 
to either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based) 
 is stressed with large block size and very high QD. Lowering QD it is working 
just fine.
> We are seeing the lossy connection message like below and followed by the 
osd marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
> submit_message osd_op_reply(1463
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly 
and rebalancing started. This is hurting performance very badly.
>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
10-12Gb/s , no network error is reported. So, why this lossy connection message 
is coming ? what could go wrong here ? Is it network prioritization issue of 
smaller ping packets ? I tried to gaze ping round time during this and nothing 
seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of 
network/memory/cpu/disk is left. So, I doubt my osds are unresponsive but yes 
it is really busy on IO path. Heartbeat is going through separate messenger and 
threads as well, so, busy op threads should not be making heartbeat delayed. 
Increasing osd heartbeat grace is only delaying this phenomenon , but, 
eventually happens after several hours. Anything else we can tune here ?

There's a bunch of messengers in OSD code, if ANY of them doesn't respond 
to heartbeat messages in reasonable time, it is marked as down. Since packets 
are processed in FIFO/synchronous manner, overloading OSD with large I/O will 
cause it to time-out on at least one messenger.
There was an idea to have heartbeat messages go in the OOB TCP/IP stream 
and process them asynchronously, but I don't know if that went beyond the idea 
stage.

> 3. What could be the side effect of big grace period ? I understand that 
detecting a faulty osd will be delayed, anything else ?

Yes - stalled ops. Assume that primary OSD goes down and replicas are still 
alive. Having big grace period will cause all ops going to that OSD to stall 
until that particular OSD is marked down or resumes normal operation.

> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
instantaneously and it is not waiting till this grace period. How it is 
distinguishing between unresponsive and crashed osds ? In which scenario this 
heartbeat grace is coming into picture ?

This is the effect of my PR#8558 (https://github.com/ceph/ceph/pull/8558)
which causes any OSD that crash to be immediately marked as down, 
preventing stalled I/Os in most common cases. Grace period is only applied to 
unresponsive OSDs (i.e. temporary packet loss, bad cases of network lags, 
routing issues, in other words, everything that is known to be at least 
possible to resolve by itself in a finite amount of time). OSDs that crash and 
burn won't respond - instead, OS will respond with ECONNREFUSED indicating that 
OSD is not listening and in that case the OSD will be immediately marked down.

--
Piotr Dałek
bra...@predictor.org.pl
http://blog.predictor.org.pl
   

Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
Thanks for verifying at your end Jason.

It’s pretty weird that the difference is >~10X, with 
"rbd_cache_writethrough_until_flush = true" I see ~400 IOPS vs with 
"rbd_cache_writethrough_until_flush = false" I see them to be ~6000 IOPS. 

The QEMU cache is none for all of the rbd drives. On that note, would older 
librbd versions (like Hammer) have any caching issues while dealing with Jewel 
clusters?
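
One way to rule out config plumbing is to check what librbd actually loaded for 
the guest via a client admin socket; a rough sketch, assuming an asok has been 
enabled for the client (the path below is only an example):

# in ceph.conf on the hypervisor, under [client]:
#   admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
ceph --admin-daemon /var/run/ceph/ceph-client.cinder.12345.140093.asok \
    config show | grep rbd_cache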

Thanks,
-Pavan.

On 10/21/16, 8:17 PM, "Jason Dillaman"  wrote:

QEMU cache setting for the rbd drive?





Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
The VM am testing against is created after the librbd upgrade.

Always had this confusion around this bit in the docs here  
http://docs.ceph.com/docs/jewel/rbd/qemu-rbd/#qemu-cache-options that:

“QEMU’s cache settings override Ceph’s default settings (i.e., settings that 
are not explicitly set in the Ceph configuration file). If you explicitly set 
RBD Cache settings in your Ceph configuration file, your Ceph settings override 
the QEMU cache settings. If you set cache settings on the QEMU command line, 
the QEMU command line settings override the Ceph configuration file settings.”

Thanks,
-Pavan.

On 10/21/16, 11:31 PM, "Jason Dillaman" <jdill...@redhat.com> wrote:

On Fri, Oct 21, 2016 at 1:15 PM, Pavan Rallabhandi
<prallabha...@walmartlabs.com> wrote:
> The QEMU cache is none for all of the rbd drives

Hmm -- if you have QEMU cache disabled, I would expect it to disable
the librbd cache.

I have to ask, but did you (re)start/live-migrate these VMs you are
testing against after you upgraded to librbd v10.2.3?

-- 
Jason





Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
From my VMs that have Cinder-provisioned volumes, I tried dd / fio (like below) 
and found the IOPS to be low; even a sync before the runs didn't help. The same 
runs with the option set to false yield better results.

Both the clients and the cluster are running 10.2.3, perhaps the only 
difference is that the clients are on Trusty and the cluster is Xenial.

dd if=/dev/zero of=/dev/vdd bs=4K count=1000 oflag=direct

fio -name iops -rw=write -bs=4k -direct=1  -runtime=60 -iodepth 1 -filename 
/dev/vde -ioengine=libaio 

Thanks,
-Pavan.

On 10/21/16, 6:15 PM, "Jason Dillaman" <jdill...@redhat.com> wrote:

It's in the build and has tests to verify that it is properly being
triggered [1].

$ git tag --contains 5498377205523052476ed81aebb2c2e6973f67ef
v10.2.3

What are your tests that say otherwise?

[1] 
https://github.com/ceph/ceph/pull/10797/commits/5498377205523052476ed81aebb2c2e6973f67ef

On Fri, Oct 21, 2016 at 7:42 AM, Pavan Rallabhandi
<prallabha...@walmartlabs.com> wrote:
> I see the fix for write back cache not getting turned on after flush has 
made into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ) but our testing 
says otherwise.
>
> The cache is still behaving as if its writethrough, though the setting is 
set to true. Wanted to check if it’s still broken in Jewel 10.2.3 or am I 
missing anything here?
>
> Thanks,
> -Pavan.
>



-- 
Jason




Re: [ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
And to add, the host running the Cinder services is on Hammer 0.94.9, but the 
rest of them, like Nova, are on Jewel 10.2.3.

FWIW, the rbd info for one such image looks like this:

rbd image 'volume-f6ec45e2-b644-4b58-b6b5-b3a418c3c5b2':
size 2048 MB in 512 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.5ebf12d1934e
format: 2
features: layering, striping
flags: 
stripe unit: 4096 kB
stripe count: 1

Thanks!

On 10/21/16, 7:26 PM, "ceph-users on behalf of Pavan Rallabhandi" 
<ceph-users-boun...@lists.ceph.com on behalf of prallabha...@walmartlabs.com> 
wrote:

Both the clients and the cluster are running 10.2.3, perhaps the only 
difference is that the clients are on Trusty and the cluster is Xenial.





[ceph-users] rbd cache writethrough until flush

2016-10-21 Thread Pavan Rallabhandi
I see the fix for writeback cache not getting turned on after flush has made it 
into Jewel 10.2.3 ( http://tracker.ceph.com/issues/17080 ), but our testing says 
otherwise. 

The cache is still behaving as if it's writethrough, though the setting is set 
to true. Wanted to check if it's still broken in Jewel 10.2.3, or am I missing 
anything here?

Thanks,
-Pavan.



Re: [ceph-users] RadosGW and Openstack Keystone revoked tokens

2017-04-21 Thread Pavan Rallabhandi
You may want to look here http://tracker.ceph.com/issues/19499 and 
http://tracker.ceph.com/issues/9493
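
As a first diagnostic step (not a fix), it can help to dump the Keystone-related 
options the running gateway actually has, to see how the revocation polling is 
configured; a minimal sketch, with the daemon name being a placeholder:

ceph daemon client.rgw.gateway1 config show | grep -i keystone

The rgw_keystone_revocation_interval option controls that polling; the two 
trackers above discuss making it unnecessary for UUID/Fernet tokens.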

Thanks,

From: ceph-users  on behalf of 
"magicb...@gmail.com" 
Date: Friday, 21 April 2017 1:11 pm
To: ceph-users 
Subject: EXT: Re: [ceph-users] RadosGW and Openstack Keystone revoked tokens

Hi

any ideas?

thanks,
J.
On 17/04/17 12:50, magicb...@gmail.com wrote:
Hi

is it possible to configure RadosGW (10.2.6-0ubuntu0.16.04.1) to work with 
Openstack Keystone UUID-based tokens? RadosGW is expecting a list of revoked 
tokens, but that option only works in Keystone deployments based on PKI tokens 
(not UUID/Fernet tokens)

error log: 
2017-04-17 10:40:43.753674 7f38b4fe9700 0 revoked tokens response is missing 
signed section
2017-04-17 10:40:43.753694 7f38b4fe9700 0 ERROR: keystone revocation processing 
returned error r=-22

Thanks.
J. 






Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Pavan Rallabhandi
If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tail objects as well, without marking 
them to be GCed.
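
A hedged example of what that looks like in practice (the bucket name is a 
placeholder):

radosgw-admin bucket rm --bucket=bucket1 --purge-objects --bypass-gc

The delete then removes the tail objects inline, so the command itself runs 
longer, but the space comes back without waiting on gc.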

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
 wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan






Re: [ceph-users] Speeding up garbage collection in RGW

2017-07-25 Thread Pavan Rallabhandi
I’ve just realized that the option is present in Hammer (0.94.10) as well, you 
should try that.

From: Bryan Stillwell <bstillw...@godaddy.com>
Date: Tuesday, 25 July 2017 at 9:45 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>, 
"ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Speeding up garbage collection in RGW

Unfortunately, we're on hammer still (0.94.10).  That option looks like it 
would work better, so maybe it's time to move the upgrade up in the schedule.

I've been playing with the various gc options and I haven't seen any speedups 
like we would need to remove them in a reasonable amount of time.

Thanks,
Bryan

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Date: Tuesday, July 25, 2017 at 3:00 AM
To: Bryan Stillwell <bstillw...@godaddy.com>, "ceph-users@lists.ceph.com" 
<ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Speeding up garbage collection in RGW

If your Ceph version is >=Jewel, you can try the `--bypass-gc` option in 
radosgw-admin, which would remove the tails objects as well without marking 
them to be GCed.

Thanks,

On 25/07/17, 1:34 AM, "ceph-users on behalf of Bryan Stillwell" 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com> on 
behalf of bstillw...@godaddy.com<mailto:bstillw...@godaddy.com>> wrote:

I'm in the process of cleaning up a test that an internal customer did on 
our production cluster that produced over a billion objects spread across 6000 
buckets.  So far I've been removing the buckets like this:

printf %s\\n bucket{1..6000} | xargs -I{} -n 1 -P 32 radosgw-admin bucket 
rm --bucket={} --purge-objects

However, the disk usage doesn't seem to be getting reduced at the same rate 
the objects are being removed.  From what I can tell a large number of the 
objects are waiting for garbage collection.

When I first read the docs it sounded like the garbage collector would only 
remove 32 objects every hour, but after looking through the logs I'm seeing 
about 55,000 objects removed every hour.  That's about 1.3 million a day, so at 
this rate it'll take a couple years to clean up the rest!  For comparison, the 
purge-objects command above is removing (but not GC'ing) about 30 million 
objects a day, so a much more manageable 33 days to finish.

I've done some digging and it appears like I should be changing these 
configuration options:

rgw gc max objs (default: 32)
rgw gc obj min wait (default: 7200)
rgw gc processor max time (default: 3600)
rgw gc processor period (default: 3600)

A few questions I have though are:

Should 'rgw gc processor max time' and 'rgw gc processor period' always be 
set to the same value?

Which would be better, increasing 'rgw gc max objs' to something like 1024, 
or reducing the 'rgw gc processor' times to something like 60 seconds?

Any other guidance on the best way to adjust these values?

Thanks,
Bryan







Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

2017-06-20 Thread Pavan Rallabhandi
Hi Orit,

No, we do not use multi-site.

Thanks,
-Pavan.

From: Orit Wasserman <owass...@redhat.com>
Date: Tuesday, 20 June 2017 at 12:49 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

Hi Pavan, 

On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi 
<prallabha...@walmartlabs.com> wrote:
Trying one more time with ceph-users

On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote:

    On many of our clusters running Jewel (10.2.5+), am running into a strange 
problem of having stale bucket index entries left over for (some of the) 
objects deleted. Though it is not reproducible at will, it has been pretty 
consistent of late and am clueless at this point for the possible reasons to 
happen so.

    The symptoms are that the actual delete operation of an object is reported 
successful in the RGW logs, but a bucket list on the container would still show 
the deleted object. An attempt to download/stat of the object appropriately 
results in a failure. No failures are seen in the respective OSDs where the 
bucket index object is located. And rebuilding the bucket index by running 
‘radosgw-admin bucket check –fix’ would fix the issue.

    Though I could simulate the problem by instrumenting the code, to not to 
have invoked `complete_del` on the bucket index op 
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793, but that 
call is always seem to be made unless there is a cascading error from the 
actual delete operation of the object, which doesn’t seem to be the case here.

    I wanted to know the possible reasons where the bucket index would be left 
in such limbo, any pointers would be much appreciated. FWIW, we are not 
sharding the buckets and very recently I’ve seen this happen with buckets 
having as low as
    < 10 objects, and we are using swift for all the operations.

Do you use multisite? 

Regards,
Orit
 
    Thanks,
    -Pavan.







[ceph-users] FW: radosgw: stale/leaked bucket index entries

2017-06-19 Thread Pavan Rallabhandi
Trying one more time with ceph-users

On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote:

On many of our clusters running Jewel (10.2.5+), am running into a strange 
problem of having stale bucket index entries left over for (some of the) 
deleted objects. Though it is not reproducible at will, it has been pretty 
consistent of late, and am clueless at this point as to the possible reasons 
for it. 

The symptoms are that the actual delete operation of an object is reported 
successful in the RGW logs, but a bucket list on the container would still show 
the deleted object. An attempt to download/stat the object appropriately 
results in a failure. No failures are seen in the respective OSDs where the 
bucket index object is located. And rebuilding the bucket index by running 
'radosgw-admin bucket check --fix' would fix the issue.

Though I could simulate the problem by instrumenting the code to not invoke 
`complete_del` on the bucket index op 
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793, that 
call always seems to be made unless there is a cascading error from the 
actual delete operation of the object, which doesn't seem to be the case here.

I wanted to know the possible reasons why the bucket index would be left 
in such limbo; any pointers would be much appreciated. FWIW, we are not 
sharding the buckets, and very recently I've seen this happen with buckets 
having as few as 
< 10 objects, and we are using swift for all the operations.
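
For completeness, the check/repair sequence being used looks roughly like this 
(the bucket name is a placeholder):

radosgw-admin bucket check --bucket=bucket1                        # report inconsistencies
radosgw-admin bucket check --bucket=bucket1 --check-objects --fix  # rebuild the index

That repairs the symptom, but obviously not the root cause being chased here.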

Thanks,
-Pavan.





Re: [ceph-users] FW: radosgw: stale/leaked bucket index entries

2017-06-22 Thread Pavan Rallabhandi
Looks like I’ve now got a consistent repro scenario, please find the gory 
details here http://tracker.ceph.com/issues/20380

Thanks!

On 20/06/17, 2:04 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> wrote:

Hi Orit,

No, we do not use multi-site.

Thanks,
-Pavan.

From: Orit Wasserman <owass...@redhat.com>
Date: Tuesday, 20 June 2017 at 12:49 PM
    To: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Cc: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] FW: radosgw: stale/leaked bucket index 
entries

Hi Pavan, 
    
On Tue, Jun 20, 2017 at 8:29 AM, Pavan Rallabhandi 
<prallabha...@walmartlabs.com> wrote:
Trying one more time with ceph-users

On 19/06/17, 11:07 PM, "Pavan Rallabhandi" <prallabha...@walmartlabs.com> 
wrote:

On many of our clusters running Jewel (10.2.5+), am running into a 
strange problem of having stale bucket index entries left over for (some of 
the) objects deleted. Though it is not reproducible at will, it has been pretty 
consistent of late and am clueless at this point for the possible reasons to 
happen so.

The symptoms are that the actual delete operation of an object is 
reported successful in the RGW logs, but a bucket listing on the container 
still shows the deleted object. An attempt to download/stat the object 
appropriately results in a failure. No failures are seen in the respective 
OSDs where the bucket index object is located. And rebuilding the bucket index 
by running 'radosgw-admin bucket check --fix' fixes the issue.

Though I could simulate the problem by instrumenting the code so as not 
to invoke `complete_del` on the bucket index op 
(https://github.com/ceph/ceph/blob/master/src/rgw/rgw_rados.cc#L8793), that 
call always seems to be made unless there is a cascading error from the actual 
delete operation of the object, which doesn't seem to be the case here.

I wanted to know the possible reasons for the bucket index being left 
in such limbo; any pointers would be much appreciated. FWIW, we are not 
sharding the buckets, I have very recently seen this happen with buckets 
having fewer than 10 objects, and we are using Swift for all the operations.

Do you use multisite? 

Regards,
Orit
 
Thanks,
-Pavan.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







Re: [ceph-users] Bucket reporting content inconsistently

2018-05-21 Thread Pavan Rallabhandi
This can possibly be due to these: http://tracker.ceph.com/issues/20380, 
http://tracker.ceph.com/issues/22555

Thanks,

From: ceph-users  on behalf of Tom W 

Date: Saturday, May 12, 2018 at 10:57 AM
To: ceph-users 
Subject: EXT: Re: [ceph-users] Bucket reporting content inconsistently

Thanks for posting this for me, Sean. Just to update: it seems that despite the 
bucket checks completing and reporting no issues, the objects continued to show 
up in any tool used to list the contents of the bucket.

I put together a simple loop to upload a new file to overwrite each existing 
object and then trigger a delete request through the API, and this seems to be 
working in lieu of a cleaner solution.
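
Roughly, the loop looks like this (a sketch only; the bucket name, the 
placeholder file and the key list are stand-ins, not the exact script I used):

while read -r key; do
    s3cmd put placeholder.dat "s3://mybucket/$key"   # overwrite so the stale index entry points at a real object again
    s3cmd del "s3://mybucket/$key"                   # then delete it again through the API
done < stale_keys.txt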

We will be upgrading to Luminous in the coming week; I'll report back if we see 
any significant change in this issue when we do.

Kind Regards,

Tom

From: ceph-users  On Behalf Of Sean Redmond
Sent: 11 May 2018 17:15
To: ceph-users 
Subject: [ceph-users] Bucket reporting content inconsistently


Hi all,

We have recently upgraded to 10.2.10 in preparation for our upcoming upgrade to 
Luminous, and I have been attempting to remove a bucket. When using tools such 
as s3cmd I can see files are listed, verified by checking with a bi list too, 
as shown below:

root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bi list --bucket='bucketnamehere' | grep -i "\"idx\":" | wc -l
3278



However, on attempting to delete the bucket and purge the objects, it appears 
not to be recognised:

root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bucket rm --bucket=bucketnamehere --purge-objects
2018-05-10 14:11:05.393851 7f0ab07b6a00 -1 ERROR: unable to remove bucket(2) No 
such file or directory



Checking the bucket stats, it does appear that the bucket is reporting no 
content, and repeating the above content test, there has been no change to the 
3278 figure:

root@ceph-rgw-1:~# radosgw-admin --id rgw.ceph-rgw-1 bucket stats --bucket="bucketnamehere"

{
    "bucket": "bucketnamehere",
    "pool": ".rgw.buckets",
    "index_pool": ".rgw.buckets.index",
    "id": "default.28142894.1",
    "marker": "default.28142894.1",
    "owner": "16355",
    "ver": "0#5463545,1#5483686,2#5483484,3#5474696,4#5479052,5#5480339,6#5469460,7#5463976",
    "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0",
    "mtime": "2015-12-08 12:42:26.286153",
    "max_marker": "0#,1#,2#,3#,4#,5#,6#,7#",
    "usage": {
        "rgw.main": {
            "size_kb": 0,
            "size_kb_actual": 0,
            "num_objects": 0
        },
        "rgw.multimeta": {
            "size_kb": 0,
            "size_kb_actual": 0,
            "num_objects": 0
        }
    },
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}



I have attempted a bucket index check and fix on this; however, it does not 
appear to have made a difference, and no fixes or errors were reported from it. 
Does anyone have any advice on how to proceed with removing this content? At 
this stage I am not too concerned if the method needed to remove this generates 
orphans, as we will shortly be running a large orphan scan after our upgrade to 
Luminous. Cluster health otherwise reports as normal.



Thanks

Sean Redmond



NOTICE AND DISCLAIMER
This e-mail (including any attachments) is intended for the above-named 
person(s). If you are not the intended recipient, notify the sender 
immediately, delete this email from your system and do not disclose or use for 
any purpose. We may monitor all incoming and outgoing emails in line with 
current legislation. We have taken steps to ensure that this email and 
attachments are free from any virus, but it remains your responsibility to 
ensure that viruses do not adversely affect you

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel PG stuck inconsistent with 3 0-size objects

2018-07-16 Thread Pavan Rallabhandi
Yes, that suggestion worked for us, although we hit this when we upgraded to 
10.2.10 from 10.2.7.

I guess this was fixed via http://tracker.ceph.com/issues/21440 and 
http://tracker.ceph.com/issues/19404
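
For reference, the workaround amounts to something like this (a sketch using 
the pool/object/PG names from your report; let the deep-scrub finish before 
removing the temporary key):

rados -p default.rgw.buckets.index setomapval .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key anything
ceph pg deep-scrub 67.2e
rados -p default.rgw.buckets.index rmomapkey .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key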

Thanks,
-Pavan. 

On 7/16/18, 5:07 AM, "ceph-users on behalf of Matthew Vernon" 
 wrote:

Hi,

Our cluster is running 10.2.9 (from Ubuntu; on 16.04 LTS), and we have a
pg that's stuck inconsistent; if I repair it, it logs "failed to pick
suitable auth object" (repair log attached, to try and stop my MUA
mangling it).

We then deep-scrubbed that pg, at which point
rados list-inconsistent-obj 67.2e --format=json-pretty produces a bit of
output (also attached), which includes that all 3 osds have a zero-sized
object e.g.

"osd": 1937,
"errors": [
"omap_digest_mismatch_oi"
],
"size": 0,
"omap_digest": "0x45773901",
"data_digest": "0x"

All 3 osds have different omap_digest, but all have 0 size. Indeed,
looking on the OSD disks directly, each object is 0 size (i.e. they are
identical).

This looks similar to one of the failure modes in
http://tracker.ceph.com/issues/21388 where there is a suggestion (comment
19 from David Zafman) to do:

rados -p default.rgw.buckets.index setomapval
.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key anything
[deep-scrub]
rados -p default.rgw.buckets.index rmomapkey
.dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key

Is this likely to be the correct approach here, too? And is there an
underlying bug in ceph that still needs fixing? :)

Thanks,

Matthew



-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfilling on Luminous

2018-03-30 Thread Pavan Rallabhandi
Somehow, I missed replying to this: the random split would be enabled for all 
new PGs or the PGs that get mapped to new OSDs. For existing OSDs, one has to 
run ceph-objectstore-tool's apply-layout command on each OSD while the OSD is 
offline.
If you want to pre-split PGs using ‘expected_num_objects’ at the time of pool 
creation, be aware of this fix http://tracker.ceph.com/issues/22530.
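
As a rough illustration of both (the pool name, PG counts, object estimate and 
OSD id are placeholders, and the exact op name should be double-checked against 
your ceph-objectstore-tool version):

# pre-split at pool creation time via expected_num_objects
ceph osd pool create mypool 256 256 replicated replicated_rule 1000000

# apply the current filestore split/merge settings to an existing OSD's PGs (OSD offline)
systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op apply-layout-settings --pool mypool
systemctl start ceph-osd@12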

Thanks,
-Pavan.

From: David Turner <drakonst...@gmail.com>
Date: Tuesday, March 20, 2018 at 1:50 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Backfilling on Luminous

@Pavan, I did not know about 'filestore split rand factor'.  That looks like it 
was added in Jewel and I must have missed it.  To disable it, would I just set 
it to 0 and restart all of the OSDs?  That isn't an option at the moment, but 
restarting the OSDs after this backfilling is done is definitely doable.

On Mon, Mar 19, 2018 at 5:28 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote:
David,

Pretty sure you must be aware of the filestore random split on existing OSD 
PGs, `filestore split rand factor`; maybe you could try that too.

Thanks,
-Pavan.

From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>> 
on behalf of David Turner <drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
Date: Monday, March 19, 2018 at 1:36 PM
To: Caspar Smit <caspars...@supernas.eu<mailto:caspars...@supernas.eu>>
Cc: ceph-users <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Backfilling on Luminous

Sorry for being away. I set all of my backfilling to VERY slow settings over 
the weekend and things have been stable, but incredibly slow (1% recovery from 
3% misplaced to 2% all weekend).  I'm back on it now and well rested.

@Caspar, SWAP isn't being used on these nodes and all of the affected OSDs have 
been filestore.

@Dan, I think you hit the nail on the head.  I didn't know that logging was 
added for subfolder splitting in Luminous!!! That's AMAZING. We are seeing 
consistent subfolder splitting all across the cluster.  The majority of the 
crashed OSDs have a split started before the crash and then commenting about it 
in the crash dump.  Looks like I just need to write a daemon to watch for 
splitting to start and throttle recovery until it's done.
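
Something along these lines is what I have in mind (just a sketch; the log 
path, the string to match and the throttle values are guesses on my part):

while true; do
    if tail -n 2000 /var/log/ceph/ceph-osd.*.log | grep -q 'split'; then
        # a split was logged recently, so back recovery way off
        ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd 1.0 --osd_recovery_max_active 1'
    else
        # no recent splits, go back to the normal backfill settings
        ceph tell osd.\* injectargs '--osd_recovery_sleep_hdd 0.1 --osd_recovery_max_active 3'
    fi
    sleep 60
done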

I had injected the following timeout settings, but it didn't seem to affect 
anything.  I may need to have placed them in ceph.conf and let them pick up the 
new settings as the OSDs crashed, but I didn't really want different settings 
on some OSDs in the cluster.

osd_op_thread_suicide_timeout=1200 (from 180)
osd-recovery-thread-timeout=300  (from 30)
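
(For reference, I injected them with something like the following:)

ceph tell osd.\* injectargs '--osd_op_thread_suicide_timeout 1200 --osd_recovery_thread_timeout 300'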

My game plan for now is to watch for splitting in the log, increase recovery 
sleep, decrease osd_recovery_max_active, and watch for splitting to finish 
before setting them back to more aggressive settings.  After this cluster is 
done backfilling I'm going to do my best to reproduce this scenario in a test 
environment and open a ticket to hopefully fix why this is happening so 
detrimentally.


On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit 
<caspars...@supernas.eu<mailto:caspars...@supernas.eu>> wrote:
Hi David,

What about memory usage?

1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC 
P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM.

If you upgrade to bluestore, memory usage will likely increase. 15x10TB ~~ 
150GB RAM needed, especially in recovery/backfilling scenarios like these.

Kind regards,
Caspar


2018-03-15 21:53 GMT+01:00 Dan van der Ster 
<d...@vanderster.com<mailto:d...@vanderster.com>>:
Did you use perf top or iotop to try to identify where the osd is stuck?
Did you try increasing the op thread suicide timeout from 180s?

Splitting should log at the beginning and end of an op, so it should be clear 
if it's taking longer than the timeout.

.. Dan



On Mar 15, 2018 9:23 PM, "David Turner" 
<drakonst...@gmail.com<mailto:drakonst...@gmail.com>> wrote:
I am aware of the filestore splitting happening.  I manually split all of the 
subfolders a couple weeks ago on this cluster, but every time we have 
backfilling the newly moved PGs have a chance to split before the backfilling 
is done.  When that has happened in the past it causes some blocked requests 
and will flap OSDs if we don't increase the osd_heartbeat_grace, but it has 
never consistently killed the OSDs during the task.  Maybe that's new in 
Luminous due to some of the priority and timeout settings.

This problem in general seems unrelated to the subfolder splitting, though, 
since it started to happen very quickly into the backfilling process.  
Definitely before many of the recently moved PGs would have reached that point.

Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?

2018-04-01 Thread Pavan Rallabhandi
No, it is supported in the next version of Jewel 
http://tracker.ceph.com/issues/22658

From: ceph-users  on behalf of shadow_lin 

Date: Sunday, April 1, 2018 at 3:53 AM
To: ceph-users 
Subject: EXT: [ceph-users] Does jewel 10.2.10 support 
filestore_split_rand_factor?

Hi list,
The document page of jewel has filestore_split_rand_factor config but I can't 
find the config by using 'ceph daemon osd.x config'.

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
ceph daemon osd.0 config show|grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",


2018-04-01

shadow_lin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backfilling on Luminous

2018-03-19 Thread Pavan Rallabhandi
David,

Pretty sure you must be aware of the filestore random split on existing OSD 
PGs, `filestore split rand factor`; maybe you could try that too.
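
e.g. something like this in ceph.conf followed by an OSD restart (the value is 
only an illustration, assuming your build carries the option):

[osd]
# adds a random offset to the split threshold so that not all PGs split at once
filestore_split_rand_factor = 20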

Thanks,
-Pavan.

From: ceph-users  on behalf of David Turner 

Date: Monday, March 19, 2018 at 1:36 PM
To: Caspar Smit 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Backfilling on Luminous

Sorry for being away. I set all of my backfilling to VERY slow settings over 
the weekend and things have been stable, but incredibly slow (1% recovery from 
3% misplaced to 2% all weekend).  I'm back on it now and well rested.

@Caspar, SWAP isn't being used on these nodes and all of the affected OSDs have 
been filestore.

@Dan, I think you hit the nail on the head.  I didn't know that logging was 
added for subfolder splitting in Luminous!!! That's AMAZING  We are seeing 
consistent subfolder splitting all across the cluster.  The majority of the 
crashed OSDs have a split started before the crash and then commenting about it 
in the crash dump.  Looks like I just need to write a daemon to watch for 
splitting to start and throttle recovery until it's done.

I had injected the following timeout settings, but it didn't seem to affect 
anything.  I may need to have placed them in ceph.conf and let them pick up the 
new settings as the OSDs crashed, but I didn't really want different settings 
on some OSDs in the cluster.

osd_op_thread_suicide_timeout=1200 (from 180)
osd-recovery-thread-timeout=300  (from 30)

My game plan for now is to watch for splitting in the log, increase recovery 
sleep, decrease osd_recovery_max_active, and watch for splitting to finish 
before setting them back to more aggressive settings.  After this cluster is 
done backfilling I'm going to do my best to reproduce this scenario in a test 
environment and open a ticket to hopefully fix why this is happening so 
detrimentally.


On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit 
> wrote:
Hi David,

What about memory usage?

1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on Intel DC 
P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB RAM.

If you upgrade to bluestore, memory usage will likely increase. 15x10TB ~~ 
150GB RAM needed, especially in recovery/backfilling scenarios like these.

Kind regards,
Caspar


2018-03-15 21:53 GMT+01:00 Dan van der Ster 
>:
Did you use perf top or iotop to try to identify where the osd is stuck?
Did you try increasing the op thread suicide timeout from 180s?

Splitting should log at the beginning and end of an op, so it should be clear 
if it's taking longer than the timeout.

.. Dan



On Mar 15, 2018 9:23 PM, "David Turner" 
> wrote:
I am aware of the filestore splitting happening.  I manually split all of the 
subfolders a couple weeks ago on this cluster, but every time we have 
backfilling the newly moved PGs have a chance to split before the backfilling 
is done.  When that has happened in the past it causes some blocked requests 
and will flap OSDs if we don't increase the osd_heartbeat_grace, but it has 
never consistently killed the OSDs during the task.  Maybe that's new in 
Luminous due to some of the priority and timeout settings.

This problem in general seems unrelated to the subfolder splitting, though, 
since it started to happen very quickly into the backfilling process.  
Definitely before many of the recently moved PGs would have reached that point. 
 I've also confirmed that the OSDs that are dying are not just stuck on a 
process (like it looks like with filestore splitting), but actually segfaulting 
and restarting.

On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster 
> wrote:
Hi,

Do you see any split or merge messages in the osd logs?
I recall some surprise filestore splitting on a few osds after the luminous 
upgrade.

.. Dan


On Mar 15, 2018 6:04 PM, "David Turner" 
> wrote:
I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last week I 
added 2 nodes to the cluster.  The backfilling has been ATROCIOUS.  I have OSDs 
consistently [2] segfaulting during recovery.  There's no pattern of which OSDs 
are segfaulting, which hosts have segfaulting OSDs, etc... It's all over the 
cluster.  I have been trying variants on all of these following settings with 
different levels of success, but I cannot eliminate the blocked requests and 
segfaulting OSDs.  osd_heartbeat_grace, osd_max_backfills, 
osd_op_thread_suicide_timeout, osd_recovery_max_active, osd_recovery_sleep_hdd, 
osd_recovery_sleep_hybrid, osd_recovery_thread_timeout, and 
osd_scrub_during_recovery.  Except for setting nobackfilling on the cluster I 
can't stop OSDs from 

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-27 Thread Pavan Rallabhandi
I see Filestore symbols on the stack, so the bluestore config doesn't come into 
play. And the top frame of the stack hints at a RocksDB issue; there are a 
whole lot of these too:

“2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

It really seems to be something with RocksDB on CentOS. I still think you can 
try removing "compression=kNoCompression" from the filestore_rocksdb_options, 
and/or check whether rocksdb was built expecting snappy to be enabled.
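
i.e. something like this, keeping the rest of the options as they are 
(illustrative only):

filestore_rocksdb_options = "max_background_compactions=8,compaction_readahead_size=2097152"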

Thanks,
-Pavan.

From: David Turner 
Date: Thursday, September 27, 2018 at 1:18 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has 2 settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Do either of these settings affect the omap 
store?

On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com> wrote:
Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands on Jewel, Ubuntu 16.04.

Have a feel it’s expecting the compression to be enabled, can you try removing 
“compression=kNoCompression” from the filestore_rocksdb_options? And/or you 
might want to check if rocksdb is expecting snappy to be enabled.

From: David Turner <mailto:drakonst...@gmail.com>
Date: Tuesday, September 18, 2018 at 6:01 PM
To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com>
Cc: ceph-users <mailto:ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Here's the [1] full log from the time the OSD was started to the end of the 
crash dump.  These logs are so hard to parse.  Is there anything useful in them?

I did confirm that all perms were set correctly and that the superblock was 
changed to rocksdb before the first time I attempted to start the OSD with it's 
new DB.  This is on a fully Luminous cluster with [2] the defaults you 
mentioned.

[1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
[2] "filestore_omap_backend": "rocksdb",
"filestore_rocksdb_options": 
"max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",

On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi 
<mailto:mailto:prallabha...@walmartlabs.com> wrote:
I meant the stack trace hints that the superblock still has leveldb in it, have 
you verified that already?

On 9/18/18, 5:27 PM, "Pavan Rallabhandi" 
<mailto:mailto:prallabha...@walmartlabs.com> wrote:

    You should be able to set them under the global section and that reminds 
me, since you are on Luminous already, I guess those values are already the 
default, you can verify from the admin socket of any OSD.

    But the stack trace didn’t hint as if the superblock on the OSD is still 
considering the omap backend to be leveldb and to do with the compression.

    Thanks,
    -Pavan.

    From: David Turner <mailto:mailto:drakonst...@gmail.com>
    Date: Tuesday, September 18, 2018 at 5:07 PM
    To: Pavan Rallabhandi <mailto:mailto:prallabha...@walmartlabs.com>
    Cc: ceph-users <mailto:mailto:ceph-users@lists.ceph.com>
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    Are those settings fine to have be global even if not all OSDs on a node 
have rocksdb as the backend?  Or will I need to convert all OSDs on a node at 
the same time?

    On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi 
<mailto:mailto:mailto:mailto:prallabha...@walmartlabs.com> wrote:
    The steps that were outlined for conversion are correct, have you tried 
setting some the relevant ceph conf values too:

    filestore_rocksdb_options = 
"max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"

    filestore_omap_backend = rocksdb

    Thanks,
    -Pavan.

    From: ceph-users 
<mailto:mailto:mailto:mailto:ceph-users-boun...@lists.ceph.com> on behalf of 
David Turner <mailto:mailto:mailto:mailto:drakonst...@gmail.com>
    Date: Tuesday, September 18, 2018 at 4:09 PM
    To: ceph-users <mailto:mailto:mailto:mailto:ceph-users@lists.ceph.com>
    Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

    I've finally learned enough about the OSD backend track down this issue to 
what I believe is the root cause.  LevelDB compac

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-03 Thread Pavan Rallabhandi
Not exactly, this feature was supported in Jewel starting 10.2.11, ref 
https://github.com/ceph/ceph/pull/18010

I thought you mentioned you were using Luminous 12.2.4.

From: David Turner 
Date: Friday, November 2, 2018 at 5:21 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

That makes so much more sense. It seems like RHCS has had this ability since 
Jewel, while it was only put into the community version as of Mimic. So my 
version isn't actually capable of changing the backend db. While digging into 
the code I did find a bug with the creation of the rocksdb backend created with 
ceph-kvstore-tool: it doesn't use the ceph defaults or any settings in your 
config file for the db settings. I'm working on testing a modified version that 
should take those settings into account. If the fix does work, it will be able 
to apply to a few other tools as well that can be used to set up the omap 
backend db.

On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>> wrote:
It was Redhat versioned Jewel. But may be more relevantly, we are on Ubuntu 
unlike your case.

From: David Turner mailto:drakonst...@gmail.com>>
Date: Friday, November 2, 2018 at 10:24 AM

To: Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Pavan, which version of Ceph were you using when you changed your backend to 
rocksdb?

On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>> wrote:
Yeah, I think this is something to do with the CentOS binaries, sorry that I 
couldn’t be of much help here.

Thanks,
-Pavan.

From: David Turner mailto:drakonst...@gmail.com>>
Date: Monday, October 1, 2018 at 1:37 PM
To: Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I tried modifying filestore_rocksdb_options by removing 
compression=kNoCompression as well as setting it to 
compression=kSnappyCompression.  Leaving it with kNoCompression or removing it 
results in the same segfault in the previous log.  Setting it to 
kSnappyCompression resulted in [1] this being logged and the OSD just failing 
to start instead of segfaulting.  Is there anything else you would suggest 
trying before I purge this OSD from the cluster?  I'm afraid it might be 
something with the CentOS binaries.

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option compression 
= kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: 
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount(1723): Error initializing rocksdb :
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> 
wrote:
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this(below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pavan Rallabhandi" 
<mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> 
wrote:

I see Filestore symbols on the stack, so the bluestore config doesn’t 
affect. And the top frame of the stack hints at a RocksDB issue, and there are 
a whole lot of these too:

“2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

It really seems to be something with RocksD

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-02 Thread Pavan Rallabhandi
It was Redhat versioned Jewel. But maybe more relevantly, we are on Ubuntu, 
unlike your case.

From: David Turner 
Date: Friday, November 2, 2018 at 10:24 AM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Pavan, which version of Ceph were you using when you changed your backend to 
rocksdb?

On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>> wrote:
Yeah, I think this is something to do with the CentOS binaries, sorry that I 
couldn’t be of much help here.

Thanks,
-Pavan.

From: David Turner mailto:drakonst...@gmail.com>>
Date: Monday, October 1, 2018 at 1:37 PM
To: Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I tried modifying filestore_rocksdb_options by removing 
compression=kNoCompression as well as setting it to 
compression=kSnappyCompression.  Leaving it with kNoCompression or removing it 
results in the same segfault in the previous log.  Setting it to 
kSnappyCompression resulted in [1] this being logged and the OSD just failing 
to start instead of segfaulting.  Is there anything else you would suggest 
trying before I purge this OSD from the cluster?  I'm afraid it might be 
something with the CentOS binaries.

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option compression 
= kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: 
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount(1723): Error initializing rocksdb :
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> 
wrote:
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this(below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pavan Rallabhandi" 
<mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> 
wrote:

I see Filestore symbols on the stack, so the bluestore config doesn’t 
affect. And the top frame of the stack hints at a RocksDB issue, and there are 
a whole lot of these too:

“2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

It really seems to be something with RocksDB on centOS. I still think you 
can try removing “compression=kNoCompression” from the 
filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be 
enabled.

Thanks,
-Pavan.

From: David Turner 
<mailto:drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
Date: Thursday, September 27, 2018 at 1:18 PM
To: Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users 
<mailto:ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has 2 settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Do either of these settings affect the omap 
store?

On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi 
<mailto:mailto<mailto:mailto>:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
 wrote:
Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-11-05 Thread Pavan Rallabhandi
Not sure I understand that, but starting Luminous, the filestore omap backend 
is rocksdb by default.

From: David Turner 
Date: Monday, November 5, 2018 at 3:25 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Digging into the code a little more, that functionality was added in 10.2.11 
and 13.0.1, but it still isn't anywhere in the 12.x.x Luminous version.  That's 
so bizarre.

On Sat, Nov 3, 2018 at 11:56 AM Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>> wrote:
Not exactly, this feature was supported in Jewel starting 10.2.11, ref 
https://github.com/ceph/ceph/pull/18010

I thought you mentioned you were using Luminous 12.2.4.

From: David Turner mailto:drakonst...@gmail.com>>
Date: Friday, November 2, 2018 at 5:21 PM

To: Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

That makes so much more sense. It seems like RHCS has had this ability since 
Jewel, while it was only put into the community version as of Mimic. So my 
version isn't actually capable of changing the backend db. While digging into 
the code I did find a bug with the creation of the rocksdb backend created with 
ceph-kvstore-tool: it doesn't use the ceph defaults or any settings in your 
config file for the db settings. I'm working on testing a modified version that 
should take those settings into account. If the fix does work, it will be able 
to apply to a few other tools as well that can be used to set up the omap 
backend db.

On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>> wrote:
It was Redhat versioned Jewel. But may be more relevantly, we are on Ubuntu 
unlike your case.

From: David Turner mailto:drakonst...@gmail.com>>
Date: Friday, November 2, 2018 at 10:24 AM

To: Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Pavan, which version of Ceph were you using when you changed your backend to 
rocksdb?

On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>> wrote:
Yeah, I think this is something to do with the CentOS binaries, sorry that I 
couldn’t be of much help here.

Thanks,
-Pavan.

From: David Turner mailto:drakonst...@gmail.com>>
Date: Monday, October 1, 2018 at 1:37 PM
To: Pavan Rallabhandi 
mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I tried modifying filestore_rocksdb_options by removing 
compression=kNoCompression as well as setting it to 
compression=kSnappyCompression.  Leaving it with kNoCompression or removing it 
results in the same segfault in the previous log.  Setting it to 
kSnappyCompression resulted in [1] this being logged and the OSD just failing 
to start instead of segfaulting.  Is there anything else you would suggest 
trying before I purge this OSD from the cluster?  I'm afraid it might be 
something with the CentOS binaries.

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option compression 
= kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: 
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount(1723): Error initializing rocksdb :
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> 
wrote:
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this(below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pav

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-29 Thread Pavan Rallabhandi
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this (below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb: Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb: ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pavan Rallabhandi"  wrote:

I see Filestore symbols on the stack, so the bluestore config doesn’t 
affect. And the top frame of the stack hints at a RocksDB issue, and there are 
a whole lot of these too:

“2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

It really seems to be something with RocksDB on centOS. I still think you 
can try removing “compression=kNoCompression” from the 
filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be 
enabled.

Thanks,
-Pavan.

From: David Turner 
Date: Thursday, September 27, 2018 at 1:18 PM
    To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has 2 settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Do either of these settings affect the omap 
store?

On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com> wrote:
Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands on Jewel, Ubuntu 16.04.

Have a feel it’s expecting the compression to be enabled, can you try 
removing “compression=kNoCompression” from the filestore_rocksdb_options? 
And/or you might want to check if rocksdb is expecting snappy to be enabled.

From: David Turner <mailto:drakonst...@gmail.com>
Date: Tuesday, September 18, 2018 at 6:01 PM
To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com>
Cc: ceph-users <mailto:ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

Here's the [1] full log from the time the OSD was started to the end of the 
crash dump.  These logs are so hard to parse.  Is there anything useful in them?

I did confirm that all perms were set correctly and that the superblock was 
changed to rocksdb before the first time I attempted to start the OSD with it's 
new DB.  This is on a fully Luminous cluster with [2] the defaults you 
mentioned.

[1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
[2] "filestore_omap_backend": "rocksdb",
"filestore_rocksdb_options": 
"max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",

On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi 
<mailto:mailto:prallabha...@walmartlabs.com> wrote:
I meant the stack trace hints that the superblock still has leveldb in it, 
have you verified that already?

On 9/18/18, 5:27 PM, "Pavan Rallabhandi" 
<mailto:mailto:prallabha...@walmartlabs.com> wrote:

You should be able to set them under the global section and that 
reminds me, since you are on Luminous already, I guess those values are already 
the default, you can verify from the admin socket of any OSD.

But the stack trace didn’t hint as if the superblock on the OSD is 
still considering the omap backend to be leveldb and to do with the compression.

Thanks,
-Pavan.

From: David Turner <mailto:mailto:drakonst...@gmail.com>
Date: Tuesday, September 18, 2018 at 5:07 PM
To: Pavan Rallabhandi <mailto:mailto:prallabha...@walmartlabs.com>
Cc: ceph-users <mailto:mailto:ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

Are those settings fine to have be global 

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-10-01 Thread Pavan Rallabhandi
Yeah, I think this is something to do with the CentOS binaries, sorry that I 
couldn’t be of much help here.

Thanks,
-Pavan.

From: David Turner 
Date: Monday, October 1, 2018 at 1:37 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I tried modifying filestore_rocksdb_options by removing 
compression=kNoCompression as well as setting it to 
compression=kSnappyCompression.  Leaving it with kNoCompression or removing it 
results in the same segfault in the previous log.  Setting it to 
kSnappyCompression resulted in [1] this being logged and the OSD just failing 
to start instead of segfaulting.  Is there anything else you would suggest 
trying before I purge this OSD from the cluster?  I'm afraid it might be 
something with the CentOS binaries. 

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option compression 
= kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: 
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount(1723): Error initializing rocksdb :
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com> wrote:
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this(below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb:     Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb:     Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb:     Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb:     LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb:     ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pavan Rallabhandi" <mailto:prallabha...@walmartlabs.com> 
wrote:

    I see Filestore symbols on the stack, so the bluestore config doesn’t 
affect. And the top frame of the stack hints at a RocksDB issue, and there are 
a whole lot of these too:

    “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

    It really seems to be something with RocksDB on centOS. I still think you 
can try removing “compression=kNoCompression” from the 
filestore_rocksdb_options And/Or check if rocksdb is expecting snappy to be 
enabled.

    Thanks,
    -Pavan.

    From: David Turner <mailto:drakonst...@gmail.com>
    Date: Thursday, September 27, 2018 at 1:18 PM
    To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com>
    Cc: ceph-users <mailto:ceph-users@lists.ceph.com>
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has 2 settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Do either of these settings affect the omap 
store?

    On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi 
<mailto:mailto:prallabha...@walmartlabs.com> wrote:
    Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands on Jewel, Ubuntu 16.04.

    Have a feel it’s expecting the compression to be enabled, can you try 
removing “compression=kNoCompression” from the filestore_rocksdb_options? 
And/or you might want to check if rocksdb is expecting snappy to be enabled.

    From: David Turner <mailto:mailto:drakonst...@gmail.com>
    Date: Tuesday, September 18, 2018 at 6:01 PM
    To: Pavan Rallabhandi <mailto:mailto:prallabha...@walmartlabs.com>
    Cc: ceph-users <mailto:mailto:ceph-users@lists.ceph.com>
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    Here's the [1] full log from the time the OSD was started to the end of the 
crash dump.  These logs are so hard to parse.  Is there anything useful in them?

    I did confirm that all perms were set correctly and that the superblock was 
changed to ro

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-19 Thread Pavan Rallabhandi
Looks like you are running on CentOS, fwiw. We’ve successfully ran the 
conversion commands on Jewel, Ubuntu 16.04.

I have a feeling it's expecting the compression to be enabled; can you try 
removing "compression=kNoCompression" from the filestore_rocksdb_options? 
And/or you might want to check whether rocksdb was built expecting snappy to be 
enabled.

From: David Turner 
Date: Tuesday, September 18, 2018 at 6:01 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Here's the [1] full log from the time the OSD was started to the end of the 
crash dump.  These logs are so hard to parse.  Is there anything useful in them?

I did confirm that all perms were set correctly and that the superblock was 
changed to rocksdb before the first time I attempted to start the OSD with it's 
new DB.  This is on a fully Luminous cluster with [2] the defaults you 
mentioned.

[1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
[2] "filestore_omap_backend": "rocksdb",
"filestore_rocksdb_options": 
"max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",

On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com> wrote:
I meant the stack trace hints that the superblock still has leveldb in it, have 
you verified that already?

On 9/18/18, 5:27 PM, "Pavan Rallabhandi" <mailto:prallabha...@walmartlabs.com> 
wrote:

    You should be able to set them under the global section and that reminds 
me, since you are on Luminous already, I guess those values are already the 
default, you can verify from the admin socket of any OSD.

    But the stack trace didn’t hint as if the superblock on the OSD is still 
considering the omap backend to be leveldb and to do with the compression.

    Thanks,
    -Pavan.

    From: David Turner <mailto:drakonst...@gmail.com>
    Date: Tuesday, September 18, 2018 at 5:07 PM
    To: Pavan Rallabhandi <mailto:prallabha...@walmartlabs.com>
    Cc: ceph-users <mailto:ceph-users@lists.ceph.com>
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    Are those settings fine to have be global even if not all OSDs on a node 
have rocksdb as the backend?  Or will I need to convert all OSDs on a node at 
the same time?

    On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi 
<mailto:mailto:prallabha...@walmartlabs.com> wrote:
    The steps that were outlined for conversion are correct, have you tried 
setting some the relevant ceph conf values too:

    filestore_rocksdb_options = 
"max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"

    filestore_omap_backend = rocksdb

    Thanks,
    -Pavan.

    From: ceph-users <mailto:mailto:ceph-users-boun...@lists.ceph.com> on 
behalf of David Turner <mailto:mailto:drakonst...@gmail.com>
    Date: Tuesday, September 18, 2018 at 4:09 PM
    To: ceph-users <mailto:mailto:ceph-users@lists.ceph.com>
    Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

    I've finally learned enough about the OSD backend track down this issue to 
what I believe is the root cause.  LevelDB compaction is the common thread 
every time we move data around our cluster.  I've ruled out PG subfolder 
splitting, EC doesn't seem to be the root cause of this, and it is cluster wide 
as opposed to specific hardware. 

    One of the first things I found after digging into leveldb omap compaction 
was [1] this article with a heading "RocksDB instead of LevelDB" which mentions 
that leveldb was replaced with rocksdb as the default db backend for filestore 
OSDs and was even backported to Jewel because of the performance improvements.

    I figured there must be a way to be able to upgrade an OSD to use rocksdb 
from leveldb without needing to fully backfill the entire OSD.  There is [2] 
this article, but you need to have an active service account with RedHat to 
access it.  I eventually came across [3] this article about optimizing Ceph 
Object Storage which mentions a resolution to OSDs flapping due to omap 
compaction to migrate to using rocksdb.  It links to the RedHat article, but 
also has [4] these steps outlined in it.  I tried to follow the steps, but the 
OSD I tested this on was unable to start with [5] this segfault.  And then 
trying to move the OSD back to the original LevelDB omap folder resulted in [6] 
this in the log.  I apologize that all of my logging is with log level 1.  If 
needed I can get some higher log levels.

    My Ceph version is 12.2.4.  Does anyone have any suggestions for how I can 
update my filestore backend from leveldb to rocksdb?  Or if that's the wrong 
direction and I should be looking elsewhere?  Thank you.


    [1] ht

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-18 Thread Pavan Rallabhandi
The steps that were outlined for conversion are correct; have you tried setting 
some of the relevant ceph conf values too:

filestore_rocksdb_options = 
"max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"

filestore_omap_backend = rocksdb
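
For instance, in ceph.conf (a sketch; these can also go under the global 
section):

[global]
filestore_omap_backend = rocksdb
filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"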

Thanks,
-Pavan.

From: ceph-users  on behalf of David Turner 

Date: Tuesday, September 18, 2018 at 4:09 PM
To: ceph-users 
Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I've finally learned enough about the OSD backend to track down this issue to 
what I believe is the root cause.  LevelDB compaction is the common thread every 
time we move data around our cluster.  I've ruled out PG subfolder splitting, 
EC doesn't seem to be the root cause of this, and it is cluster wide as opposed 
to specific hardware. 

One of the first things I found after digging into leveldb omap compaction was 
[1] this article with a heading "RocksDB instead of LevelDB" which mentions 
that leveldb was replaced with rocksdb as the default db backend for filestore 
OSDs and was even backported to Jewel because of the performance improvements.

I figured there must be a way to be able to upgrade an OSD to use rocksdb from 
leveldb without needing to fully backfill the entire OSD.  There is [2] this 
article, but you need to have an active service account with RedHat to access 
it.  I eventually came across [3] this article about optimizing Ceph Object 
Storage which mentions a resolution to OSDs flapping due to omap compaction to 
migrate to using rocksdb.  It links to the RedHat article, but also has [4] 
these steps outlined in it.  I tried to follow the steps, but the OSD I tested 
this on was unable to start with [5] this segfault.  And then trying to move 
the OSD back to the original LevelDB omap folder resulted in [6] this in the 
log.  I apologize that all of my logging is with log level 1.  If needed I can 
get some higher log levels.

My Ceph version is 12.2.4.  Does anyone have any suggestions for how I can 
update my filestore backend from leveldb to rocksdb?  Or if that's the wrong 
direction and I should be looking elsewhere?  Thank you.


[1] https://ceph.com/community/new-luminous-rados-improvements/
[2] https://access.redhat.com/solutions/3210951
[3] 
https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize
 Ceph object storage for production in multisite clouds.pdf

[4] ■ Stop the OSD
■ mv /var/lib/ceph/osd/ceph-/current/omap /var/lib/ceph/osd/ceph-/omap.orig
■ ulimit -n 65535
■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-/omap.orig store-copy 
/var/lib/ceph/osd/ceph-/current/omap 1 rocksdb
■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-/current/omap --command 
check
■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-/superblock
■ chown ceph.ceph /var/lib/ceph/osd/ceph-/current/omap -R
■ cd /var/lib/ceph/osd/ceph-; rm -rf omap.orig
■ Start the OSD

[5] 2018-09-17 19:23:10.826227 7f1f3f2ab700 -1 abort: Corruption: Snappy not 
supported or corrupted Snappy compressed block contents
2018-09-17 19:23:10.830525 7f1f3f2ab700 -1 *** Caught signal (Aborted) **

[6] 2018-09-17 19:27:34.010125 7fcdee97cd80 -1 osd.0 0 OSD:init: unable to 
mount object store
2018-09-17 19:27:34.010131 7fcdee97cd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m
2018-09-17 19:27:54.225941 7f7f03308d80  0 set uid:gid to 167:167 (ceph:ceph)
2018-09-17 19:27:54.225975 7f7f03308d80  0 ceph version 12.2.4 
(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process 
(unknown), pid 361535
2018-09-17 19:27:54.231275 7f7f03308d80  0 pidfile_write: ignore empty 
--pid-file
2018-09-17 19:27:54.260207 7f7f03308d80  0 load: jerasure load: lrc load: isa
2018-09-17 19:27:54.260520 7f7f03308d80  0 filestore(/var/lib/ceph/osd/ceph-0) 
backend xfs (magic 0x58465342)
2018-09-17 19:27:54.261135 7f7f03308d80  0 filestore(/var/lib/ceph/osd/ceph-0) 
backend xfs (magic 0x58465342)
2018-09-17 19:27:54.261750 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl 
is disabled via 'filestore fiemap' config option
2018-09-17 19:27:54.261757 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: 
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2018-09-17 19:27:54.261758 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice() is 
disabled via 'filestore splice' config option
2018-09-17 19:27:54.286454 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) 
syscall fully supported (by glibc and kernel)
2018-09-17 19:27:54.286572 7f7f03308d80  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is 
disabled by conf
2018-09-17 19:27:54.287119 7f7f03308d80  0 filestore(/var/lib/ceph/osd/ceph-0) 
start omap initiation
2018-09-17 19:27:54.287527 7f7f03308d80 -1 

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-18 Thread Pavan Rallabhandi
You should be able to set them under the global section. And that reminds me: 
since you are on Luminous already, I guess those values are already the 
default; you can verify from the admin socket of any OSD.
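
For example, something like this (a sketch, assuming an OSD id of 0 and the 
default admin socket setup) should show what the OSD is actually running with:

# run on the host where osd.0 lives
ceph daemon osd.0 config get filestore_omap_backend
ceph daemon osd.0 config get filestore_rocksdb_options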

But the stack trace hints that the superblock on the OSD still considers the 
omap backend to be leveldb, which would also explain the compression error.

Thanks,
-Pavan.

From: David Turner 
Date: Tuesday, September 18, 2018 at 5:07 PM
To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Are those settings fine to set globally even if not all OSDs on a node have 
rocksdb as the backend?  Or will I need to convert all OSDs on a node at the 
same time?

On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com> wrote:
The steps that were outlined for conversion are correct. Have you tried setting 
some of the relevant ceph conf values too:

filestore_rocksdb_options = 
"max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"

filestore_omap_backend = rocksdb

Thanks,
-Pavan.

From: ceph-users <mailto:ceph-users-boun...@lists.ceph.com> on behalf of David 
Turner <mailto:drakonst...@gmail.com>
Date: Tuesday, September 18, 2018 at 4:09 PM
To: ceph-users <mailto:ceph-users@lists.ceph.com>
Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I've finally learned enough about the OSD backend to track down this issue to 
what I believe is the root cause.  LevelDB compaction is the common thread 
every time we move data around our cluster.  I've ruled out PG subfolder 
splitting, EC doesn't seem to be the root cause of this, and it is cluster-wide 
as opposed to being specific to particular hardware.

One of the first things I found after digging into leveldb omap compaction was 
[1] this article with a heading "RocksDB instead of LevelDB" which mentions 
that leveldb was replaced with rocksdb as the default db backend for filestore 
OSDs and was even backported to Jewel because of the performance improvements.

I figured there must be a way to upgrade an OSD from leveldb to rocksdb 
without needing to fully backfill the entire OSD.  There is [2] this article, 
but you need an active service account with RedHat to access it.  I 
eventually came across [3] this article about optimizing Ceph Object Storage, 
which mentions migrating to rocksdb as a resolution for OSDs flapping due to 
omap compaction.  It links to the RedHat article, but also has [4] these steps 
outlined in it.  I tried to follow the steps, but the OSD I tested this on was 
unable to start with [5] this segfault.  And then trying to move the OSD back 
to the original LevelDB omap folder resulted in [6] this in the log.  I 
apologize that all of my logging is at log level 1.  If needed I can get 
higher log levels.

My Ceph version is 12.2.4.  Does anyone have any suggestions for how I can 
update my filestore backend from leveldb to rocksdb?  Or if that's the wrong 
direction and I should be looking elsewhere?  Thank you.


[1] https://ceph.com/community/new-luminous-rados-improvements/
[2] https://access.redhat.com/solutions/3210951
[3] 
https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize
 Ceph object storage for production in multisite clouds.pdf

[4] ■ Stop the OSD
■ mv /var/lib/ceph/osd/ceph-/current/omap /var/lib/ceph/osd/ceph-/omap.orig
■ ulimit -n 65535
■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-/omap.orig store-copy 
/var/lib/ceph/osd/ceph-/current/omap 1 rocksdb
■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-/current/omap --command 
check
■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-/superblock
■ chown ceph.ceph /var/lib/ceph/osd/ceph-/current/omap -R
■ cd /var/lib/ceph/osd/ceph-; rm -rf omap.orig
■ Start the OSD

[5] 2018-09-17 19:23:10.826227 7f1f3f2ab700 -1 abort: Corruption: Snappy not 
supported or corrupted Snappy compressed block contents
2018-09-17 19:23:10.830525 7f1f3f2ab700 -1 *** Caught signal (Aborted) **

[6] 2018-09-17 19:27:34.010125 7fcdee97cd80 -1 osd.0 0 OSD:init: unable to 
mount object store
2018-09-17 19:27:34.010131 7fcdee97cd80 -1 ** ERROR: osd init failed: (1) 
Operation not permitted
2018-09-17 19:27:54.225941 7f7f03308d80  0 set uid:gid to 167:167 (ceph:ceph)
2018-09-17 19:27:54.225975 7f7f03308d80  0 ceph version 12.2.4 
(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process 
(unknown), pid 361535
2018-09-17 19:27:54.231275 7f7f03308d80  0 pidfile_write: ignore empty 
--pid-file
2018-09-17 19:27:54.260207 7f7f03308d80  0 load: jerasure load: lrc load: isa
2018-09-17 19:27:54.260520 7f7f03308d80  0 filestore(/var/lib/ceph/osd/ceph-0) 
backend xfs (magic 0x58465342)
2018-09-17 19:27:54.261135 7f7f03308d80  0 filestore(/var/lib/ceph/osd/ceph-0) 
bac

Re: [ceph-users] Any backfill in our cluster makes the cluster unusable and takes forever

2018-09-18 Thread Pavan Rallabhandi
I meant that the stack trace hints the superblock still has leveldb in it; have 
you verified that already?
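
Something like this (a sketch, assuming osd.0; the superblock is a small binary 
file, so strings/grep is only an informal check) should show which omap backend 
it records:

strings /var/lib/ceph/osd/ceph-0/superblock | grep -E 'leveldb|rocksdb'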

On 9/18/18, 5:27 PM, "Pavan Rallabhandi"  wrote:

You should be able to set them under the global section. And that reminds 
me: since you are on Luminous already, I guess those values are already the 
default; you can verify from the admin socket of any OSD.

But the stack trace hints that the superblock on the OSD still considers the 
omap backend to be leveldb, which would also explain the compression error.

Thanks,
-Pavan.

From: David Turner 
Date: Tuesday, September 18, 2018 at 5:07 PM
    To: Pavan Rallabhandi 
Cc: ceph-users 
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

Are those settings fine to set globally even if not all OSDs on a node 
have rocksdb as the backend?  Or will I need to convert all OSDs on a node at 
the same time?

On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi 
<mailto:prallabha...@walmartlabs.com> wrote:
The steps that were outlined for conversion are correct. Have you tried 
setting some of the relevant ceph conf values too:

filestore_rocksdb_options = 
"max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"

filestore_omap_backend = rocksdb

Thanks,
-Pavan.

From: ceph-users <mailto:ceph-users-boun...@lists.ceph.com> on behalf of 
David Turner <mailto:drakonst...@gmail.com>
Date: Tuesday, September 18, 2018 at 4:09 PM
To: ceph-users <mailto:ceph-users@lists.ceph.com>
Subject: EXT: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I've finally learned enough about the OSD backend to track down this issue to 
what I believe is the root cause.  LevelDB compaction is the common thread 
every time we move data around our cluster.  I've ruled out PG subfolder 
splitting, EC doesn't seem to be the root cause of this, and it is cluster-wide 
as opposed to being specific to particular hardware.

One of the first things I found after digging into leveldb omap compaction 
was [1] this article with a heading "RocksDB instead of LevelDB" which mentions 
that leveldb was replaced with rocksdb as the default db backend for filestore 
OSDs and was even backported to Jewel because of the performance improvements.

I figured there must be a way to upgrade an OSD from leveldb to rocksdb 
without needing to fully backfill the entire OSD.  There is [2] this article, 
but you need an active service account with RedHat to access it.  I 
eventually came across [3] this article about optimizing Ceph Object Storage, 
which mentions migrating to rocksdb as a resolution for OSDs flapping due to 
omap compaction.  It links to the RedHat article, but also has [4] these steps 
outlined in it.  I tried to follow the steps, but the OSD I tested this on was 
unable to start with [5] this segfault.  And then trying to move the OSD back 
to the original LevelDB omap folder resulted in [6] this in the log.  I 
apologize that all of my logging is at log level 1.  If needed I can get 
higher log levels.

My Ceph version is 12.2.4.  Does anyone have any suggestions for how I can 
update my filestore backend from leveldb to rocksdb?  Or if that's the wrong 
direction and I should be looking elsewhere?  Thank you.


[1] https://ceph.com/community/new-luminous-rados-improvements/
[2] https://access.redhat.com/solutions/3210951
[3] 
https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize
 Ceph object storage for production in multisite clouds.pdf

[4] ■ Stop the OSD
■ mv /var/lib/ceph/osd/ceph-/current/omap /var/lib/ceph/osd/ceph-/omap.orig
■ ulimit -n 65535
■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-/omap.orig store-copy 
/var/lib/ceph/osd/ceph-/current/omap 1 rocksdb
■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-/current/omap 
--command check
■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-/superblock
■ chown ceph.ceph /var/lib/ceph/osd/ceph-/current/omap -R
■ cd /var/lib/ceph/osd/ceph-; rm -rf omap.orig
■ Start the OSD

[5] 2018-09-17 19:23:10.826227 7f1f3f2ab700 -1 abort: Corruption: Snappy 
not supported or corrupted Snappy compressed block contents
2018-09-17 19:23:10.830525 7f1f3f2ab700 -1 *** Caught signal (Aborted) **

[6] 2018-09-17 19:27:34.010125 7fcdee97cd80 -1 osd.0 0 OSD:init: unable to 
mount object store
2018-09-17 19:27:34.010131 7fcdee97cd80 -1 ** ERROR: osd init failed: (1) 
Operation not permitted
2018-09-17 19:27:54.225941 7f7f03308d80  0 set uid:gid to 167:167 
(ceph:ceph)
2018-09-17 19:27:54.225975 7f7f03308d80  0 ceph version 12.2.4 
(52085d5249a80c5f5121a76d6288429f35e4e

Re: [ceph-users] Large OMAP Objects in default.rgw.log pool

2019-03-09 Thread Pavan Rallabhandi
That can happen if you have a lot of objects with swift object expiry (TTL) 
enabled. You can 'listomapkeys' on these log pool objects and check for the 
objects that are registered for TTL as omap entries. I know this is the case 
with at least the Jewel version.
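
For example (a sketch, borrowing one of the object names from the listing 
further down; the exact names will differ):

# count the expiry entries registered on one of the hint objects
rados -p default.rgw.log listomapkeys obj_delete_at_hint.000104 | wc -l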

Thanks,
-Pavan.

On 3/7/19, 10:09 PM, "ceph-users on behalf of Brad Hubbard" 
 wrote:

On Fri, Mar 8, 2019 at 4:46 AM Samuel Taylor Liston  
wrote:
>
> Hello All,
> I have recently had 32 large map objects appear in my 
default.rgw.log pool.  Running luminous 12.2.8.
>
> Not sure what to think about these. I've done a lot of reading about how, 
when these normally occur, it is related to a bucket needing resharding, but 
it doesn't look like my default.rgw.log pool has anything in it, let alone 
buckets.  Here's some info on the system:
>
> [root@elm-rgw01 ~]# ceph versions
> {
> "mon": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 5
> },
> "mgr": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 1
> },
> "osd": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 192
> },
> "mds": {},
> "rgw": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 1
> },
> "overall": {
> "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) 
luminous (stable)": 199
> }
> }
> [root@elm-rgw01 ~]# ceph osd pool ls
> .rgw.root
> default.rgw.control
> default.rgw.meta
> default.rgw.log
> default.rgw.buckets.index
> default.rgw.buckets.non-ec
> default.rgw.buckets.data
> [root@elm-rgw01 ~]# ceph health detail
> HEALTH_WARN 32 large omap objects
> LARGE_OMAP_OBJECTS 32 large omap objects
> 32 large objects found in pool 'default.rgw.log'
> Search the cluster log for 'Large omap object found' for more 
details.—
>
> Looking closer at these objects, they are all of size 0.  Also that pool 
shows a capacity usage of 0:

The size here relates to data size. Object map (omap) data is metadata
so an object of size 0 can have considerable omap data associated with
it (the omap data is stored separately from the object in a key/value
database). The large omap warning in the health detail output should tell
you to "Search the cluster log for 'Large omap object found' for more
details." If you do that you should get the names of the specific
objects involved. You can then use the rados commands listomapkeys and
listomapvals to see the specifics of the omap data. Someone more
familiar with rgw can then probably help you out on what purpose they
serve.

HTH.
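
As a concrete sketch of that first step (assuming the default cluster log 
location on a monitor host):

grep 'Large omap object found' /var/log/ceph/ceph.log
# then inspect one of the reported objects, e.g.:
rados -p default.rgw.log listomapvals <object-name>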

>
> (just a sampling of the 236 objects at size 0)
>
> [root@elm-mon01 ceph]# for i in `rados ls -p default.rgw.log`; do echo 
${i}; rados stat -p default.rgw.log ${i};done
> obj_delete_at_hint.78
> default.rgw.log/obj_delete_at_hint.78 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.70
> default.rgw.log/obj_delete_at_hint.70 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.000104
> default.rgw.log/obj_delete_at_hint.000104 mtime 2019-03-07 
11:39:20.00, size 0
> obj_delete_at_hint.26
> default.rgw.log/obj_delete_at_hint.26 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.28
> default.rgw.log/obj_delete_at_hint.28 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.40
> default.rgw.log/obj_delete_at_hint.40 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.15
> default.rgw.log/obj_delete_at_hint.15 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.69
> default.rgw.log/obj_delete_at_hint.69 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.95
> default.rgw.log/obj_delete_at_hint.95 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.03
> default.rgw.log/obj_delete_at_hint.03 mtime 2019-03-07 
11:39:19.00, size 0
> obj_delete_at_hint.47
> default.rgw.log/obj_delete_at_hint.47 mtime 2019-03-07 
11:39:19.00, size 0
>
>
> [root@elm-mon01 ceph]# rados df
> POOL_NAME  USEDOBJECTS   CLONES COPIES 
MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPSRD  WR_OPSWR
> .rgw.root  1.09KiB 4  0 12
  0   00 14853 9.67MiB 0 0B
> default.rgw.buckets.data444TiB 166829939  0 1000979634
  0   00 362357590  859TiB 909188749 703TiB
  

Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-08 Thread Pavan Rallabhandi
Refer to "rgw log http headers" under 
http://docs.ceph.com/docs/nautilus/radosgw/config-ref/

Or even better in the code https://github.com/ceph/ceph/pull/7639
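
If I remember right, that option takes a comma-separated list of header names 
prefixed with "http_", so something along these lines in the rgw section of 
ceph.conf (a sketch; the client section name depends on your rgw instance):

[client.rgw.gateway-node1]
rgw log http headers = "http_x_forwarded_for"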

Thanks,
-Pavan.

On 4/8/19, 8:32 PM, "ceph-users on behalf of Francois Lafont" 
 
wrote:

Hi @all,

I'm using the Ceph rados gateway installed via ceph-ansible with the Nautilus
version. The radosgw instances are behind a haproxy which adds these headers
(checked via tcpdump):

 X-Forwarded-Proto: http
 X-Forwarded-For: 10.111.222.55

where 10.111.222.55 is the IP address of the client. The radosgw uses the
civetweb http frontend. Currently, it is the IP address of the haproxy itself
that appears in the logs. I would like to log the IP address from the
X-Forwarded-For HTTP header instead. How can I do that?

I have tried this option in ceph.conf:

 rgw_remote_addr_param = X-Forwarded-For

It doesn't work but maybe I have read the doc wrongly.

Thx in advance for your help.

PS: I have also tried the "beast" http frontend but, in that case, no HTTP
requests seem to be logged.

-- 
François
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com