Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Konstantin Shalygin

I'm planning to split the block db onto a separate flash device which I
also would like to use as an OSD for erasure coding metadata for rbd
devices.

If I want to use 14x 14TB HDDs per node,
https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing

recommends a minimum size of 140GB per 14TB HDD.

Is there any recommendation for how many OSDs a single flash device can
serve? The Optane ones can do 2000 MB/s write + 500,000 IOPS.


Any block.db size other than 3/30/300 GB is useless.

How many OSDs per NVMe? As many as you can tolerate losing at once, since the flash device is a single point of failure for every OSD whose DB lives on it.
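
For the numbers in this thread, a minimal back-of-the-envelope sketch (the NVMe device path and VG/LV names below are only examples, not from the original post):

```
# 14 HDD OSDs x 30 GB block.db = 420 GB taken from the flash device,
# leaving the rest for the flash OSD itself.
pvcreate /dev/nvme0n1
vgcreate ceph-db /dev/nvme0n1
for i in $(seq 0 13); do
    lvcreate -L 30G -n db-$i ceph-db        # one 30 GB DB LV per HDD OSD
done
# per HDD, something like:
# ceph-volume lvm create --data /dev/sdX --block.db ceph-db/db-N
```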



k



Re: [ceph-users] Install specific version using ansible

2020-01-09 Thread Konstantin Shalygin

Hello all!
I'm trying to install a specific version of luminous (12.2.4). In the
directory group_vars/all.yml I can specify the luminous version, but i
didn't find a place where I can be more specific about the version.

Ansible installs the latest version (12.2.12 at this time).

I'm using ceph-ansible stable-3.1.

Is it possible, or do I have to downgrade?


Just install the packages of the version you need before running the deploy, and don't upgrade them afterwards.
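
A minimal sketch of what "install before deploy" can look like on a yum-based node (assumes the Luminous repo is configured and that the yum-plugin-versionlock package is available; the package list is not exhaustive):

```
# pull the exact build you want before running ceph-ansible
yum install -y ceph-12.2.4 ceph-common-12.2.4 librados2-12.2.4 librbd1-12.2.4
# optionally lock the version so nothing upgrades it behind your back
yum install -y yum-plugin-versionlock
yum versionlock add 'ceph-*' librados2 librbd1
```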



k



Re: [ceph-users] Use telegraf/influx to detect problems is very difficult

2019-12-10 Thread Konstantin Shalygin

But it is very difficult/complicated to make simple queries because, for
example I have osd up and osd total but not osd down metric.

To determine how many OSDs are down you don't need a special metric, because you already have the osd_up and osd_in metrics. Just use math: osds_down = osds_total - osds_up.
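
For example, assuming the Telegraf ceph input plugin is writing a `ceph_osdmap` measurement with `num_osds` and `num_up_osds` fields (check your own schema; names vary by plugin version), an InfluxQL query for down OSDs could look like:

```
SELECT last("num_osds") - last("num_up_osds") AS "osds_down"
FROM "ceph_osdmap"
WHERE time > now() - 5m
```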




k



Re: [ceph-users] Is a scrub error (read_error) on a primary osd safe to repair?

2019-12-04 Thread Konstantin Shalygin

I tried to dig in the mailinglist archives but couldn't find a clear answer
to the following situation:

Ceph encountered a scrub error resulting in HEALTH_ERR
Two PG's are active+clean+inconsistent. When investigating the PG i see a
"read_error" on the primary OSD. Both PG's  are replicated PG's with 3
copies.

I'm on Luminous 12.2.5 on this installation, is it safe to just run "ceph
pg repair" on those PG's or will it then overwrite the two good copies with
the bad one from the primary?
If the latter is true, what is the correct way to resolve this?

Yes, you should call pg repair. Also, it's better to upgrade to 12.2.12.
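
Since the read_error is on the primary, repair should pull an authoritative copy from the healthy replicas rather than overwriting them with the unreadable one. The usual workflow looks something like this (the PG id is only an example):

```
ceph health detail                                      # shows which PGs are inconsistent
rados list-inconsistent-obj 2.1ab --format=json-pretty  # identifies the shard with the read_error
ceph pg repair 2.1ab
```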



k



Re: [ceph-users] rbd image size

2019-11-25 Thread Konstantin Shalygin

Hello ,  I  use ceph as block storage in kubernetes. I want to get the rbd usage by command 
"rbd diff image_id | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }’”, 
but I found it is a lot bigger than the value which I got by command “df -h” in the pod. I do 
not know the reason and need your help.


`rbd du image_id` will save your life.



k



Re: [ceph-users] Impact of a small DB size with Bluestore

2019-11-25 Thread Konstantin Shalygin

I have an Ceph cluster which was designed for file store. Each host
have 5 SSDs write intensive of 400GB and 20 HDD of 6TB. So each HDD
have a WAL of 5 GB on SSD
If i want to put Bluestore on this cluster, i can only allocate ~75GB
of WAL and DB on SSD for each HDD which is far below the 4% limit of
240GB (for 6TB)
In the doc, i read "It is recommended that the block.db size isn’t
smaller than 4% of block. For example, if the block size is 1TB, then
block.db shouldn’t be less than 40GB."
Are the 4% mandatory ? What should i expect ? Only relative slow
performance or problem with such a configuration ?


You should use not more than 1 GB for the WAL and 30 GB for RocksDB. Remember the numbers 3/30/300 (GB): any other block.db size is useless.




k



Re: [ceph-users] Strange CEPH_ARGS problems

2019-11-15 Thread Konstantin Shalygin

I found a typo in my post:

Of course I tried

export CEPH_ARGS="-n client.rz --keyring="

and not

export CEPH_ARGS=="-n client.rz --keyring="


Try `export CEPH_ARGS="--id rz --keyring=..."` instead.



k



Re: [ceph-users] changing set-require-min-compat-client will cause hiccup?

2019-10-31 Thread Konstantin Shalygin

On 10/31/19 2:12 PM, Philippe D'Anjou wrote:

Hi, it is NOT safe.
All clients fail to mount rbds now :(


Are your clients upmap compatible?



k



Re: [ceph-users] changing set-require-min-compat-client will cause hiccup?

2019-10-30 Thread Konstantin Shalygin

Hi, I need to change set-require-min-compat-client to use upmap mode for the PG
balancer. Will this cause a disconnect of all clients? We're talking CephFS and
RBD images for VMs.
Or is it safe to switch that live?

It is safe.



k



Re: [ceph-users] ceph balancer do not start

2019-10-25 Thread Konstantin Shalygin

connections coming from qemu vm clients.
It's generally easy to upgrade. Just switch your Ceph yum repo from 
jewel to luminous.


Then update `librbd` on your hypervisors and live-migrate your VMs. It's
fast and involves no downtime for your VMs.




k



Re: [ceph-users] ceph balancer do not start

2019-10-24 Thread Konstantin Shalygin

Hi,

ceph features
{
 "mon": {
 "group": {
 "features": "0x3ffddff8eeacfffb",
 "release": "luminous",
 "num": 3
 }
 },
 "osd": {
 "group": {
 "features": "0x3ffddff8eeacfffb",
 "release": "luminous",
 "num": 40
 }
 },
 "client": {
 "group": {
 "features": "0x27fddff8ee8cbffb",
 "release": "jewel",
 "num": 813
 },
 "group": {
 "features": "0x3ffddff8eeacfffb",
 "release": "luminous",
 "num": 3
 }
 }
}


Yes, 0x27fddff8ee8cbffb does not support upmap. Are these kernel clients or
qemu VMs?




k



Re: [ceph-users] ceph balancer do not start

2019-10-23 Thread Konstantin Shalygin

root@ceph-mgr:~# ceph balancer mode upmap
root@ceph-mgr:~# ceph balancer optimize myplan
root@ceph-mgr:~# ceph balancer show myplan
# starting osdmap epoch 409753
# starting crush version 84
# mode upmap
ceph osd pg-upmap-items 4.18e 34 13
ceph osd pg-upmap-items 4.36d 24 20
ceph osd pg-upmap-items 7.2 10 15
ceph osd pg-upmap-items 7.3 24 20 4 17
ceph osd pg-upmap-items 7.4 0 16 4 25
ceph osd pg-upmap-items 7.5 19 2 8 13
ceph osd pg-upmap-items 7.7 8 21
root@ceph-mgr:~# ceph balancer execute myplan
Error EPERM: min_compat_client jewel < luminous, which is required for pg-upmap. Try 'ceph osd set-require-min-compat-client luminous' before using the new interface
root@ceph-mgr:~# ceph osd set-require-min-compat-client luminous
Error EPERM: cannot set require_min_compat_client to luminous: 811 connected client(s) look like jewel (missing 0x820); add --yes-i-really-mean-it to do it anyway
root@ceph-mgr:~#


What is your `ceph features`?



k



Re: [ceph-users] How to reset compat weight-set changes caused by PG balancer module?

2019-10-22 Thread Konstantin Shalygin

Apparently the PG balancer crush-compat mode adds some crush bucket weights. 
Those cause major havoc in our cluster, our PG distribution is all over the 
place.
Seeing things like this:...
  97   hdd 9.09470  1.0 9.1 TiB 6.3 TiB 6.3 TiB  32 KiB  17 GiB 2.8 TiB 69.03 1.08  28 up
  98   hdd 9.09470  1.0 9.1 TiB 4.5 TiB 4.5 TiB  96 KiB  11 GiB 4.6 TiB 49.51 0.77  20 up
  99   hdd 9.09470  1.0 9.1 TiB 7.0 TiB 6.9 TiB  80 KiB  18 GiB 2.1 TiB 76.47 1.20  31 up
Filling rates are from 50 - 90%. Unfortunately reweighing doesn't seem to help 
and I suspect it's because of bucket weights which are WEIRD
     bucket_id -42
     weight_set [
   [ 7.846 11.514 9.339 9.757 10.173 8.900 9.164 6.759 ]


I disabled the module already but the rebalance is broken now.
Do I have to hand reset this and push a new crush map? This is a sensitive 
production cluster, I don't feel pretty good about that.
Thanks for any ideas..


Run `ceph osd crush weight-set rm-compat` and use upmap mode instead.
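
A minimal sketch of the full switch from crush-compat to upmap (assumes your clients are all luminous-or-newer, which upmap requires):

```
ceph balancer off                                  # stop the balancer while changing modes
ceph osd crush weight-set rm-compat                # drop the compat weight-set left by crush-compat
ceph osd set-require-min-compat-client luminous    # refuses if pre-luminous clients are connected
ceph balancer mode upmap
ceph balancer on
```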



k



Re: [ceph-users] Pool statistics via API

2019-10-10 Thread Konstantin Shalygin

Currently I am getting the pool statistics (especially USED/MAX AVAIL) via the
command line:
ceph df -f json-pretty | jq '.pools[] | select(.name == "poolname") | .stats.max_avail'
ceph df -f json-pretty | jq '.pools[] | select(.name == "poolname") | .stats.bytes_used'

Command "ceph df" does not show the (total) size of the provisioned RBD images.
It only shows the real usage.

I managed to get the total size of provisioned images using the Python rbd
module: https://docs.ceph.com/docs/master/rbd/api/librbdpy/
Using the same Python module I also would like to get the USED/MAX AVAIL per
pool. That should be possible using rbd.RBD().pool_stats_get, but unfortunately
my python-rbd version doesn't support that (running 12.2.8).

So I went ahead and enabled the dashboard to see if the data is present in the
dashboard and it seems it is. Next step is to enable the restful module and
access this information, right? But unfortunately the restful api doesn't
provide this information.

My question is, how can I access the USED/MAX AVAIL information of a pool
without using the ceph command line and without upgrading my python-rbd package?

Kind regards


Why not just use the Prometheus metrics?
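
A minimal sketch, assuming the built-in prometheus mgr module; the exact pool metric names (e.g. ceph_pool_bytes_used, ceph_pool_max_avail) vary a bit between releases, so check your own /metrics output:

```
ceph mgr module enable prometheus
# the active mgr exposes metrics on port 9283 by default
curl -s http://<active-mgr-host>:9283/metrics | grep -E 'ceph_pool_(bytes_used|max_avail)'
```

In recent releases there is also a ceph_pool_metadata series you can join on pool_id to map ids to pool names.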



k



Re: [ceph-users] hanging/stopped recovery/rebalance in Nautilus

2019-10-03 Thread Konstantin Shalygin

Hi, I often observed now that the recovery/rebalance in Nautilus starts quite 
fast but gets extremely slow (2-3 objects/s) even if there are like 20 OSDs 
involved. Right now I am moving (reweighted to 0) 16x8TB disks, it's running 
since 4 days and since 12h it's kind of stuck now at
   cluster:
     id: 2f525d60-aada-4da6-830f-7ba7b46c546b
     health: HEALTH_WARN
     Degraded data redundancy: 1070/899796274 objects degraded 
(0.000%), 1 pg degraded, 1 pg undersized
     1216 pgs not deep-scrubbed in time
     1216 pgs not scrubbed in time
   
   services:

     mon: 1 daemons, quorum km-fsn-1-dc4-m1-797678 (age 8w)
     mgr: km-fsn-1-dc4-m1-797678(active, since 6w)
     mds: xfd:1 {0=km-fsn-1-dc4-m1-797678=up:active}
     osd: 151 osds: 151 up (since 3d), 151 in (since 7d); 24 remapped pgs
     rgw: 1 daemon active (km-fsn-1-dc4-m1-797680)
  
   data:

     pools:   13 pools, 10433 pgs
     objects: 447.45M objects, 282 TiB
     usage:   602 TiB used, 675 TiB / 1.2 PiB avail
     pgs: 1070/899796274 objects degraded (0.000%)
  261226/899796274 objects misplaced (0.029%)
  10388 active+clean
  24    active+clean+remapped
  19    active+clean+scrubbing+deep
  1 active+undersized+degraded
  1 active+clean+scrubbing
  
   io:

     client:   10 MiB/s rd, 18 MiB/s wr, 141 op/s rd, 292 op/s wr


osd-max-backfill is at 16 for all OSDs.
Anyone got an idea why rebalance completely stopped?
Thanks


Try to lower the recovery sleep options:

osd_recovery_sleep_hdd -> for HDDs without RocksDB on NVMe;
osd_recovery_sleep_hybrid -> for hybrid setups, i.e. RocksDB on NVMe;
osd_recovery_sleep_ssd -> for non-rotational devices;
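
A minimal sketch of changing these at runtime on Nautilus (the value is only an example, not a recommendation):

```
# via the mon config store, applies to all OSDs
ceph config set osd osd_recovery_sleep_hybrid 0.0
# or inject into the running daemons only
ceph tell 'osd.*' injectargs '--osd_recovery_sleep_hybrid 0.0'
```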



k



Re: [ceph-users] Local Device Health PG inconsistent

2019-09-18 Thread Konstantin Shalygin

I was able to get OSDs to boot by updating from 14.2.2 to 14.2.4.
Unclear why this would improve things, but it at least got me running again.

I guess it was covered by this PR [1].



[1] https://github.com/ceph/ceph/pull/29115

k



Re: [ceph-users] multiple RESETSESSION messages

2019-09-13 Thread Konstantin Shalygin

We have a 5 node Luminous cluster on which we see multiple RESETSESSION
messages for OSDs on the last node alone.

's=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=2613 cs=1
l=0).handle_connect_reply connect got RESETSESSION'

We found the below fix for this issue, but not able to identify the correct
Luminous release in which this is/will be available.
https://github.com/ceph/ceph/pull/25343

Can someone help us with this please?

This fix is still not backported to Luminous [1].



[1] https://tracker.ceph.com/issues/37521

k



Re: [ceph-users] iostat and dashboard freezing

2019-09-12 Thread Konstantin Shalygin

On 9/13/19 4:51 AM, Reed Dier wrote:
I would love to deprecate the multi-root, and may try to do just that 
in my next OSD add, just worried about data shuffling unnecessarily.

Would this in theory help my distribution across disparate OSD topologies?


Maybe. Actually I don't know where the balancer gets stuck for so long (my
cluster doesn't have this issue).




k



Re: [ceph-users] How to create multiple Ceph pools, based on drive type/size/model etc?

2019-09-11 Thread Konstantin Shalygin





Right - but what if you have two types of NVMe drives?

I thought that there's only a fixed enum of device classes - hdd, ssd, or
nvme.

You can't add your own ones, right?

Indeed you can: `ceph osd crush set-device-class nvme2 osd.0`.
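
A minimal sketch of putting such a custom class to use (rule, pool and class names are just examples; note that an OSD that already has a class needs `ceph osd crush rm-device-class` first):

```
ceph osd crush rm-device-class osd.0 osd.1 osd.2              # clear the auto-assigned "ssd" class
ceph osd crush set-device-class nvme2 osd.0 osd.1 osd.2       # tag the Optane OSDs
ceph osd crush rule create-replicated optane-rule default host nvme2
ceph osd pool set fast-pool crush_rule optane-rule            # point a pool at the new rule
```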



k



Re: [ceph-users] How to create multiple Ceph pools, based on drive type/size/model etc?

2019-09-11 Thread Konstantin Shalygin

Right - but what if you have two types of NVMe drives?

I thought that there's only a fixed enum of device classes - hdd, ssd, or
nvme.

You can't add your own ones, right?

Indeed you can: `ceph osd crush set-device-class nvme2 osd.0`.



k



Re: [ceph-users] Ceph Balancer Limitations

2019-09-11 Thread Konstantin Shalygin

We're using Nautilus 14.2.2 (upgrading soon to 14.2.3) on 29 CentOS osd servers.

We've got a large variation of disk sizes and host densities. Such
that the default crush mappings lead to an unbalanced data and pg
distribution.

We enabled the balancer manager module in pg upmap mode. The balancer
commands frequently hang indefinitely when enabled and then queried.
Even issuing a balancer off will hang for hours unless issued within
about a minute of the manager restarting. I digress.

In upmap mode, it looks like ceph only moves osd mappings within a
host. Is this the case?

I bring this up because we've got one disk that is sitting at 88%
utilization and I've been unable to bring this down. The next most
utilized disks are at 80%, and even then, I think that could be
reduced.

If the limitation is that upmap mode cannot remap to OSDs on different
hosts, then that might be something to document, as it is a
significant difference from crush-compat.

Another thing to document would be how to move between the two modes.

I think this is needed to move between crush-compat and upmap: ceph
osd crush weight-set rm-compat

I don't know about the reverse, though.

ceph osd df tree [1]
pg upmap items from the osdmap [2]

[1]https://people.cs.ksu.edu/~mozes/ceph_balancer_query/ceph_osd_df_tree.txt
[2]https://people.cs.ksu.edu/~mozes/ceph_balancer_query/pg_upmap_items.txt


To remove upmaps you can execute `ceph osd rm-pg-upmap-items ${upmap}` for each
entry in your dump.

Don't forget to turn the balancer off before that operation.
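
A minimal sketch for clearing them all at once (assumes the `pg_upmap_items` lines that `ceph osd dump` prints, where the second field is the PG id):

```
ceph balancer off
ceph osd dump | awk '$1 == "pg_upmap_items" {print $2}' | while read -r pgid; do
    ceph osd rm-pg-upmap-items "$pgid"
done
```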



k



Re: [ceph-users] How to create multiple Ceph pools, based on drive type/size/model etc?

2019-09-11 Thread Konstantin Shalygin

I have a 3-node Ceph cluster, with a mixture of Intel Optane 905P PCIe
disks, and normal SATA SSD drives.

I want to create two Ceph pools, one with only the Optane disks, and the
other with only the SATA SSDs.

When I checked "ceph osd tree", all the drives had device class "ssd".

As a hack - I was able to change the device class for the Optane drives to
"nvme", and leave the SATA SSDs as "ssd".

I then created crush rules based on device classes.

However, what if I don't want to overload device classes to achieve this,
or have more than two models of disks?

Is there an easy way to assign specific drives of a model/type/capacity to
different pools?

There is already a simple way to do that - device classes.

They are like 802.1q VLANs in networking.



k



Re: [ceph-users] AutoScale PG Questions - EC Pool

2019-09-10 Thread Konstantin Shalygin

On 9/10/19 1:17 PM, Ashley Merrick wrote:
So I am correct in 2048 being a very high number and should go for 
either 256 or 512 like you said for a cluster of my size with the EC 
Pool of 8+2?



Indeed. I suggest staying at 256.



k



Re: [ceph-users] AutoScale PG Questions - EC Pool

2019-09-10 Thread Konstantin Shalygin

I have a EC Pool (8+2) which has 30 OSD (3 Nodes), grown from the orginal 10 
OSD (1 Node).



I originally set the pool with a PG_NUM of 300, however the AutoScale PG is 
showing a warn saying I should set this to 2048, I am not sure if this is a 
good suggestion or if the Autoscale currently is not suggested for EC pool's 
due to the slightly diff calculations used.



Currently with a PG NUM of 300 each OSD has around 100PG, changing to 2048 is 
going to increase this massively.


You should not use 300 PGs because it is not a power of two. You should set 
pg_num to 256 or 512.
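
As a rough sanity check (a sketch assuming the usual target of roughly 100 PGs per OSD, and that every 8+2 EC PG places a shard on 10 OSDs):

```
PGs per OSD ≈ pg_num * (k + m) / num_osds
  256  * 10 / 30 ≈  85   # comfortable
  512  * 10 / 30 ≈ 171   # already on the high side
  2048 * 10 / 30 ≈ 683   # far too many
```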





k



Re: [ceph-users] iostat and dashboard freezing

2019-09-09 Thread Konstantin Shalygin

On 9/2/19 5:47 PM, Jake Grimmett wrote:

Hi Konstantin,

To confirm, disabling the balancer allows the mgr to work properly.

I tried re-enabling the balancer, it briefly worked, then locked up the
mgr again.

Here it's working OK...
[root@ceph-s1 ~]# time ceph balancer optimize new

real    0m1.628s
user    0m0.583s
sys     0m0.075s

[root@ceph-s1 ~]# ceph balancer status
{
 "active": false,
 "plans": [
 "new"
 ],
 "mode": "upmap"
}

[root@ceph-s1 ~]# ceph balancer on

At this point, the balancer seems initially to be working as 'ceph -s'
shows the misplaced count going from 0 to ...
 pgs: 6829497/4977639365 objects misplaced (0.137%)

However mgr now goes back up to 100% CPU, and stopping balancer is very
difficult

[root@ceph-s1 ~]# time ceph balancer off

real    5m37.641s
user    0m0.751s
sys     0m0.158s

[root@ceph-s1 ~]# time ceph balancer optimize new

real    18m19.202s
user    0m1.388s
sys     0m0.413s


Here is the other data you requested:
[root@ceph-s1 ~]# ceph config-key ls | grep balance
 "config-history/10/+mgr/mgr/balancer/active",
 "config-history/29/+mgr/mgr/balancer/active",
 "config-history/29/-mgr/mgr/balancer/active",
 "config-history/30/+mgr/mgr/balancer/active",
 "config-history/30/-mgr/mgr/balancer/active",
 "config-history/31/+mgr/mgr/balancer/active",
 "config-history/31/-mgr/mgr/balancer/active",
 "config-history/32/+mgr/mgr/balancer/active",
 "config-history/32/-mgr/mgr/balancer/active",
 "config-history/33/+mgr/mgr/balancer/active",
 "config-history/33/-mgr/mgr/balancer/active",
 "config-history/9/+mgr/mgr/balancer/mode",
 "config/mgr/mgr/balancer/active",
 "config/mgr/mgr/balancer/mode",

We have two main pools:
pool #1 is 3x replicated, has 4 NVMe OSD and is only used for cephfs
metadata. This is on 4 nodes (that also run the mgr, mon and mds)

Pool #2 is erasure encoded 8+2, has 324 x 12TB OSD over 36 nodes, and is
the data partition for cephfs. All osd in pool 2 have a db/wal on nvme
(6 hdd per NVMe)

'ceph df detail' is here:


'ceph osd tree' is here:
http://p.ip.fi/k1x2

'ceph osd df tree' output is  here:
http://p.ip.fi/g7ma

any help appreciated,



Jake, you already have good VAR values for your OSDs.

I suggest setting `mgr/balancer/upmap_max_iterations` to '2', and setting the
balancer sleep interval (`mgr/balancer/sleep_interval`) to '300'.
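
A minimal sketch of applying that (assumes a Mimic/Nautilus mgr where module options live in the mon config store, which matches the `config/mgr/...` keys in your listing):

```
ceph config set mgr mgr/balancer/upmap_max_iterations 2
ceph config set mgr mgr/balancer/sleep_interval 300
```

The module should pick the new values up on its next cycle.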





k



Re: [ceph-users] iostat and dashboard freezing

2019-09-09 Thread Konstantin Shalygin

On 8/29/19 9:56 PM, Reed Dier wrote:

"config/mgr/mgr/balancer/active",
"config/mgr/mgr/balancer/max_misplaced",
"config/mgr/mgr/balancer/mode",
"config/mgr/mgr/balancer/pool_ids",


These keys are useless; you may remove them.


https://pastebin.com/bXPs28h1

Issues that you have:

1. Multi-root. You should deprecate your 'ssd' root and move the OSDs of
this root to the 'default' root.


2. Some of your OSDs are reweighted, for example
osd.44, osd.50, osd.57, osd.60, osd.102, osd.107. For upmap to work properly,
all OSDs should be left unreweighted (reweight 1.0).



$ time ceph balancer optimize newplan1
Error EALREADY: Unable to find further optimization, or pool(s)' 
pg_num is decreasing, or distribution is already perfect


real    3m10.627s
user    0m0.352s
sys     0m0.055s


Setting the key `mgr/balancer/upmap_max_iterations` to '2' should decrease 
this time.





k



Re: [ceph-users] iostat and dashboard freezing

2019-08-28 Thread Konstantin Shalygin

Just a follow up 24h later, and the mgr's seem to be far more stable, and have 
had no issues or weirdness after disabling the balancer module.

Which isn't great, because the balancer plays an important role, but after 
fighting distribution for a few weeks and getting it 'good enough' I'm taking 
the stability.

Just wanted to follow up with another 2¢.
What are your balancer settings (`ceph config-key ls`)? Is your mgr running 
in a virtual environment or on bare metal?


How many pools do you have? Please also paste `ceph osd tree` & `ceph osd 
df tree`.


Measure time of balancer plan creation: `time ceph balancer optimize new`.



k



Re: [ceph-users] Ceph + SAMBA (vfs_ceph)

2019-08-28 Thread Konstantin Shalygin


On 8/29/19 1:32 AM, Salsa wrote:

This is the result:

# testparm -s
Load smb config files from /etc/samba/smb.conf
rlimit_max: increasing rlimit_max (1024) to minimum Windows limit (16384)
Processing section "[homes]"
Processing section "[cephfs]"
Processing section "[printers]"
Processing section "[print$]"
Loaded services file OK.
Server role: ROLE_STANDALONE

# Global parameters
[global]
load printers = No
netbios name = SAMBA-CEPH
printcap name = cups
security = USER
workgroup = CEPH
smbd: backgroundqueue = no
idmap config * : backend = tdb
cups options = raw
valid users = samba
...
[cephfs]
create mask = 0777
directory mask = 0777
guest ok = Yes
guest only = Yes
kernel share modes = No
path = /
read only = No
vfs objects = ceph
ceph: user_id = samba
ceph:config_file = /etc/ceph/ceph.conf


I cut off some parts I thought were not relevant.



Use `map to guest = Bad User` instead of `valid users = samba`.

```

[cephfs]
  path = /
  vfs objects = acl_xattr ceph
  ceph: config_file = /etc/ceph/ceph.conf
  ceph: user_id = samba
  oplocks = no
  kernel share modes = no
  browseable = yes
  public = yes
  writable = yes
  guest ok = yes
  force user = root
  force group = root
  create mask = 0644
  directory mode = 0755
```

Reload and try `smbclient -U guest -N //10.17.6.68/cephfs`



k



Re: [ceph-users] health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import lockis held by another thread.

2019-08-28 Thread Konstantin Shalygin

On 8/28/19 8:16 PM, Peter Eisch wrote:


Thank you for your reply. The I receive an error as the module can't 
be disabled.


I may have worked through this by restarting the nodes in a rapid 
succession. 



What exactly is the error? Maybe you caught a bug and should create a 
redmine ticket for this issue.





k



Re: [ceph-users] Ceph + SAMBA (vfs_ceph)

2019-08-28 Thread Konstantin Shalygin

I'm running a ceph installation on a lab to evaluate for production and I have 
a cluster running, but I need to mount on different windows servers and 
desktops. I created an NFS share and was able to mount it on my Linux desktop, 
but not a Win 10 desktop. Since it seems that Windows server 2016 is required 
to mount the NFS share I quit that route and decided to try samba.

I compiled a version of Samba that has this vfs_ceph module, but I can't set it 
up correctly. It seems I'm missing some user configuration as I've hit this 
error:

"
~$ smbclient -U samba.gw //10.17.6.68/cephfs_a
WARNING: The "syslog" option is deprecated
Enter WORKGROUP\samba.gw's password:
session setup failed: NT_STATUS_LOGON_FAILURE
"
Does anyone know of any good setup tutorial to follow?

This is my smb config so far:

# Global parameters
[global]
load printers = No
netbios name = SAMBA-CEPH
printcap name = cups
security = USER
workgroup = CEPH
smbd: backgroundqueue = no
idmap config * : backend = tdb
cups options = raw
valid users = samba

[cephfs]
create mask = 0777
directory mask = 0777
guest ok = Yes
guest only = Yes
kernel share modes = No
path = /
read only = No
vfs objects = ceph
ceph: user_id = samba
ceph:config_file = /etc/ceph/ceph.conf

Thanks


Your configuration seems correct, but the conf may contain special 
characters such as stray spaces or wrong-case option names. The first thing 
you should do is run `testparm -s` and paste its output here.




k



Re: [ceph-users] health: HEALTH_ERR Module 'devicehealth' has failed: Failed to import _strptime because the import lockis held by another thread.

2019-08-28 Thread Konstantin Shalygin

What is the correct/best way to address a this?  It seems like a python issue, maybe it's 
time I learn how to "restart" modules?  The cluster seems to be working beyond 
this.


Restarting a single module is done with: `ceph mgr module disable devicehealth ; 
ceph mgr module enable devicehealth`.





k



Re: [ceph-users] Ceph capacity versus pool replicated size discrepancy?

2019-08-14 Thread Konstantin Shalygin

On 8/14/19 6:19 PM, Kenneth Van Alstyne wrote:
Got it!  I can calculate individual clone usage using “rbd du”, but 
does anything exist to show total clone usage across the pool? 
 Otherwise it looks like phantom space is just missing. 


rbd du for each snapshot, I think...
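
A minimal sketch for getting that per image across a whole pool (assumes the pool is called "rbd"; `rbd du` already includes each image's snapshots, and is much faster when the fast-diff feature is enabled):

```
# usage of every image in the pool, head plus snapshots
for img in $(rbd ls rbd); do
    rbd du rbd/"$img"
done
# there is also `rbd du --format json ...` if you want to sum the totals in a script
```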




k



Re: [ceph-users] Ceph capacity versus pool replicated size discrepancy?

2019-08-13 Thread Konstantin Shalygin

Hey guys, this is probably a really silly question, but I’m trying to reconcile 
where all of my space has gone in one cluster that I am responsible for.

The cluster is made up of 36 2TB SSDs across 3 nodes (12 OSDs per node), all 
using FileStore on XFS.  We are running Ceph Luminous 12.2.8 on this particular 
cluster. The only pool where data is heavily stored is the “rbd” pool, of which 
7.09TiB is consumed.  With a replication of “3”, I would expect that the raw 
used to be close to 21TiB, but it’s actually closer to 35TiB.  Some additional 
details are below.  Any thoughts?

[cluster]root@dashboard:~# ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED    %RAW USED
    62.8TiB     27.8TiB     35.1TiB     55.81
POOLS:
    NAME                        ID    USED        %USED    MAX AVAIL    OBJECTS
    rbd                         0     7.09TiB     53.76    6.10TiB      3056783
    data                        3     29.4GiB     0.47     6.10TiB      7918
    metadata                    4     57.2MiB     0        6.10TiB      95
    .rgw.root                   5     1.09KiB     0        6.10TiB      4
    default.rgw.control         6     0B          0        6.10TiB      8
    default.rgw.meta            7     0B          0        6.10TiB      0
    default.rgw.log             8     0B          0        6.10TiB      207
    default.rgw.buckets.index   9     0B          0        6.10TiB      0
    default.rgw.buckets.data    10    0B          0        6.10TiB      0
    default.rgw.buckets.non-ec  11    0B          0        6.10TiB      0

[cluster]root@dashboard:~# ceph --version
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)

[cluster]root@dashboard:~# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 414873 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application rbd
pool 3 'data' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins 
pg_num 682 pgp_num 682 last_change 409614 flags hashpspool 
crash_replay_interval 45 min_write_recency_for_promote 1 stripe_width 0 
application cephfs
pool 4 'metadata' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 682 pgp_num 682 last_change 409617 flags hashpspool 
min_write_recency_for_promote 1 stripe_width 0 application cephfs
pool 5 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409710 lfor 0/336229 flags 
hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409711 lfor 0/336232 
flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409713 lfor 0/336235 flags 
hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 409 pgp_num 409 last_change 409712 lfor 0/336238 flags 
hashpspool stripe_width 0 application rgw
pool 9 'default.rgw.buckets.index' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409714 lfor 0/336241 
flags hashpspool stripe_width 0 application rgw
pool 10 'default.rgw.buckets.data' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409715 lfor 0/336244 
flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.buckets.non-ec' replicated size 3 min_size 1 crush_rule 0 
object_hash rjenkins pg_num 409 pgp_num 409 last_change 409716 lfor 0/336247 
flags hashpspool stripe_width 0 application rgw

[cluster]root@dashboard:~# ceph osd lspools
0 rbd,3 data,4 metadata,5 .rgw.root,6 default.rgw.control,7 default.rgw.meta,8 
default.rgw.log,9 default.rgw.buckets.index,10 default.rgw.buckets.data,11 
default.rgw.buckets.non-ec,

[cluster]root@dashboard:~# rados df
POOL_NAME  USEDOBJECTS CLONES  COPIES  MISSING_ON_PRIMARY 
UNFOUND DEGRADED RD_OPS  RD  WR_OPS  WR
.rgw.root  1.09KiB   4   0  12  0   
00  128KiB   0  0B
data   29.4GiB7918   0   23754  0   
00 1414777 3.74TiB 3524833 4.54TiB
default.rgw.buckets.data0B   0   0   0  0   
0

Re: [ceph-users] Nautilus - Balancer is always on

2019-08-07 Thread Konstantin Shalygin

ceph mgr module disable balancer

Error EINVAL: module 'balancer' cannot be disabled (always-on)

  


What's the way to restart the balancer? Restart the MGR service?

  


I want to suggest to the balancer developers to set up a ceph-balancer.log for this
module, to get more information about what it's doing.



Maybe you should `ceph balancer off` first?



k



Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Konstantin Shalygin

On 8/7/19 2:30 PM, Robert LeBlanc wrote:

... plus 11 more hosts just like this


Interesting. Please paste your full `ceph osd df tree`. What are your 
NVMe models, actually?


Yes, our HDD cluster is much like this, but not Luminous, so we 
created as separate root with SSD OSD for the metadata and set up a 
CRUSH rule for the metadata pool to be mapped to SSD. I understand 
that the CRUSH rule should have a `step take default class ssd` which 
I don't see in your rule unless the `~` in the item_name means device 
class.

Indeed, this is a device class.


And a new crush rule may be created like this:
`ceph osd crush rule create-replicated <name> <root> <failure-domain-type> <device-class>`.
For me it is: `ceph osd crush rule create-replicated 
replicated_racks_nvme default rack nvme`




k




Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Konstantin Shalygin

On 8/7/19 1:40 PM, Robert LeBlanc wrote:

Maybe it's the lateness of the day, but I'm not sure how to do that. 
Do you have an example where all the OSDs are of class ssd?
I can't parse what you mean. You should always paste your `ceph osd tree` 
first.


Yes, we can set quotas to limit space usage (or number objects), but 
you can not reserve some space that other pools can't use. The problem 
is if we set a quota for the CephFS data pool to the equivalent of 95% 
there are at least two scenario that make that quota useless.


Of course. In 95% of CephFS deployments the metadata pool is on flash drives 
with enough space for this.



```

pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash 
rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool 
stripe_width 0 application cephfs
pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool 
stripe_width 0 application cephfs


```

```

# ceph osd crush rule dump replicated_racks_nvme
{
    "rule_id": 0,
    "rule_name": "replicated_racks_nvme",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
    {
    "op": "take",
    "item": -44,
    "item_name": "default~nvme"    <
    },
    {
    "op": "chooseleaf_firstn",
    "num": 0,
    "type": "rack"
    },
    {
    "op": "emit"
    }
    ]
}
```



k



Re: [ceph-users] New CRUSH device class questions

2019-08-06 Thread Konstantin Shalygin

Is it possible to add a new device class like 'metadata'?


Yes, but you don't need this. Just use your existing class with another 
crush ruleset.




If I set the device class manually, will it be overwritten when the OSD
boots up?


Nope. Classes are assigned automatically when the OSD is created, not when it boots.



I readhttps://ceph.com/community/new-luminous-crush-device-classes/  and it
mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
the question.


But it's not magic. Sometimes a drive can be a SATA SSD, but the kernel 
reports it as 'rotational'...




We will still have 13 OSDs, it will be overkill for space for metadata, but
since Ceph lacks a reserve space feature, we don't have  many options. This
cluster is so fast that it can fill up in the blink of an eye.



Not true. You can always set a per-pool quota in bytes, for example:

* your meta is 1G;

* your raw space is 300G;

* your data is 90G;

Set a quota on your data pool: `ceph osd pool set-quota <data-pool> 
max_bytes 96636762000`





k





Re: [ceph-users] Is the admin burden avoidable? "1 pg inconsistent" every other day?

2019-08-04 Thread Konstantin Shalygin

Question:  If you have enough osds it seems an almost daily thing when
you get to work in the morning there' s a "ceph health error"  "1 pg
inconsistent"   arising from a 'scrub error'.   Or 2, etc.   Then like
most such mornings you look to see there's two or more valid instances
of the pg and one with an issue.  So, like putting on socks that just
takes time every day: there's the 'ceph pg repair xx' (making note of
the likely soon to fail osd) then hey presto on with the day.

Am I missing some way to automate this and be notified only if one
attempt at pg repair has failed and just a log entry for successful
repairs?   Calls about dashboard "HEALTH ERR" warnings so often I don't
need.

Ideas welcome!



You can set `osd_scrub_auto_repair` to true to automatically repair damaged 
objects detected during scrub.
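
A minimal sketch of turning it on (the `ceph config` form assumes a Mimic-or-newer mon-managed configuration; on older releases put it under [osd] in ceph.conf and restart the OSDs):

```
ceph config set osd osd_scrub_auto_repair true
# optionally cap how many errors a scrub may fix on its own before a human is needed
# ceph config set osd osd_scrub_auto_repair_num_errors 5
```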





k



Re: [ceph-users] Balancer in HEALTH_ERR

2019-08-01 Thread Konstantin Shalygin

Two weeks ago, we started a data migration from one old ceph node to a new
one.

For task we added a 120TB Host to the cluster and evacuated the old one with
the ceph osd crush reweight osd.X 0.0 that move near 15 TB per day.

  


After 1 week and few days we found that balancer module don't work fine
under this situacion it don't distribute data between OSD if cluster is not
HEALTH status.

  


The current situation , some osd are at 96% and others at 75% , causing some
pools get very nearfull 99%.

  


I read several post about balancer only works in HEALHTY mode and that's the
problem, because ceph don't distribute the data equal between OSD in native
mode, causing in the scenario of "Evacuate+Add" huge problems.

  


Info:https://pastebin.com/HuEt5Ukn

  


Right now for solve we are manually change weight of most used osd.

  


Anyone more got this problem?



You can determine your biggest pools like this:


```
(header="pool objects bytes_used max_avail"; echo "$header"; echo "$header" | tr '[[:alpha:]_]' '-'; ceph df --format=json | jq '.pools[]|(.name,.stats.objects,.stats.bytes_used,.stats.max_avail)' | paste - - - -) | column -t
```


Then you can select your PGs for this pool:


```
(header="pg_id pg_used_mbytes pg_used_objects"; echo "$header"; echo "$header" | tr '[[:alpha:]_]' '-'; ceph pg ls-by-pool <pool> --format=json | jq 'sort_by(.stat_sum.num_bytes) | .[] | (.pgid, .stat_sum.num_bytes/1024/1024, .stat_sum.num_objects)' | paste - - -) | column -t
```


And then upmap your biggest PGs to less-filled OSDs.

Or, another way: list the PGs of your already nearfull OSDs with `ceph 
pg ls-by-osd osd.0` and upmap them from that OSD to less-filled OSDs.
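
A minimal sketch of one such manual move (the PG and OSD ids are only examples; the syntax is `ceph osd pg-upmap-items <pgid> <from-osd> <to-osd> [...]`, and it needs require-min-compat-client luminous):

```
ceph osd pg-upmap-items 4.18e 97 45   # remap this PG's shard from osd.97 to osd.45
# undo later with:
# ceph osd rm-pg-upmap-items 4.18e
```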




gl,

k




Re: [ceph-users] Nautilus: significant increase in cephfs metadata pool usage

2019-07-25 Thread Konstantin Shalygin

we just recently upgraded our cluster from luminous 12.2.10 to nautilus
14.2.1 and I noticed a massive increase of the space used on the cephfs
metadata pool although the used space in the 2 data pools  basically did
not change. See the attached graph (NOTE: log10 scale on y-axis)

Is there any reason that explains this?


Dietmar, how is your metadata usage now? Has it stopped growing?




k



Re: [ceph-users] Ceph OSD daemon possibly causes network card issues

2019-07-19 Thread Konstantin Shalygin

On 7/19/19 5:59 PM, Geoffrey Rhodes wrote:


Holding thumbs this helps however I still don't understand why the 
issue only occurs on ceph-osd nodes.
ceph-mon and ceph-mds nodes and even a cech client with the same 
adapters do not have these issues.


Because OSD hosts actually do the data storage work, and your 1G NICs are 
under heavy load.




k



Re: [ceph-users] Legacy BlueStore stats reporting?

2019-07-19 Thread Konstantin Shalygin

Using Ceph-Ansible stable-4.0 I did a rolling update from latest Mimic to 
Nautilus 14.2.2 on a cluster yesterday, and the update ran to completion 
successfully.

However, in ceph status I see a warning of the form "Legacy BlueStore stats 
reporting detected” for all OSDs in the cluster.

Can anyone help me with what has gone wrong, and what should be done to fix it?

I think you should start by running a repair on your OSDs - [1]



[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/035889.html
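
In practice that means, per OSD, something like the following sketch (adjust the OSD id; the OSD must be stopped while ceph-bluestore-tool runs):

```
systemctl stop ceph-osd@12
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12
```

Once every OSD has been repaired, the "Legacy BlueStore stats reporting" warning should clear.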


k



Re: [ceph-users] Ceph OSD daemon causes network card issues

2019-07-18 Thread Konstantin Shalygin

On 7/18/19 7:43 PM, Geoffrey Rhodes wrote:

Sure, also attached.


Try to disable flow control via `ethtool -A <iface> rx off tx off`.



k




Re: [ceph-users] Changing the release cadence

2019-07-18 Thread Konstantin Shalygin

Arch Linux packager for Ceph here o/


I'll take this opportunity to raise something that is not about Ceph 
packaging itself, but is Arch Linux + Ceph related.
With the current Arch Linux packaging it is impossible to build a "Samba CTDB 
Cluster with CephFS backend". This is caused by a lack of build options; 
ticket requests for this have been ignored for years: [1], [2]. AFAIK all 
distros lack full RADOS support, the only exception being SUSE - because 
most of the RADOS features come to Samba from SUSE employees. For CentOS 7 
this is covered here [3].


Maybe Thore can raise this question among the samba package maintainers.



[1] https://bugs.archlinux.org/task/53467
[2] https://bugs.archlinux.org/task/49356
[3] https://lists.samba.org/archive/samba/2019-July/224288.html

Thanks,
k


Re: [ceph-users] Ceph OSD daemon causes network card issues

2019-07-18 Thread Konstantin Shalygin

I've been having an issue since upgrading my cluster to Mimic 6 months ago
(previously installed with Luminous 12.2.1).
All nodes that have the same PCIe network card seem to loose network
connectivity randomly. (frequency ranges from a few days to weeks per host
node)
The affected nodes only have the Intel 82576 LAN Card in common, different
motherboards, installed drives, RAM and even PSUs.
Nodes that have the Intel I350 cards are not affected by the Mimic upgrade.
Each host node has recommended RAM installed and has between 4 and 6 OSDs /
sata hard drives installed.
The cluster operated for over a year (Luminous) without a single issue,
only after the Mimic upgrade did the issues begin with these nodes.
The cluster is only used for CephFS (file storage, low intensity usage) and
makes use of erasure data pool (K=4, M=2).

I've tested many things, different kernel versions, different Ubuntu LTS
releases, re-installation and even CENTOS 7, different releases of Mimic,
different igb drivers.
If I stop the ceph-osd daemons the issue does not occur.  If I swap out the
Intel 82576 card with the Intel I350 the issue is resolved.
I haven't any more ideas other than replacing the cards but I feel the
issue is linked to the ceph-osd daemon and a change in the Mimic release.
Below are the various software versions and drivers I've tried and a log
extract from a node that lost network connectivity. - Any help or
suggestions would be greatly appreciated.

*OS:*  Ubuntu 16.04 / 18.04 and recently CENTOS 7
*Ceph Version:*Mimic (currently 13.2.6)
*Network card:*4-PORT 1GB INTEL 82576 LAN CARD (AOC-SG-I4)
*Driver:  *   igb
*Driver Versions:* 5.3.0-k / 5.3.5.22s / 5.4.0-k
*Network Config:* 2 x bonded (LACP) 1GB nic for public net,   2 x
bonded (LACP) 1GB nic for private net
*Log errors:*
Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb :03:00.0
enp3s0f0: PCIe link lost, device now detached
Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb :04:00.1
enp4s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb :03:00.1
enp3s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb :04:00.0
enp4s0f0: PCIe link lost, device now detached
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.4.1:6809
osd.16 since back 2019-06
-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.6.1:6804
osd.20 since back 2019-06
-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.7.1:6803
osd.25 since back 2019-06
-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.8.1:6803
osd.30 since back 2019-06
-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.9.1:6808
osd.43 since back 2019-06
-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
12:10:23.796726)


Paste your `ethtool -S <iface>`, `ethtool -i <iface>` and `dmesg 
-TL | grep igb`.




k



Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-07-09 Thread Konstantin Shalygin

On 5/28/19 5:16 PM, Marc Roos wrote:

I switched first of may, and did not notice to much difference in memory
usage. After the restart of the osd's on the node I see the memory
consumption gradually getting back to as before.
Can't say anything about latency.



Anybody else? Wido?

I see many patches from Igor coming to Luminous. And also the bitmap 
allocator (default in Nautilus) has been trying to kill Brett Chancellor's 
cluster for a week [1].




[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/035726.html


k



Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-09 Thread Konstantin Shalygin

I'll give that a try.  Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid

And should I expect any issues doing this?


You should place this in ceph.conf and restart your OSDs.

In other words, this works around the new bitmap allocator issue by falling 
back to the stable stupid allocator.
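
A minimal ceph.conf sketch for the OSD hosts (the option names are the ones from the question above):

```
[osd]
bluestore_allocator = stupid
bluefs_allocator = stupid
```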




k



Re: [ceph-users] Faux-Jewel Client Features

2019-07-04 Thread Konstantin Shalygin

Hi all,

Starting to make preparations for Nautilus upgrades from Mimic, and I'm looking 
over my client/session features and trying to fully grasp the situation.

$ ceph versions
{
    "mon": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 3 },
    "mgr": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 3 },
    "osd": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 204 },
    "mds": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 2 },
    "overall": { "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)": 212 }
}

$ ceph features
{
    "mon": [ { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 3 } ],
    "mds": [ { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 2 } ],
    "osd": [ { "features": "0x3ffddff8ffacfffb", "num": 204 } ],
    "client": [
        { "features": "0x7010fb86aa42ada", "release": "jewel", "num": 4 },
        { "features": "0x7018fb86aa42ada", "release": "jewel", "num": 1 },
        { "features": "0x3ffddff8eea4fffb", "release": "luminous", "num": 344 },
        { "features": "0x3ffddff8eeacfffb", "release": "luminous", "num": 200 },
        { "features": "0x3ffddff8ffa4fffb", "release": "luminous", "num": 49 },
        { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 213 } ],
    "mgr": [ { "features": "0x3ffddff8ffacfffb", "release": "luminous", "num": 3 } ]
}

$ ceph osd dump | grep compat
require_min_compat_client luminous
min_compat_client luminous

I flattened the output to make it a bit more vertical scrolling friendly.

Diving into the actual clients with those features:
# ceph daemon mon.mon1 sessions | grep jewel
"MonSession(client.1649789192 ip.2:0/3697083337 is open allow *, features 0x7010fb86aa42ada (jewel))",
"MonSession(client.1656508179 ip.202:0/2664244117 is open allow *, features 0x7018fb86aa42ada (jewel))",
"MonSession(client.1637479106 ip.250:0/1882319989 is open allow *, features 0x7010fb86aa42ada (jewel))",
"MonSession(client.1662023903 ip.249:0/3198281565 is open allow *, features 0x7010fb86aa42ada (jewel))",
"MonSession(client.1658312940 ip.251:0/3538168209 is open allow *, features 0x7010fb86aa42ada (jewel))",

ip.2 is a cephfs kernel client with 4.15.0-51-generic
ip.202 is a krbd client with kernel 4.18.0-22-generic
ip.250 is a krbd client with kernel 4.15.0-43-generic
ip.249 is a krbd client with kernel 4.15.0-45-generic
ip.251 is a krbd client with kernel 4.15.0-45-generic

For the krbd clients, the features are " features: layering, exclusive-lock".

My min_compat and require_min_compat clients are already set to Luminous, 
however, I would love some reassurance that I'm not going to run into issues 
with the krbd/kcephfs clients, and trying to make use of new features like the 
PG autoscaler for instance.
I should have full upmap compatibility as the balancer in upmap mode has been 
functioning, and given that they are relatively recent kernels.

Just looking for some sanity checks to make sure I don't have any surprises for 
these 'jewel' clients come a nautilus rollout.


Your krbd (0x7010fb86aa42ada) is enough for upmap.



k



Re: [ceph-users] RGW: Is 'radosgw-admin reshard stale-instances rm' safe?

2019-06-25 Thread Konstantin Shalygin


On 6/25/19 12:46 AM, Rudenko Aleksandr wrote:


Hi, Konstantin.

Thanks for the reply.

I know about stale instances and that they remained from prior version.

I ask about “marker” of bucket. I have bucket “clx” and I can see his 
current marker in stale-instances list.


As I know, stale-instances list must contain only previous marker ids.



Good question! I CC'ed Casey for answer...



k



Re: [ceph-users] Thoughts on rocksdb and erasurecode

2019-06-24 Thread Konstantin Shalygin

Hi

Have been thinking a bit about rocksdb and EC pools:

Since a RADOS object written to a EC(k+m) pool is split into several
minor pieces, then the OSD will receive many more smaller objects,
compared to the amount it would receive in a replicated setup.

This must mean that the rocksdb will also need to handle this more
entries, and will grow faster. This will have an impact when using
bluestore for slow HDD with DB on SSD drives, where the faster growing
rocksdb might result in spillover to slow store - if not taken into
consideration when designing the disk layout.

Are my thoughts on the right track or am I missing something?

Has somebody done any measurement on rocksdb growth, comparing replica
vs EC ?


If you don't want to be affected by spillover of block.db, use a 3/30/300 
GB partition for your block.db.




k



Re: [ceph-users] rebalancing ceph cluster

2019-06-24 Thread Konstantin Shalygin

Hello everyone,

We have some osd on the ceph.
Some osd's usage is more than 77% and another osd's usage is 39% in the same 
host.

I wonder why osd’s usage is different.(Difference is large) and how can i fix 
it?

ID  CLASS   WEIGHTREWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS TYPE NAME
  -2  93.26010- 93.3TiB 52.3TiB 41.0TiB 56.04 0.98   - host 
serverA
…...
  33 HDD  9.09511  1.0 9.10TiB 3.55TiB 5.54TiB 39.08 0.68  66 osd.4
  45 HDD   7.27675  1.0 7.28TiB 5.64TiB 1.64TiB 77.53 1.36  81 osd.7
…...

-5  79.99017- 80.0TiB 47.7TiB 32.3TiB 59.62 1.04   - host 
serverB
   1 HDD   9.09511  1.0 9.10TiB 4.79TiB 4.31TiB 52.63 0.92  87 osd.1
   6 HDD   9.09511  1.0 9.10TiB 6.62TiB 2.48TiB 72.75 1.27  99 osd.6
  …...

Thank you


You can use the upmap balancer since Ceph Luminous: 
http://docs.ceph.com/docs/luminous/mgr/balancer/




k



Re: [ceph-users] RGW: Is 'radosgw-admin reshard stale-instances rm' safe?

2019-06-21 Thread Konstantin Shalygin

Hi, folks.

I have Luminous 12.2.12. Auto-resharding is enabled.

In stale instances list I have:

# radosgw-admin reshard stale-instances list | grep clx
 "clx:default.422998.196",

I have the same marker-id in bucket stats of this bucket:

# radosgw-admin bucket stats --bucket clx | grep marker
 "marker": "default.422998.196",
 "max_marker": 
"0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#,13#,14#,15#,16#,17#,18#,19#,20#,21#,22#,23#,24#,25#,26#,27#,28#,29#,30#,31#,32#,33#,34#,35#,36#,37#,38#,39#,40#,41#,42#,43#,44#,45#,46#,47#,48#,49#,50#,51#,52#",

I think it is not correct. I think active marker (in bucket stats) must not 
match marker in stale instances list.

I have to run ‘radosgw-admin reshard stale-instances rm’ because I have large 
OMAP warning, but I am not sure.

Is it safe to run: radosgw-admin reshard stale-instances rm ?


Yes, these were left stale by dynamic resharding, mostly prior to 12.2.11. At 
least I have not seen new stale instances in my cluster.




k



Re: [ceph-users] Possible to move RBD volumes between pools?

2019-06-20 Thread Konstantin Shalygin

Both pools are in the same Ceph cluster. Do you have any documentation on
the live migration process? I'm running 14.2.1


Something like:

```

rbd migration prepare test1 rbd2/test2

rbd migration execute test1

rbd migration commit test1 --force

```



k



Re: [ceph-users] Any way to modify Bluestore label ?

2019-06-13 Thread Konstantin Shalygin

Hello,

I would like to modify Bluestore label of an OSD, is there a way to do this
?

I so that we could diplay them with  "ceph-bluestore-tool show-label" but i
did not find anyway to modify them...

Is it possible ?
I changed LVM tags but that don't help with bluestore labels..

# ceph-bluestore-tool show-label --dev
/dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e
{
"/dev/ceph-dd64f696-4908-4088-8bea-9ed5e15dd3ce/osd-block-3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e":
{
"osd_uuid": "3a5e0c27-0bc3-4fb2-90d3-c6d2cd4e2f2e",
"size": 1073737629696,
"btime": "2019-06-11 17:18:12.935690",
"description": "main",
"bluefs": "1",
"ceph_fsid": "cf7017d0-bb78-44d9-9d99-dfe2c210a4fa",
"kv_backend": "rocksdb",
"magic": "ceph osd volume v026",
"mkfs_done": "yes",
"osd_key": "xx",
"ready": "ready",
"whoami": "1"
}
}


This possible like this:

ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-1/block 
--key <key> --value <123>




k



Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-29 Thread Konstantin Shalygin

We have a similar setup, but 24 disks and 2x P4800X. And the 375GB NVME
drives are _not_ large enough:


2019-05-29 07:00:00.000108 mon.bcf-03 [WRN] overall HEALTH_WARN BlueFS
spillover detected on 22 OSD(s)

root@bcf-10:~# parted /dev/nvme0n1 print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 375GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number  Start   End Size    File system  Name  Flags
   1  1049kB  31.1GB  31.1GB
   2  31.1GB  62.3GB  31.1GB
   3  62.3GB  93.4GB  31.1GB
   4  93.4GB  125GB   31.1GB
   5  125GB   156GB   31.1GB
   6  156GB   187GB   31.1GB
   7  187GB   218GB   31.1GB
   8  218GB   249GB   31.1GB
   9  249GB   280GB   31.1GB
10  280GB   311GB   31.1GB
11  311GB   343GB   31.1GB
12  343GB   375GB   32.6GB


The second NVME has the same partition layout. The twelfth partition is
actually large enough to hold all the data, but the other 11 partitions
on this drive are a little bit too small. I'm still trying to calculate
the exact sweet spot


With 24 OSDs and two of them having a just-large-enough-db-partition, I
end up with 22 OSD not fully using their db partition and spilling over
into the slow disk...exactly as reported by ceph.

Details for one of the affected OSDs:

      "bluefs": {
      "gift_bytes": 0,
      "reclaim_bytes": 0,
      "db_total_bytes": 31138504704,
      "db_used_bytes": 2782912512,
      "wal_total_bytes": 0,
      "wal_used_bytes": 0,
      "slow_total_bytes": 320062095360,
      "slow_used_bytes": 5838471168,
      "num_files": 135,
      "log_bytes": 13295616,
      "log_compactions": 9,
      "logged_bytes": 338104320,
      "files_written_wal": 2,
      "files_written_sst": 5066,
      "bytes_written_wal": 375879721287,
      "bytes_written_sst": 227201938586,
      "bytes_written_slow": 6516224,
      "max_bytes_wal": 0,
      "max_bytes_db": 5265940480,
      "max_bytes_slow": 7540310016
      },

Maybe it's just matter of shifting some megabytes. We are about to
deploy more of these nodes, so I would be grateful if anyone can comment
on the correct size of the DB partitions. Otherwise I'll have to use a
RAID-0 for two drives.


Regards,




Your block.db is 29Gb; it should be 30Gb to prevent spillover to the slow backend.
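
If you want to watch for spillover on a given OSD, a quick check against the admin socket (osd.0 is just an example) could look like:

```
ceph daemon osd.0 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_used_bytes}'
```

Any non-zero slow_used_bytes means data has spilled to the slow device.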



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Sizing for DB/WAL: 4% for large drives?

2019-05-28 Thread Konstantin Shalygin

Dear All,

Quick question regarding SSD sizing for a DB/WAL...

I understand 4% is generally recommended for a DB/WAL.

Does this 4% continue for "large" 12TB drives, or can we  economise and
use a smaller DB/WAL?

Ideally I'd fit a smaller drive providing a 266GB DB/WAL per 12TB OSD,
rather than 480GB. i.e. 2.2% rather than 4%.

Will "bad things" happen as the OSD fills with a smaller DB/WAL?

By the way the cluster will mainly be providing CephFS, fairly large
files, and will use erasure encoding.

many thanks for any advice,

Jake



block.db should be 30Gb or 300Gb - anything between is pointless. There 
is described why: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-February/033286.html


This "4%" mean nothing actually.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous OSD: replace block.db partition

2019-05-28 Thread Konstantin Shalygin

On 5/28/19 5:16 PM, Igor Fedotov wrote:


LVM volume and raw file resizing is quite simple, while partition one 
might need manual data movement to another target via dd or something.



This is also possible and tested; the how-to is here: https://bit.ly/2UFVO9Z



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-05-28 Thread Konstantin Shalygin


Hi,

With the release of 12.2.12 the bitmap allocator for BlueStore is now
available under Mimic and Luminous.

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Before setting this in production: What might the implications be and
what should be thought of?

From what I've read the bitmap allocator seems to be better in read
performance and uses less memory.

In Nautilus bitmap is the default, but L and M still default to stupid.

Since the bitmap allocator was backported there must be a use-case to
use the bitmap allocator instead of stupid.

Thanks!

Wido



Wido, did you set the allocator to bitmap on your Luminous installations over the past 
months? Any improvements?




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous OSD: replace block.db partition

2019-05-28 Thread Konstantin Shalygin

Hello - I have created an OSD with 20G block.db, now I wanted to change the
block.db to 100G size.
Please let us know if there is a process for the same.

PS: Ceph version 12.2.4 with bluestore backend.



You should upgrade to 12.2.11+ first! Expand your block.db via 
`ceph-bluestore-tool bluefs-bdev-expand --path 
/var/lib/ceph/osd/ceph-<id>`
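
A rough outline of the whole procedure, assuming the block.db sits on an LVM logical volume (the id, VG and LV names are placeholders; for a raw partition you would grow the partition instead of the LV):

```
systemctl stop ceph-osd@<id>
lvextend -L 100G <vg>/<db-lv>
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-<id>
systemctl start ceph-osd@<id>
```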




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] large omap object in usage_log_pool

2019-05-23 Thread Konstantin Shalygin

in the config.
```"rgw_override_bucket_index_max_shards": "8",```. Should this be
increased?


Should be decreased to default `0`, I think.

Modern Ceph releases resolve large omaps automatically via bucket 
dynamic resharding:


```

{
    "option": {
    "name": "rgw_dynamic_resharding",
    "type": "bool",
    "level": "basic",
    "desc": "Enable dynamic resharding",
    "long_desc": "If true, RGW will dynamicall increase the number 
of shards in buckets that have a high number of objects per shard.",

    "default": true,
    "daemon_default": "",
    "tags": [],
    "services": [
    "rgw"
    ],
    "see_also": [
    "rgw_max_objs_per_shard"
    ],
    "min": "",
    "max": ""
    }
}
```

```

{
    "option": {
    "name": "rgw_max_objs_per_shard",
    "type": "int64_t",
    "level": "basic",
    "desc": "Max objects per shard for dynamic resharding",
    "long_desc": "This is the max number of objects per bucket 
index shard that RGW will allow with dynamic resharding. RGW will 
trigger an automatic reshard operation on the bucket if it exceeds this 
number.",

    "default": 10,
    "daemon_default": "",
    "tags": [],
    "services": [
    "rgw"
    ],
    "see_also": [
    "rgw_dynamic_resharding"
    ],
    "min": "",
    "max": ""
    }
}
```


So when your bucket grows by another 100k objects, RGW will reshard this bucket 
automatically.


Some old buckets may not be sharded, like your ancient ones from Giant. You 
can check the fill status like this: `radosgw-admin bucket limit check | jq 
'.[]'`. If some buckets are not resharded you can reshard them by hand via 
`radosgw-admin reshard add ...`, for example as shown below. Also, there may be some stale reshard 
instances (fixed ~ in 12.2.11); you can check them via `radosgw-admin 
reshard stale-instances list` and then remove them via `reshard 
stale-instances rm`.
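
For a bucket that needs manual resharding, something like this (bucket name and shard count are placeholders):

```
radosgw-admin reshard add --bucket <bucket> --num-shards 16
radosgw-admin reshard list
radosgw-admin reshard process
```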




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW metadata pool migration

2019-05-23 Thread Konstantin Shalygin

What are the metadata pools in an RGW deployment that need to sit on the 
fastest medium to better the client experience from an access standpoint ?
Also is there an easy way to migrate these pools in a PROD scenario with 
minimal to no-outage if possible ?


Just change the crush rule to place the default.rgw.buckets.index pool on your 
fastest drives.
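
Assuming your fast drives carry the `ssd` device class, a minimal sketch would be:

```
ceph osd crush rule create-replicated rgw-index-ssd default host ssd
ceph osd pool set default.rgw.buckets.index crush_rule rgw-index-ssd
```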




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Konstantin Shalygin

how do you deal with the "clock skew detected" HEALTH_WARN message?

I think the internal RTC in most x86 servers does have 1 second resolution
only, but Ceph skew limit is much smaller than that. So every time I reboot
one of my mons (for kernel upgrade or something), I have to wait for several
minutes for the system clock to synchronize over NTP, even though ntpd
has been running before reboot and was started during the system boot again.


Definitely you should use chrony with iburst.
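
A minimal chrony.conf sketch (the NTP servers are just examples, use ones close to you):

```
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
makestep 1.0 3
```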



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-15 Thread Konstantin Shalygin


On 5/15/19 1:49 PM, Kevin Flöh wrote:


since we have 3+1 ec I didn't try before. But when I run the command 
you suggested I get the following error:


ceph osd pool set ec31 min_size 2
Error EINVAL: pool min_size must be between 3 and 4



What is your current min size? `ceph osd pool get ec31 min_size`



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-14 Thread Konstantin Shalygin

  peering does not seem to be blocked anymore. But still there is no
recovery going on. Is there anything else we can try?



Try to reduce min_size for problem pool as 'health detail' suggested: 
`ceph osd pool set ec31 min_size 2`.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rule device classes mystery

2019-05-07 Thread Konstantin Shalygin

Hi List,

I'm playing around with CRUSH rules and device classes and I'm puzzled
if it's working correctly. Platform specifics: Ubuntu Bionic with Ceph 14.2.1

I created two new device classes "cheaphdd" and "fasthdd". I made
sure these device classes are applied to the right OSDs and that the
(shadow) crush rule is correctly filtering the right classes for the
OSDs (ceph osd crush tree --show-shadow).

I then created two new crush rules:

ceph osd crush rule create-replicated fastdisks default host fasthdd
ceph osd crush rule create-replicated cheapdisks default host cheaphdd

# rules
rule replicated_rule {
 id 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step chooseleaf firstn 0 type host
 step emit
}
rule fastdisks {
 id 1
 type replicated
 min_size 1
 max_size 10
 step take default class fasthdd
 step chooseleaf firstn 0 type host
 step emit
}
rule cheapdisks {
 id 2
 type replicated
 min_size 1
 max_size 10
 step take default class cheaphdd
 step chooseleaf firstn 0 type host
 step emit
}

After that I put the cephfs_metadata on the fastdisks CRUSH rule:

ceph osd pool set cephfs_metadata crush_rule fastdisks

Some data is moved to new osds, but strange enough there is still data on PGs
residing on OSDs in the "cheaphdd" class. I confirmed this with:

ceph pg ls-by-pool cephfs_data

Testing CRUSH rule nr. 1 gives me:

crushtool -i /tmp/crush_raw --test --show-mappings --rule 1 --min-x 1 --max-x 4 
 --num-rep 3
CRUSH rule 1 x 1 [0,3,6]
CRUSH rule 1 x 2 [3,6,0]
CRUSH rule 1 x 3 [0,6,3]
CRUSH rule 1 x 4 [0,6,3]

Which are indeed the OSDs in the fasthdd class.

Why is not all data moved to OSDs 0,3,6, but still spread on OSDs on the
cheaphhd class as well?


Because you set the new crush rule only for the `cephfs_metadata` pool, but you are looking 
at the PGs of the `cephfs_data` pool.
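
You can verify which rule each pool uses and where its PGs actually sit, e.g.:

```
ceph osd pool get cephfs_metadata crush_rule
ceph pg ls-by-pool cephfs_metadata
```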




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] reshard list

2019-04-10 Thread Konstantin Shalygin

Hello,

I am have been managing a ceph cluster running 12.2.11.  This was running
12.2.5 until the recent upgrade three months ago.  We build another cluster
running 13.2.5 and synced the data between clusters and now would like to
run primarily off the 13.2.5 cluster.  The data is all S3 buckets.  There
are 15 buckets with more than 1 million objects in them. I attempted to
start sharding on the bucket indexes by using the following process from
the documentation.

Pulling the zonegroup

#radosgw-admin zonegroup get > zonegroup.json

Changing bucket_index_max_shards to a number other than 0 and then

#radosgw-admin zonegroup set < zonegroup.json

Update the period

This had no effect on existing buckets.  What is the methodology to enable
sharding on existing buckets.  Also I am not able to see the reshard list I
get the follwoing error.

2019-04-10 10:33:05.074 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.00
2019-04-10 10:33:05.078 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.01
2019-04-10 10:33:05.082 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.02
2019-04-10 10:33:05.082 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.03
2019-04-10 10:33:05.114 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.04
2019-04-10 10:33:05.118 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.05
2019-04-10 10:33:05.118 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.06
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.07
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.08
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.09
2019-04-10 10:33:05.122 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.10
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.11
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.12
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.13
2019-04-10 10:33:05.126 7fbd534cb300 -1 ERROR: failed to list reshard log
entries, oid=reshard.14

Any suggestions?

Andrew, RGW dynamic resharding is enabled via `rgw_dynamic_resharding` 
and governed by `rgw_max_objs_per_shard`.


Or you may reshard bucket by hand via `radosgw-admin reshard add ...`.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] op_w_latency

2019-04-02 Thread Konstantin Shalygin

Hello Ceph Users,

I am finding that the write latency across my ceph clusters isn't great and I 
wanted to see what other people are getting for op_w_latency. Generally I am 
getting 70-110ms latency.

I am using: ceph --admin-daemon /var/run/ceph/ceph-osd.102.asok perf dump | grep -A3 
'\"op_w_latency' | grep 'avgtime'


Better like this:

ceph daemon osd.102 perf dump | jq '.osd.op_w_latency.avgtime'


Ram, CPU and network don't seem to be the bottleneck. The drives are behind a 
dell H810p raid card with a 1GB writeback cache and battery. I have tried with 
LSI JBOD cards and haven't found it faster ( as you would expect with write 
cache ). The disks through iostat -xyz 1 show 10-30% usage with general service 
+ write latency around 3-4ms. Queue depth is normally less than one. RocksDB 
write latency is around 0.6ms, read 1-2ms. Usage is RBD backend for Cloudstack.


What is your hardware? Your CPU, RAM, Eth?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nautilus upgrade but older releases reported by features

2019-03-27 Thread Konstantin Shalygin

We recently updated a cluster to the Nautilus release by updating Debian
packages from the Ceph site. Then rebooted all servers.

ceph features still reports older releases, for example the osd

 "osd": [
 {
 "features": "0x3ffddff8ffac",
 "release": "luminous",
 "num": 12
 }

I think I am not understanding what is exactly meant by release here.
Can we alter the osd (mon, clients etc.) such that they report nautilus ??


Show your `ceph versions` please.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Konstantin Shalygin

On 3/23/19 12:20 AM, Mazzystr wrote:

inline...

On Fri, Mar 22, 2019 at 1:08 PM Konstantin Shalygin <k0...@k0ste.ru> wrote:


On 3/22/19 11:57 PM, Mazzystr wrote:
> I am also seeing BlueFS spill since updating to Nautilus.  I also see
> high slow_used_bytes and slow_total_bytes metrics.  It sure looks to
> me that the only solution is to zap and rebuild the osd.  I had to
> manually check 36 osds, some of them traditional processes and some
> containerized.  The lack of tooling here is underwhelming...  As soon
> as I rebuilt the osd the "BlueFS spill..." warning went away.
>
> I use 50Gb db partitions on an nvme with 3 or 6 Tb spinning disks.  I
> don't understand the spillover.

Wow, it's something new. What is your upgrade path?


I keep current with community.  All osds have all been rebuilt as of 
luminous.


    Also, do you record cluster metrics, e.g. via Prometheus? To see the diff
    between upgrades.

Unfortunately not.  I've only had prometheus running for about two 
weeks and I had it turned off for a couple of days for some unknown 
reason... :/


This is sad, because it would have been good to see the nature of the metrics on a graph.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Konstantin Shalygin

On 3/22/19 11:57 PM, Mazzystr wrote:
I am also seeing BlueFS spill since updating to Nautilus.  I also see 
high slow_used_bytes and slow_total_bytes metrics.  It sure looks to 
me that the only solution is to zap and rebuilt the osd.  I had to 
manually check 36 osds some of them traditional processes and some 
containerized.  The lack of tooling here is underwhelming...  As soon 
as I rebuilt the osd the "BlueFS spill..." warning went away.


I use 50Gb db partitions on an nvme with 3 or 6 Tb spinning disks.  I 
don't understand the spillover.


Wow, it's something new. What is your upgrade path?

Also, do you record cluster metrics, e.g. via Prometheus? To see the diff 
between upgrades.



Thanks,

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Recovery Settings

2019-03-19 Thread Konstantin Shalygin

I setup an SSD Luminous 12.2.11 cluster and realized after data had been
added that pg_num was not set properly on the default.rgw.buckets.data pool
( where all the data goes ).  I adjusted the settings up, but recovery is
going really slow ( like 56-110MiB/s ) ticking down at .002 per log
entry(ceph -w).  These are all SSDs on luminous 12.2.11 ( no journal drives
) with a set of 2 10Gb fiber twinax in a bonded LACP config.  There are six
servers, 60 OSDs, each OSD is 2TB.  There was about 4TB of data ( 3 million
objects ) added to the cluster before I noticed the red blinking lights.

  


I tried adjusting the recovery to:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'

ceph tell 'osd.*' injectargs '--osd-recovery-max-active 30'

  


Which did help a little, but didn't seem to have the impact I was looking
for.  I have used the settings on HDD clusters before to speed things up (
using 8 backfills and 4 max active though ).  Did I miss something or is
this part of the pg expansion process.  Should I be doing something else
with SSD clusters?

  


Regards,

-Brent

  


Existing Clusters:

Test: Luminous 12.2.11 with 3 osd servers, 1 mon/man, 1 gateway ( all
virtual on SSD )

US Production(HDD): Jewel 10.2.11 with 5 osd servers, 3 mons, 3 gateways
behind haproxy LB

UK Production(HDD): Luminous 12.2.11 with 15 osd servers, 3 mons/man, 3
gateways behind haproxy LB

US Production(SSD): Luminous 12.2.11 with 6 osd servers, 3 mons/man, 3
gateways behind haproxy LB


Try to lower `osd_recovery_sleep*` options.

You can get your current values from ceph admin socket like this:

```
ceph daemon osd.0 config show | jq 'to_entries[] | if (.key|test("^(osd_recovery_sleep)(.*)")) then (.) else empty end'
```
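
To actually lower them on an all-SSD cluster you could then inject something like this (the value 0 is just an illustration):

```
ceph tell 'osd.*' injectargs '--osd_recovery_sleep_ssd 0 --osd_recovery_sleep_hybrid 0'
```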


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v14.2.0 Nautilus released

2019-03-19 Thread Konstantin Shalygin

On 3/19/19 2:52 PM, Benjamin Cherian wrote:
> Hi,
>
> I'm getting an error when trying to use the APT repo for Ubuntu bionic.
> Does anyone else have this issue? Is the mirror sync actually still in
> progress? Or was something setup incorrectly?
>
> E: Failed to fetch https://download.ceph.com/debian-nautilus/dists/bionic/main/binary-amd64/Packages.bz2
> File has unexpected size (15515 != 15488). Mirror sync in progress? [IP: 158.69.68.124 443]
>   Hashes of expected file:
>    - Filesize:15488 [weak]
>    - SHA256:d5ea08e095eeeaa5cc134b1661bfaf55280fcbf8a265d584a4af80d2a424ec17
>    - SHA1:6da3a8aa17ed7f828f35f546cdcf923040e8e5b0 [weak]
>    - MD5Sum:7e5a4ecea4a4edc3f483623d48b6efa4 [weak]
>   Release file created at: Mon, 11 Mar 2019 18:44:46 +0000

I'm getting the same error for `apt update` with

deb https://download.ceph.com/debian-nautilus/ bionic main


I think you also affected with this [1] issue.


[1] http://tracker.ceph.com/issues/38763

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Constant Compaction on one mimic node

2019-03-17 Thread Konstantin Shalygin

I am getting a huge number of messages on one out of three nodes showing Manual 
compaction starting all the time.  I see no such of log entries on the other 
nodes in the cluster.

Mar 16 06:40:11 storage1n1-chi docker[24502]: debug 2019-03-16 06:40:11.441 
7f6967af4700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.2/rpm/el7/BUILD/ceph-13.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:1024]
[default] Manual compaction starting
Mar 16 06:40:11 storage1n1-chi docker[24502]: message repeated 4 times: [ debug 
2019-03-16 06:40:11.441 7f6967af4700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.2/rpm/el7/BUILD/ceph-13.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:1024]
[default] Manual compaction starting]
Mar 16 06:42:21 storage1n1-chi docker[24502]: debug 2019-03-16 06:42:21.466 
7f6970305700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.2/rpm/el7/BUILD/ceph-13.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:77]
[JOB 1021] Syncing log #194307

I am not sure what triggers those messages on one node and not on the others.

Checking config on all mons

debug_leveldb 4/5  override
debug_memdb   4/5  override
debug_mgr 0/5  override
debug_mgrc0/5  override
debug_rocksdb 4/5  override

Documentation tells nothing about the compaction logs or at least I couldn't 
find anything specific to my issue.


You should look at the docker side, I think, because this is a manual 
compaction, like `ceph daemon osd.0 compact` from the admin socket or `ceph 
tell osd.0 compact` from the admin cli.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-17 Thread Konstantin Shalygin

Yes, I was in a similar situation initially where I had deployed my OSD's with 
25GB DB partitions and after 3GB DB used, everything else was going into slowDB 
on disk. From memory 29GB was just enough to make the DB fit on flash, but 30GB 
is a safe round figure to aim for. With a 30GB DB partition with most RBD type 
workloads all data should reside on flash even for fairly large disks running 
erasure coding.

Nick


Nick, thank you! After upgrading to 12.2.11 I expanded the blockDB, and for 
a week after compaction the slowDB has not been used [1].


{
  "gift_bytes": 0,
  "reclaim_bytes": 0,
  "db_total_bytes": 32212897792,
  "db_used_bytes": 6572474368,
  "wal_total_bytes": 1074589696,
  "wal_used_bytes": 528482304,
  "slow_total_bytes": 240043163648,
  "slow_used_bytes": 0,
  "num_files": 113,
  "log_bytes": 8683520,
  "log_compactions": 3,
  "logged_bytes": 203821056,
  "files_written_wal": 2,
  "files_written_sst": 1138,
  "bytes_written_wal": 121626085396,
  "bytes_written_sst": 47053353874
}

I also wrote a how-to on increasing the partition size for my case; maybe it will be 
useful for someone [2].


[1] https://ibb.co/tXGqbbt

[2] https://bit.ly/2UFVO9Z

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Error in Mimic repo for Ubuntu 18.04

2019-03-15 Thread Konstantin Shalygin

This seems to be still a problem...

Is anybody looking into it?

Has anybody among the Ubuntu users created a ticket in the devops [1] project? No...



[1] http://tracker.ceph.com/projects/devops/activity

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move from own crush map rule (SSD / HDD) to Luminous device class

2019-03-14 Thread Konstantin Shalygin

in the beginning, I create separate crush rules for SSD and HDD pool (
six Ceph nodes), following this HOWTO:

https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/

Now I want to migrate to the standard crush rules, which comes with
Luminous. What is the procedure here ?

# ceph osd crush rule create-replicated ssdpool default-ssd host ssd
# ceph osd pool set ssd-pool crush_rule ssdpool

# ceph osd crush rule create-replicated satapool default-hdd host hdd
# ceph osd pool set sata-pool crush_rule satapool

and wait, until everything is done ?

In this case you will still be double-rooted, so I recommend:

1. Assign device classes to your osds.

2. Create a crush rule for hdd in the default root.

3. Set that crush rule on your hdd pools.

4. Move your osds from default-hdd to the default root.

5. When the data migration is finished, do the same for your ssds.

In the end you will be single-rooted with device-class crush rules (a rough sketch of steps 1-4 follows below).
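
A rough sketch of steps 1-4, assuming the `sata-pool` from your mail and a placeholder <hdd-host> bucket name (repeat per osd and host):

```
ceph osd crush set-device-class hdd osd.0
ceph osd crush rule create-replicated replicated-hdd default host hdd
ceph osd pool set sata-pool crush_rule replicated-hdd
ceph osd crush move <hdd-host> root=default
```

If an osd already carries a class, run `ceph osd crush rm-device-class osd.0` before setting the new one.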



k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need clarification about RGW S3 Bucket Tagging

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 8:58 PM, Matt Benjamin wrote:

Sorry, object tagging.  There's a bucket tagging question in another thread :)


Luminous works fine with object tagging, at least getObjectTagging 
and putObjectTagging on 12.2.11.



[k0ste@WorkStation]$ curl -s 
https://rwg_civetweb/my_bucket/empty-file.txt?tagging | xq '.Tagging[]'

"http://s3.amazonaws.com/doc/2006-03-01/;
{
  "Tag": {
    "Key": "User",
    "Value": "Bob"
  }
}



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need clarification about RGW S3 Bucket Tagging

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 8:36 PM, Casey Bodley wrote:
The bucket policy documentation just lists which actions the policy 
engine understands. Bucket tagging isn't supported, so those requests 
were misinterpreted as normal PUT requests to create a bucket. I 
opened https://github.com/ceph/ceph/pull/26952 to return 405 Method 
Not Allowed there instead and update the doc to clarify that it's not 
supported.


Do I understand correctly that:

- Luminous: supports object tagging.

- Mimic+: supports object tagging and lifecycle policies on these tags [1].

?


Thanks,

k

[1] https://tracker.ceph.com/issues/24011



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Need clarification about RGW S3 Bucket Tagging

2019-03-14 Thread Konstantin Shalygin

Hi.

I CC'ed Casey Bodley as new RGW tech lead.

The Luminous doc [1] says that the s3:GetBucketTagging & s3:PutBucketTagging 
methods are supported. But actually PutBucketTagging fails on Luminous 
12.2.11 RGW with "provided input did not specify location constraint 
correctly". I think this is issue [2], but why was the issue type of that ticket 
changed to feature? Is this mode unsupported in Luminous and this is a 
doc bug, or is this really a bug that should be fixed?



Thanks,

k

[1] http://docs.ceph.com/docs/luminous/radosgw/bucketpolicy/

[2] https://tracker.ceph.com/issues/24443


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 2:15 PM, Massimo Sgaravatto wrote:

I have some clients running centos7.4 with kernel 3.10

I was told that the minimum requirements are kernel >=4.13 or CentOS 
>= 7.5.


Yes, this is correct.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 2:10 PM, Massimo Sgaravatto wrote:

I am using Luminous everywhere


I mean, what is the version of your kernel clients?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 2:09 PM, Massimo Sgaravatto wrote:

I plan to use upmap after having migrated all my clients to CentOS 7.6


What is your current release?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 2:02 PM, Massimo Sgaravatto wrote:

Oh, I missed this information.

So this means that, after having run the balancer once in compat mode, 
if you add new OSDs you MUST manually define the weight-set for these 
newly added OSDs if you want to use the balancer, right?


This is an important piece of information that IMHO should be in the 
ceph documentation when the balancer is discussed


Again, this is because of legacy.

Tell me, why don't you want to use upmap?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 1:53 PM, Massimo Sgaravatto wrote:


So if I try to run the balancer in the current compat mode, should 
this define the weight-set also for the new OSDs ?
But if I try to create a balancer plan, I get an error [*] (while it 
worked before adding the new OSDs).


Nope, the balancer creates weights for compat mode only when no compat 
weight-set is present (i.e. when it starts from scratch).




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-14 Thread Konstantin Shalygin

On 3/14/19 1:11 PM, Massimo Sgaravatto wrote:

Thanks

I will try to set the weight-set for the new OSDs

But I am wondering what I did wrong to be in such scenario.


You don't. You just use legacy. But why? Jewel clients? Old kernels?



Is it normal that a newly created OSD has no weight-set defined ?


Of course.


Who is supposed to initially set the weight-set for an OSD ?


Balancer by compat mode.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-13 Thread Konstantin Shalygin

On 3/14/19 12:42 PM, Massimo Sgaravatto wrote:

[root@c-mon-01 /]# ceph osd df tree
ID  CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS TYPE 
NAME
 -1       1.95190        - 1.95TiB 88.4GiB 1.87TiB    0 0   - root 
default
 -2             0        -      0B      0B      0B    0 0   -     rack 
Rack15-PianoAlto
 -3       0.39038        -  400GiB 18.9GiB  381GiB 4.74 1.07   -    
 host c-osd-1
  0   hdd 0.09760  1.0  100GiB 5.42GiB 94.5GiB 5.43 1.23  77      
   osd.0
  1   hdd 0.09760  1.0  100GiB 2.65GiB 97.3GiB 2.65 0.60  71      
   osd.1
  2   hdd 0.09760  1.0  100GiB 3.68GiB 96.3GiB 3.68 0.83  66      
   osd.2
  3   hdd 0.09760  1.0  100GiB 7.19GiB 92.8GiB 7.19 1.63  62      
   osd.3
 -4       0.39038        -  400GiB 17.0GiB  383GiB 4.24 0.96   -    
 host c-osd-2
  4   hdd 0.09760  1.0  100GiB 7.54GiB 92.4GiB 7.55 1.71  70      
   osd.4
  5   hdd 0.09760  1.0  100GiB 3.36GiB 96.6GiB 3.36 0.76  71      
   osd.5
  6   hdd 0.09760  1.0  100GiB 54.1MiB 99.9GiB 0.05 0.01  67      
   osd.6
  7   hdd 0.09760  1.0  100GiB 6.01GiB 93.9GiB 6.01 1.36  59      
   osd.7
 -5       0.39038        -  400GiB 20.7GiB  379GiB 5.17 1.17   -    
 host c-osd-3
  8   hdd 0.09760  1.0  100GiB 6.70GiB 93.2GiB 6.71 1.52  63      
   osd.8
  9   hdd 0.09760  1.0  100GiB 4.93GiB 95.0GiB 4.94 1.12  70      
   osd.9
 10   hdd 0.09760  1.0  100GiB 4.11GiB 95.8GiB 4.11 0.93  71      
   osd.10
 11   hdd 0.09760  1.0  100GiB 4.92GiB 95.0GiB 4.92 1.11  59      
   osd.11
-11       0.78076        -  800GiB 31.8GiB  768GiB 3.98 0.90   -    
 host c-osd-5
 12   hdd 0.09760  1.0  100GiB 4.39GiB 95.6GiB 4.39 0.99  47      
   osd.12
 13   hdd 0.09760  1.0  100GiB 4.48GiB 95.5GiB 4.48 1.01  41      
   osd.13
 14   hdd 0.09760  1.0  100GiB 3.69GiB 96.3GiB 3.69 0.84  45      
   osd.14
 15   hdd 0.09760  1.0  100GiB 3.63GiB 96.4GiB 3.63 0.82  39      
   osd.15
 16   hdd 0.09760  1.0  100GiB 3.48GiB 96.5GiB 3.48 0.79  47      
   osd.16
 17   hdd 0.09760  1.0  100GiB 4.35GiB 95.6GiB 4.35 0.98  44      
   osd.17
 18   hdd 0.09760  1.0  100GiB 3.57GiB 96.4GiB 3.57 0.81  46      
   osd.18
 19   hdd 0.09760  1.0  100GiB 4.23GiB 95.8GiB 4.23 0.96  37      
   osd.19

                     TOTAL 1.95TiB 88.4GiB 1.87TiB 4.42
MIN/MAX VAR: 0.01/1.71  STDDEV: 1.64
[root@c-mon-01 /]#


I think you need to remove your compat weight-set via `osd crush 
weight-set rm-compat` and switch to the upmap balancer mode.


Or instead of this you can just add your new osds to this weight-set via `osd 
crush weight-set reweight-compat <osd> <weight>`.
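
Taking osd.12 from your tree as an example, adding it to the compat weight-set would look like:

```
ceph osd crush weight-set reweight-compat osd.12 0.09760
```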




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] weight-set defined for some OSDs and not defined for the new installed ones

2019-03-13 Thread Konstantin Shalygin

I have a cluster where for some OSD the weight-set is defined, while for
other OSDs it is not [*].

The OSDs with weight-set defined are Filestore OSDs created years ago using
"ceph-disk prepare"

The OSDs where the weight set is not defined are Bluestore OSDs installed
recently using
ceph-volume


I think this problem explains why I am not able to use the balancer
anymore (after having added these new OSDs).

Any "clean" procedure to fix this mess is appreciated !

Thanks, Massimo



[*]

[root@c-mon-01 /]# ceph osd crush tree
ID  CLASS WEIGHT  (compat) TYPE NAME
  -1   1.95190  root default
  -2 00 rack Rack15-PianoAlto
  -3   0.39038  0.38974 host c-osd-1
   0   hdd 0.09760  0.09988 osd.0
   1   hdd 0.09760  0.10040 osd.1
   2   hdd 0.09760  0.08945 osd.2
   3   hdd 0.09760  0.10001 osd.3
  -4   0.39038  0.39076 host c-osd-2
   4   hdd 0.09760  0.10275 osd.4
   5   hdd 0.09760  0.10081 osd.5
   6   hdd 0.09760  0.10135 osd.6
   7   hdd 0.09760  0.08585 osd.7
  -5   0.39038  0.39055 host c-osd-3
   8   hdd 0.09760  0.10622 osd.8
   9   hdd 0.09760  0.09148 osd.9
  10   hdd 0.09760  0.10164 osd.10
  11   hdd 0.09760  0.09122 osd.11
-11   0.78076  0.78076 host c-osd-5
  12   hdd 0.09760  osd.12
  13   hdd 0.09760  osd.13
  14   hdd 0.09760  osd.14
  15   hdd 0.09760  osd.15
  16   hdd 0.09760  osd.16
  17   hdd 0.09760  osd.17
  18   hdd 0.09760  osd.18
  19   hdd 0.09760  osd.19
[root@c-mon-01 /]#



Please, show your `ceph osd df tree`.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] S3 data on specific storage systems

2019-03-12 Thread Konstantin Shalygin

I have a cluster with SSD and HDD storage. I wonder how to configure S3
buckets on HDD storage backends only.
Do I need to create pools on this particular storage and define radosgw
placement with those or there is a better or easier way to achieve this ?


Just assign your "crush hdd rule" to your data poolpool via `ceph osd 
pool set  crush_rule `.
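
A minimal sketch, assuming your HDD osds carry the `hdd` device class and the data pool is the default `default.rgw.buckets.data`:

```
ceph osd crush rule create-replicated rgw-data-hdd default host hdd
ceph osd pool set default.rgw.buckets.data crush_rule rgw-data-hdd
```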




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to identify pre-luminous rdb images

2019-03-11 Thread Konstantin Shalygin

Hello list,

I upgraded to mimic some time ago and want to make use of the upmap feature now.
But I can't do "ceph osd set-require-min-compat-client luminous" as there are 
still pre-luminous clients connected.

The cluster was originally created from jewel release.

When I run "ceph features", I see many connections from jewel clients though 
all systems have mimic installed and are rebooted since then.

08:44 dk@mon03 [fra]:~$ ceph features
[...]
 "client": [
 {
 "features": "0x7010fb86aa42ada",
 "release": "jewel",
 "num": 70
 },
 {
 "features": "0x3ffddff8eea4fffb",
 "release": "luminous",
 "num": 185
 },
 {
 "features": "0x3ffddff8ffa4fffb",
 "release": "luminous",
 "num": 403
 }
[...]

These client connections belong to mapped rbd images.
When I inspect the rbd images with "rbd info" I don't see any difference in 
format and features.

How can I determine which rbd images are affected and how can I transform them 
to luminous types if possible.


Do you have krbd clients? Kernel clients still report the 'jewel' 
feature release, but upmap is supported.

If so, the kernel should be 4.13+ or EL 7.5. In that case you can append 
--yes-i-really-mean-it as a safe workaround.
--yes-i-really-mean-it as safe workaround.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Konstantin Shalygin

These results (800 MB/s writes, 1500 Mb/s reads, and 200 write IOPS, 400
read IOPS) seems incredibly low - particularly considering what the Optane
900p is meant to be capable of.

Is this in line with what you might expect on this hardware with Ceph
though?

Or is there some way to find out the source of bottleneck?


4 MByte * 200 IOPS = 800 MB/s. Which bottleneck exactly do you mean?

Try to use a 4K block size instead of 4M for an IOPS-oriented load.
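
For example with rados bench (the pool name is a placeholder; fio with a 4k randwrite profile would work just as well):

```
rados bench -p <pool> 60 write -b 4096 -t 16
```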


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph bug#2445 hitting version-12.2.4

2019-03-05 Thread Konstantin Shalygin

Hi - we are using ceph 12.2.4 and bug#24445 hitting, which caused 10
min IO pause on ceph cluster..

Is this bug fixed?
bug:https://tracker.ceph.com/issues/24445/


This seems to be a network issue, not Ceph. The reporter of this ticket 
never came back.




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to use straw2 for new buckets

2019-02-25 Thread Konstantin Shalygin

A few weeks ago I converted everything from straw to straw2 (to be able to
use the balancer) using the command:

ceph osd crush set-all-straw-buckets-to-straw2

I have now just added a new rack bucket, and moved a couple of new osd
nodes in this rack, using the commands:

ceph osd crush add-bucket Rack12-PianoAlto rack
ceph osd crush move Rack12-PianoAlto root=default
ceph osd crush move ceph-osd-06 rack=Rack12-PianoAlto
ceph osd crush move ceph-osd-07 rack=Rack12-PianoAlto

Since I see that these new entries are still using straw,  I re-run:

ceph osd crush set-all-straw-buckets-to-straw2

I am trying to understand what should be changed to have new buckets
created using straw2.

Should I change this one:

tunable straw_calc_version 1

in the crushmap ?


You should set `ceph osd crush tunables optimal`.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-22 Thread Konstantin Shalygin

Bluestore/RocksDB will only put the next level up size of DB on flash if the 
whole size will fit.
These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
pointless. Only ~3GB of SSD will ever be used out of a
28GB partition. Likewise a 240GB partition is also pointless as only ~30GB will 
be used.

I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB 
disks. The 10TB's are about 75% full and use around 14GB,
this is on mainly 3x Replica RBD(4MB objects)

Nick


Can you explain more? Do you mean that I should increase my 28Gb to 30Gb 
and this will do the trick?


How big is your db_slow size? Should we control it? Do you control it? How?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw-admin reshard stale-instances rm experience

2019-02-21 Thread Konstantin Shalygin

My advise: Upgrade to 12.2.11 and run the stale-instances list asap and
see if you need to rm data.

This isn't available in 13.2.4, but should be in 13.2.5, so on Mimic you
will need to wait. But this might bite you at some point.

I hope I can prevent some admins from having sleepless nights about a
Ceph cluster flapping.


Thanks for sharing your experience!



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to change/anable/activate a different osd_memory_target value

2019-02-20 Thread Konstantin Shalygin

we run into some OSD node freezes with out of memory and eating all swap too. 
Till we get more physical RAM I’d like to reduce the osd_memory_target, but 
can’t find where and how to enable it.

We have 24 bluestore Disks in 64 GB centos nodes with Luminous v12.2.11
Just set a value for `osd_memory_target` in your ceph.conf and restart 
your OSDs (`systemctl restart ceph-osd.target` restarts all osd 
daemons on the host).
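
For example (2 GiB per OSD is just an illustrative value, pick what fits your 64 GB nodes):

```
[osd]
osd_memory_target = 2147483648
```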




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-19 Thread Konstantin Shalygin

On 2/19/19 11:46 PM, David Turner wrote:
I don't know that there's anything that can be done to resolve this 
yet without rebuilding the OSD.  Based on a Nautilus tool being able 
to resize the DB device, I'm assuming that Nautilus is also capable of 
migrating the DB/WAL between devices.  That functionality would allow 
anyone to migrate their DB back off of their spinner which is what's 
happening to you.  I don't believe that sort of tooling exists yet, 
though, without compiling the Nautilus Beta tooling for yourself.


I think you are wrong there: initially the bluestore tool could expand only 
wal/db devices [1]. With the latest releases of mimic and luminous this should 
work fine.


And only master has received the feature for expanding the main device [2].



[1] 
https://github.com/ceph/ceph/commit/2184e3077caa9de5f21cc901d26f6ecfb76de9e1


[2] 
https://github.com/ceph/ceph/commit/d07c10dfc02e4cdeda288bf39b8060b10da5bbf9


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd: Can I only just update krbd module without updating kernal?

2019-02-19 Thread Konstantin Shalygin

For some reasons, I can't update the kernel to a higher version.
So I wonder if I can just update the krbd kernel module? Has anyone
done this before?


Of course you can. You "just" need to make a krbd patch from the upstream 
kernel and apply it to your kernel tree.


It's a lot of work and you may get stuck at some point, because krbd 
uses the Linux block layer. In practice it's not


a good idea from either a technical or a business perspective.



k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-18 Thread Konstantin Shalygin

On 2/18/19 9:43 PM, David Turner wrote:
Do you have historical data from these OSDs to see when/if the DB used 
on osd.73 ever filled up?  To account for this OSD using the slow 
storage for DB, all we need to do is show that it filled up the fast 
DB at least once.  If that happened, then something spilled over to 
the slow storage and has been there ever since.


Yes, I have. I also checked my JIRA records for what I did at those times 
and marked it on the timeline: [1]


Another graph compares osd.(33|73) over the last year: [2]


[1] https://ibb.co/F7smCxW

[2] https://ibb.co/dKWWDzW

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Placing replaced disks to correct buckets.

2019-02-16 Thread Konstantin Shalygin

I recently replaced failed HDDs and removed them from their respective
buckets as per procedure.

But I’m now facing an issue when trying to place new ones back into the
buckets. I’m getting an error of ‘osd nr not found’ OR ‘file or
directory not found’ OR command sintax error.

I have been using the commands below:

ceph osd crush set   
ceph osd crush  set   

I do however find the OSD number when i run command:

ceph osd find 

Your assistance/response to this will be highly appreciated.

Regards
John.


Please paste your `ceph osd tree`, your version, and the exact error 
you get, including the osd number.

Less obfuscation is better in this, perhaps simple, case.
Less obfuscation is better in this, perhaps, simple case.


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack RBD EC pool

2019-02-16 Thread Konstantin Shalygin

### ceph.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i.ewcs.ch
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
###


## ceph.ec.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i..
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[client.cinder-ec]
rbd default data pool = ewos1-prod_cinder_ec
#

It is not necessary to split these settings into two files. Use a single 
ceph.conf instead.



[client.cinder-ec]
rbd default data pool = ewos1-prod_cinder_ec


But your pool is:


ceph osd pool create cinder_ec 512 512 erasure ec32
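
So either the pool or the config has to be renamed; if `cinder_ec` is really the pool you created, the client section would become:

```
[client.cinder-ec]
rbd default data pool = cinder_ec
```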




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

