Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-15 Thread Konstantin Shalygin

On 2/16/19 12:33 AM, David Turner wrote:
The answer is probably going to be in how big your DB partition is vs 
how big your HDD disk is.  From your output it looks like you have a 
6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used size 
isn't currently full, I would guess that at some point since this OSD 
was created that it did fill up and what you're seeing is the part of 
the DB that spilled over to the data disk. This is why the official 
recommendation (that is quite cautious, but cautious because some use 
cases will use this up) for a blocks.db partition is 4% of the data 
drive.  For your 6TB disks that's a recommendation of 240GB per DB 
partition.  Of course the actual size of the DB needed is dependent on 
your use case.  But pretty much every use case for a 6TB disk needs a 
bigger partition than 28GB.



My current DB size for osd.33 is 7910457344 bytes, and for osd.73 it is 
2013265920+4685037568 bytes: 7544 MB (24.56% of db_total_bytes) vs 
6388 MB (6.69% of db_total_bytes).


Why isn't osd.33 using slow storage in this case?



k
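For completeness, the numbers above can be pulled straight from the BlueFS perf counters on the OSD node; a non-zero slow_used_bytes means part of the DB has spilled onto the data device. A sketch (the OSD id is just an example):

ceph daemon osd.33 perf dump | python -c '
import json, sys
b = json.load(sys.stdin)["bluefs"]
print("db used/total:   %d / %d" % (b["db_used_bytes"], b["db_total_bytes"]))
print("slow used/total: %d / %d" % (b["slow_used_bytes"], b["slow_total_bytes"]))
'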



[ceph-users] PG_AVAILABILITY with one osd down?

2019-02-15 Thread jesper
Yesterday I saw this one... it puzzles me:
2019-02-15 21:00:00.000126 mon.torsk1 mon.0 10.194.132.88:6789/0 604164 :
cluster [INF] overall HEALTH_OK
2019-02-15 21:39:55.793934 mon.torsk1 mon.0 10.194.132.88:6789/0 604304 :
cluster [WRN] Health check failed: 2 slow requests are blocked > 32 sec.
Implicated osds 58 (REQUEST_SLOW)
2019-02-15 21:40:00.887766 mon.torsk1 mon.0 10.194.132.88:6789/0 604305 :
cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec.
Implicated osds 9,19,52,58,68 (REQUEST_SLOW)
2019-02-15 21:40:06.973901 mon.torsk1 mon.0 10.194.132.88:6789/0 604306 :
cluster [WRN] Health check update: 14 slow requests are blocked > 32 sec.
Implicated osds 3,9,19,29,32,52,55,58,68,69 (REQUEST_SLOW)
2019-02-15 21:40:08.466266 mon.torsk1 mon.0 10.194.132.88:6789/0 604307 :
cluster [INF] osd.29 failed (root=default,host=bison) (6 reporters from
different host after 33.862482 >= grace 29.247323)
2019-02-15 21:40:08.473703 mon.torsk1 mon.0 10.194.132.88:6789/0 604308 :
cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-02-15 21:40:09.489494 mon.torsk1 mon.0 10.194.132.88:6789/0 604310 :
cluster [WRN] Health check failed: Reduced data availability: 6 pgs
peering (PG_AVAILABILITY)
2019-02-15 21:40:11.008906 mon.torsk1 mon.0 10.194.132.88:6789/0 604312 :
cluster [WRN] Health check failed: Degraded data redundancy:
3828291/700353996 objects degraded (0.547%), 77 pgs degraded (PG_DEGRADED)
2019-02-15 21:40:13.474777 mon.torsk1 mon.0 10.194.132.88:6789/0 604313 :
cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec.
Implicated osds 3,9,32,55,58,69 (REQUEST_SLOW)
2019-02-15 21:40:15.060165 mon.torsk1 mon.0 10.194.132.88:6789/0 604314 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 17 pgs peering)
2019-02-15 21:40:17.128185 mon.torsk1 mon.0 10.194.132.88:6789/0 604315 :
cluster [WRN] Health check update: Degraded data redundancy:
9897139/700354131 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:17.128219 mon.torsk1 mon.0 10.194.132.88:6789/0 604316 :
cluster [INF] Health check cleared: REQUEST_SLOW (was: 2 slow requests are
blocked > 32 sec. Implicated osds 32,55)
2019-02-15 21:40:22.137090 mon.torsk1 mon.0 10.194.132.88:6789/0 604317 :
cluster [WRN] Health check update: Degraded data redundancy:
9897140/700354194 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:27.249354 mon.torsk1 mon.0 10.194.132.88:6789/0 604318 :
cluster [WRN] Health check update: Degraded data redundancy:
9897142/700354287 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
2019-02-15 21:40:33.335147 mon.torsk1 mon.0 10.194.132.88:6789/0 604322 :
cluster [WRN] Health check update: Degraded data redundancy:
9897143/700354356 objects degraded (1.413%), 200 pgs degraded
(PG_DEGRADED)
... shortened ..
2019-02-15 21:43:48.496536 mon.torsk1 mon.0 10.194.132.88:6789/0 604366 :
cluster [WRN] Health check update: Degraded data redundancy:
9897168/700356693 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:43:53.496924 mon.torsk1 mon.0 10.194.132.88:6789/0 604367 :
cluster [WRN] Health check update: Degraded data redundancy:
9897170/700356804 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:43:58.497313 mon.torsk1 mon.0 10.194.132.88:6789/0 604368 :
cluster [WRN] Health check update: Degraded data redundancy:
9897172/700356879 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:03.497696 mon.torsk1 mon.0 10.194.132.88:6789/0 604369 :
cluster [WRN] Health check update: Degraded data redundancy:
9897174/700356996 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:06.939331 mon.torsk1 mon.0 10.194.132.88:6789/0 604372 :
cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-02-15 21:44:06.965401 mon.torsk1 mon.0 10.194.132.88:6789/0 604373 :
cluster [INF] osd.29 10.194.133.58:6844/305358 boot
2019-02-15 21:44:08.498060 mon.torsk1 mon.0 10.194.132.88:6789/0 604376 :
cluster [WRN] Health check update: Degraded data redundancy:
9897174/700357056 objects degraded (1.413%), 200 pgs degraded, 201 pgs
undersized (PG_DEGRADED)
2019-02-15 21:44:08.996099 mon.torsk1 mon.0 10.194.132.88:6789/0 604377 :
cluster [WRN] Health check failed: Reduced data availability: 12 pgs
peering (PG_AVAILABILITY)
2019-02-15 21:44:13.498472 mon.torsk1 mon.0 10.194.132.88:6789/0 604378 :
cluster [WRN] Health check update: Degraded data redundancy: 55/700357161
objects degraded (0.000%), 33 pgs degraded (PG_DEGRADED)
2019-02-15 21:44:15.081437 mon.torsk1 mon.0 10.194.132.88:6789/0 604379 :
cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data
availability: 12 pgs peering)
2019-02-15 21:44:18.498808 mon.torsk1 mon.0 10.194.132.88:6789/0 604380 :
cluster [WRN] Health check update: Degraded data redundancy: 14/700357230
objects degraded 
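For digging into an event like this after the fact, something along these lines helps correlate the peering PGs with the OSD that was marked down (a sketch; osd.29 is taken from the log above, the PG id is a placeholder):

ceph health detail              # lists the PGs behind PG_AVAILABILITY / PG_DEGRADED
ceph pg dump_stuck inactive     # PGs still peering / inactive
ceph pg ls-by-osd 29            # PGs mapped to the OSD that failed
ceph pg map <pgid>              # up/acting set of one affected PG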

[ceph-users] Openstack RBD EC pool

2019-02-15 Thread Florian Engelmann

Hi,

I tried to add an "archive" storage class to our OpenStack environment by 
introducing a second storage backend offering RBD volumes that keep their 
data in an erasure-coded pool. As I have to specify a data pool, I tried it 
as follows:



### keyring files:
ceph.client.cinder.keyring
ceph.client.cinder-ec.keyring

### ceph.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i.ewcs.ch
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
###


## ceph.ec.conf
[global]
fsid = b5e30221-a214-353c-b66b-8c37b4349123
mon host = ceph-mon.service.i..
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

[client.cinder-ec]
rbd default data pool = ewos1-prod_cinder_ec
#

# cinder-volume.conf
...
[ceph1-rp3-1]
volume_backend_name = ceph1-rp3-1
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = xxxcc8b-xx-ae16xx
rbd_pool = cinder
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 5
rbd_store_chunk_size = 4
rados_connect_timeout = -1
report_discard_supported = true
rbd_exclusive_cinder_pool = true
enable_deferred_deletion = true
deferred_deletion_delay = 259200
deferred_deletion_purge_interval = 3600

[ceph1-ec-1]
volume_backend_name = ceph1-ec-1
volume_driver = cinder.volume.drivers.rbd.RBDDriver
rbd_ceph_conf = /etc/ceph/ceph.ec.conf
rbd_user = cinder-ec
rbd_secret_uuid = xxcc8b-xx-ae16xx
rbd_pool = cinder_ec_metadata
rbd_flatten_volume_from_snapshot = false
rbd_max_clone_depth = 3
rbd_store_chunk_size = 4
rados_connect_timeout = -1
report_discard_supported = true
rbd_exclusive_cinder_pool = true
enable_deferred_deletion = true
deferred_deletion_delay = 259200
deferred_deletion_purge_interval = 3600
##


I created three pools (for cinder) like:
ceph osd pool create cinder 512 512 replicated rack_replicated_rule
ceph osd pool create cinder_ec_metadata 6 6 replicated rack_replicated_rule
ceph osd pool create cinder_ec 512 512 erasure ec32
ceph osd pool set cinder_ec allow_ec_overwrites true
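For what it's worth, the data-pool wiring can be verified outside of Cinder by creating a test image by hand; a sketch (the image name is an assumption, the pools are the ones created above):

rbd create --size 1024 --data-pool cinder_ec cinder_ec_metadata/ec-test
rbd info cinder_ec_metadata/ec-test    # should report "data_pool: cinder_ec"
rbd rm cinder_ec_metadata/ec-test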


I am able to use backend ceph1-rp3-1 without any errors (create, attach, 
delete, snapshot). I am also able to create volumes via:


openstack volume create --size 100 --type ec1 myvolume_ec

but I am not able to attach them to any instance. I get errors like:

==> libvirtd.log <==
2019-02-15 22:23:01.771+: 27895: error : 
qemuMonitorJSONCheckError:392 : internal error: unable to execute QEMU 
command 'device_add': Property 'scsi-hd.drive' can't find value 
'drive-scsi0-0-0-3'


My instance has three disks (root, swap and one replicated Cinder volume) 
and looks like:



[libvirt domain XML mangled by the list archive; the recoverable details are:
 name instance-254e, uuid 6d41c54b-753a-46c7-a573-bedf8822fbf5,
 flavor: 16384 MB memory, 80 GB disk, 8192 MB swap, 4 vCPUs,
 emulator /usr/bin/qemu-system-x86_64, and three RBD-backed disks:
   nova/6d41c54b-753a-46c7-a573-bedf8822fbf5_disk
   nova/6d41c54b-753a-46c7-a573-bedf8822fbf5_disk.swap
   cinder/volume-01e8cb68-1f86-4142-958c-fdd1c301833a
     (serial 01e8cb68-1f86-4142-958c-fdd1c301833a, iotune values 125829120 / 1000)]


Any ideas?

All the best,
Florian
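One thing that may be worth double-checking - an assumption on my part, not a confirmed diagnosis - is that every compute node has a libvirt secret for the second Ceph user matching the rbd_secret_uuid of the ceph1-ec-1 backend, roughly like this:

cat > secret-cinder-ec.xml <<'EOF'
<secret ephemeral='no' private='no'>
  <uuid>REPLACE-WITH-rbd_secret_uuid</uuid>
  <usage type='ceph'>
    <name>client.cinder-ec secret</name>
  </usage>
</secret>
EOF
virsh secret-define --file secret-cinder-ec.xml
virsh secret-set-value --secret REPLACE-WITH-rbd_secret_uuid \
    --base64 "$(ceph auth get-key client.cinder-ec)"
virsh secret-list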




[ceph-users] CephFS - read latency.

2019-02-15 Thread jesper
Hi.

I've got a bunch of "small" files moved onto CephFS as archive/bulk storage,
and now the backup (to tape) has to spool over them. A sample of the
single-threaded backup client delivers this very consistent pattern:

$ sudo strace -T -p 7307 2>&1 | grep -A 7 -B 3 open
write(111, "\377\377\377\377", 4)   = 4 <0.11>
openat(AT_FDCWD, "/ceph/cluster/rsyncbackups/fileshare.txt", O_RDONLY) =
38 <0.30>
write(111, "\0\0\0\021197418 2 67201568", 21) = 21 <0.36>
read(38,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536 <0.049733>
write(111,
"\0\1\0\0CLC\0\0\0\0\2\0\0\0\0\0\0\0\33\0\0\0\0\0\0\0\0\0\0\0\0"...,
65540) = 65540 <0.37>
read(38, " $$ $$\16\33\16 \16\33"..., 65536) = 65536
<0.000199>
write(111, "\0\1\0\0 $$ $$\16\33\16 $$"..., 65540) = 65540
<0.26>
read(38, "$ \33  \16\33\25 \33\33\33   \33\33\33
\25\0\26\2\16NVDOLOVB"..., 65536) = 65536 <0.35>
write(111, "\0\1\0\0$ \33  \16\33\25 \33\33\33   \33\33\33
\25\0\26\2\16NVDO"..., 65540) = 65540 <0.24>

The pattern is very consistent, thus it is not one PG or one OSD being
contended.
$ sudo strace -T -p 7307 2>&1 | grep -A 3 open  |grep read
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 11968 <0.070917>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 23232 <0.039789>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0P\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 65536 <0.053598>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 28240 <0.105046>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.061966>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 65536 <0.050943>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
65536) = 65536 <0.031217>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 7392 <0.052612>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 288 <0.075930>
read(41, "1316919290-DASPHYNBAAPe2218b"..., 65536) = 940 <0.040609>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 22400 <0.038423>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 11984 <0.039051>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 9040 <0.054161>
read(41, "NZCA_FS_CLCGENOMICS, 1, 1\nNZCA_F"..., 65536) = 73 <0.040654>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 22352 <0.031236>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0N\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 65536 <0.123424>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 49984 <0.052249>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 28176 <0.052742>
read(41,
"CLC\0\0\0\0\2\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536)
= 288 <0.092039>

Or to sum:
sudo strace -T -p 23748 2>&1 | grep -A 3 open  | grep read |  perl
-ane'/<(\d+\.\d+)>/; print $1 . "\n";' | head -n 1000 | ministat

        N           Min           Max        Median            Avg         Stddev
x    1000       3.2e-05      2.141551      0.054313    0.065834359    0.091480339


As can be seen, the "initial" read averages 65.8 ms, which - if the file size
is, say, 1 MB and the rest of the time is zero - caps read performance at
roughly 15-20 MB/s. At that pace, the journey through double-digit TB is long,
even with 72 OSDs backing it.

Spec: Ceph Luminous 12.2.5 - Bluestore
6 OSD nodes, 10TB HDDs, 4+2 EC pool, 10GbitE

Locally the drives deliver latencies of approximately 6-8 ms for a random
read. Any suggestion on where to find out where the remaining 50 ms is being
spent would be truly helpful.

Large files "just works" as read-ahead does a nice job in getting
performance up.

-- 
Jesper
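For narrowing down where the extra ~50 ms per first read goes, a synthetic random-read test directly against the CephFS mount can help separate client/MDS overhead from raw OSD read latency. A sketch (path and job parameters are assumptions; the directory must exist on the CephFS mount):

fio --name=cephfs-randread --directory=/ceph/cluster/fio-test \
    --rw=randread --bs=64k --size=1g --numjobs=1 --iodepth=1 \
    --direct=1 --runtime=60 --time_based --group_reporting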



Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Gregory Farnum
On Fri, Feb 15, 2019 at 1:39 AM Ilya Dryomov  wrote:

> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
> >
> > Hi Marc,
> >
> > You can see previous designs on the Ceph store:
> >
> > https://www.proforma.com/sdscommunitystore
>
> Hi Mike,
>
> This site stopped working during DevConf and hasn't been working since.
> I think Greg has contacted some folks about this, but it would be great
> if you could follow up because it's been a couple of weeks now...


That’s odd because we thought this was resolved by Monday, but I do see
from the time stamps that I was back in the USA when testing it. It must be
geographical, as Dan says... :/

>
>
> Thanks,
>
> Ilya
>


Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-15 Thread Gregory Farnum
Actually I think I misread what this was doing, sorry.

Can you do a “ceph osd tree”? It’s hard to see the structure via the text
dumps.
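For illustration only - whether this fits depends on the full rule in the gist - a rule shaped roughly like the one below would spread the k+m chunks across distinct ctnr buckets instead of fanning out inside one of them (names, ruleset id and sizes are assumptions, not the poster's actual rule):

rule site1_sata_ec_by_ctnr {
        ruleset 2
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take site1-sata
        step chooseleaf indep 0 type ctnr
        step emit
}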

On Wed, Feb 13, 2019 at 10:49 AM Gregory Farnum  wrote:

> Your CRUSH rule for EC pools is forcing that behavior with the line
>
> step chooseleaf indep 1 type ctnr
>
> If you want different behavior, you’ll need a different crush rule.
>
> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>
>> Hi, cephers
>>
>>
>> I am building a Ceph EC cluster. When a disk fails, I mark it out, but all of its
>> PGs remap to OSDs in the same host, whereas I think they should remap to
>> other hosts in the same rack.
>> test process is:
>>
>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
>> site1_sata_erasure_ruleset 4
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>> /etc/init.d/ceph stop osd.2
>> ceph osd out 2
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>
>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>> TOTAL 3073T 197G | TOTAL 3065T 197G
>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>
>>
>> some config info: (detail configs see:
>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>> jewel 10.2.11  filestore+rocksdb
>>
>> ceph osd erasure-code-profile get ISA-4-2
>> k=4
>> m=2
>> plugin=isa
>> ruleset-failure-domain=ctnr
>> ruleset-root=site1-sata
>> technique=reed_sol_van
>>
>> part of ceph.conf is:
>>
>> [global]
>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> pid file = /home/ceph/var/run/$name.pid
>> log file = /home/ceph/log/$cluster-$name.log
>> mon osd nearfull ratio = 0.85
>> mon osd full ratio = 0.95
>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>> osd pool default size = 3
>> osd pool default min size = 1
>> osd objectstore = filestore
>> filestore merge threshold = -10
>>
>> [mon]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>> mon cluster log file = /home/ceph/log/$cluster.log
>> [osd]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>> osd journal size = 1
>> osd mkfs type = xfs
>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>> osd backfill full ratio = 0.92
>> osd failsafe full ratio = 0.95
>> osd failsafe nearfull ratio = 0.85
>> osd max backfills = 1
>> osd crush update on start = false
>> osd op thread timeout = 60
>> filestore split multiple = 8
>> filestore max sync interval = 15
>> filestore min sync interval = 5
>> [osd.0]
>> host = cld-osd1-56
>> addr = X
>> user = ceph
>> devs = /disk/link/osd-0/data
>> osd journal = /disk/link/osd-0/journal
>> …….
>> [osd.503]
>> host = cld-osd42-56
>> addr = 10.108.87.52
>> user = ceph
>> devs = /disk/link/osd-503/data
>> osd journal = /disk/link/osd-503/journal
>>
>>
>> crushmap is below:
>>
>> # begin crush map
>> tunable choose_local_tries 0
>> tunable choose_local_fallback_tries 0
>> tunable choose_total_tries 50
>> tunable chooseleaf_descend_once 1
>> tunable chooseleaf_vary_r 1
>> tunable straw_calc_version 1
>> tunable allowed_bucket_algs 54
>>
>> # devices
>> device 0 osd.0
>> device 1 osd.1
>> device 2 osd.2
>> 。。。
>> device 502 osd.502
>> device 503 osd.503
>>
>> # types
>> type 0 osd  # osd
>> type 1 ctnr # sata/ssd group by node, -101~1xx/-201~2xx
>> type 2 media# sata/ssd group by rack, -11~1x/-21~2x
>> type 3 mediagroup   # sata/ssd group by site, -5/-6
>> type 4 unit # site, -2
>> type 5 root # root, -1
>>
>> # buckets
>> ctnr cld-osd1-56-sata {
>> id -101  # do not change unnecessarily
>> # weight 10.000
>> alg straw2
>> hash 0   # rjenkins1
>> item osd.0 weight 1.000
>> item osd.1 weight 1.000
>> item osd.2 weight 1.000
>> item osd.3 weight 1.000
>> item osd.4 weight 1.000
>> item osd.5 weight 1.000
>> item osd.6 weight 1.000
>> item osd.7 weight 1.000
>> item osd.8 weight 1.000
>> item osd.9 weight 1.000
>> }
>> ctnr cld-osd1-56-ssd {
>> id -201  # do not change unnecessarily
>> # weight 2.000
>> alg straw2
>> hash 0   # rjenkins1
>> 

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-15 Thread David Turner
The answer is probably going to be in how big your DB partition is vs how
big your HDD disk is.  From your output it looks like you have a 6TB HDD
with a 28GB Blocks.DB partition.  Even though the DB used size isn't
currently full, I would guess that at some point since this OSD was created
that it did fill up and what you're seeing is the part of the DB that
spilled over to the data disk.  This is why the official recommendation
(that is quite cautious, but cautious because some use cases will use this
up) for a blocks.db partition is 4% of the data drive.  For your 6TB disks
that's a recommendation of 240GB per DB partition.  Of course the actual
size of the DB needed is dependent on your use case.  But pretty much every
use case for a 6TB disk needs a bigger partition than 28GB.
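As a quick back-of-the-envelope check of that 4% rule of thumb, using the sizes from this thread:

echo "6000 * 0.04" | bc    # 6 TB data device * 4% -> 240 (GB) block.db
                           # versus the ~28 GB partition currently in place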

On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin  wrote:

> Wrong metadata paste of osd.73 in previous message.
>
>
> {
>
>  "id": 73,
>  "arch": "x86_64",
>  "back_addr": "10.10.10.6:6804/175338",
>  "back_iface": "vlan3",
>  "bluefs": "1",
>  "bluefs_db_access_mode": "blk",
>  "bluefs_db_block_size": "4096",
>  "bluefs_db_dev": "259:22",
>  "bluefs_db_dev_node": "nvme2n1",
>  "bluefs_db_driver": "KernelDevice",
>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
>  "bluefs_db_rotational": "0",
>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_db_size": "30064771072",
>  "bluefs_db_type": "nvme",
>  "bluefs_single_shared_device": "0",
>  "bluefs_slow_access_mode": "blk",
>  "bluefs_slow_block_size": "4096",
>  "bluefs_slow_dev": "8:176",
>  "bluefs_slow_dev_node": "sdl",
>  "bluefs_slow_driver": "KernelDevice",
>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
>  "bluefs_slow_partition_path": "/dev/sdl2",
>  "bluefs_slow_rotational": "1",
>  "bluefs_slow_size": "6001069199360",
>  "bluefs_slow_type": "hdd",
>  "bluefs_wal_access_mode": "blk",
>  "bluefs_wal_block_size": "4096",
>  "bluefs_wal_dev": "259:22",
>  "bluefs_wal_dev_node": "nvme2n1",
>  "bluefs_wal_driver": "KernelDevice",
>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
>  "bluefs_wal_rotational": "0",
>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_wal_size": "1073741824",
>  "bluefs_wal_type": "nvme",
>  "bluestore_bdev_access_mode": "blk",
>  "bluestore_bdev_block_size": "4096",
>  "bluestore_bdev_dev": "8:176",
>  "bluestore_bdev_dev_node": "sdl",
>  "bluestore_bdev_driver": "KernelDevice",
>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
>  "bluestore_bdev_partition_path": "/dev/sdl2",
>  "bluestore_bdev_rotational": "1",
>  "bluestore_bdev_size": "6001069199360",
>  "bluestore_bdev_type": "hdd",
>  "ceph_version": "ceph version 12.2.10
> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
>  "default_device_class": "hdd",
>  "distro": "centos",
>  "distro_description": "CentOS Linux 7 (Core)",
>  "distro_version": "7",
>  "front_addr": "172.16.16.16:6803/175338",
>  "front_iface": "vlan4",
>  "hb_back_addr": "10.10.10.6:6805/175338",
>  "hb_front_addr": "172.16.16.16:6805/175338",
>  "hostname": "ceph-osd5",
>  "journal_rotational": "0",
>  "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
>  "kernel_version": "3.10.0-862.11.6.el7.x86_64",
>  "mem_swap_kb": "0",
>  "mem_total_kb": "65724256",
>  "os": "Linux",
>  "osd_data": "/var/lib/ceph/osd/ceph-73",
>  "osd_objectstore": "bluestore",
>  "rotational": "1"
> }


Re: [ceph-users] jewel10.2.11 EC pool out a osd, its PGs remap to the osds in the same host

2019-02-15 Thread David Turner
I'm leaving the response on the CRUSH rule for Gregory, but you have
another problem you're running into that is causing more of this data to
stay on this node than you intend.  While you `out` the OSD it is still
contributing to the Host's weight.  So the host is still set to receive
that amount of data and distribute it among the disks inside of it.  This
is the default behavior (even if you `destroy` the OSD) to minimize the
data movement for losing the disk and again for adding it back into the
cluster after you replace the device.  If you are really strapped for
space, though, then you might consider fully purging the OSD which will
reduce the Host weight to what the other OSDs are.  However if you do have
a problem in your CRUSH rule, then doing this won't change anything for you.
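In case it helps: on Jewel the "full purge" is the classic three-step removal (`ceph osd purge` only exists from Luminous on). A sketch, using osd.2 from the test above:

ceph osd crush remove osd.2    # removes its weight from the host bucket
ceph auth del osd.2
ceph osd rm 2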

On Thu, Feb 14, 2019 at 11:15 PM hnuzhoulin2  wrote:

> Thanks. I read your reply in
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg48717.html
> so using indep will cause less data to remap when an OSD fails:
> using firstn: 1, 2, 3, 4, 5 -> 1, 2, 4, 5, 6 : 60% data remap
> using indep:  1, 2, 3, 4, 5 -> 1, 2, 6, 4, 5 : 25% data remap
>
> Am I right?
> If so, what is recommended when a disk fails and the total available
> space on the remaining disks in the machine is not enough (the failed
> disk can not be replaced immediately)? Or should I reserve more free
> space in an EC setup?
>
> On 02/14/2019 02:49,Gregory Farnum
>  wrote:
>
> Your CRUSH rule for EC pools is forcing that behavior with the line
>
> step chooseleaf indep 1 type ctnr
>
> If you want different behavior, you’ll need a different crush rule.
>
> On Tue, Feb 12, 2019 at 5:18 PM hnuzhoulin2  wrote:
>
>> Hi, cephers
>>
>>
>> I am building a Ceph EC cluster. When a disk fails, I mark it out, but all of its
>> PGs remap to OSDs in the same host, whereas I think they should remap to
>> other hosts in the same rack.
>> test process is:
>>
>> ceph osd pool create .rgw.buckets.data 8192 8192 erasure ISA-4-2
>> site1_sata_erasure_ruleset 4
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/1
>> /etc/init.d/ceph stop osd.2
>> ceph osd out 2
>> ceph osd df tree|awk '{print $1" "$2" "$3" "$9" "$10}'> /tmp/2
>> diff /tmp/1 /tmp/2 -y --suppress-common-lines
>>
>> 0 1.0 1.0 118 osd.0   | 0 1.0 1.0 126 osd.0
>> 1 1.0 1.0 123 osd.1   | 1 1.0 1.0 139 osd.1
>> 2 1.0 1.0 122 osd.2   | 2 1.0 0 0 osd.2
>> 3 1.0 1.0 113 osd.3   | 3 1.0 1.0 131 osd.3
>> 4 1.0 1.0 122 osd.4   | 4 1.0 1.0 136 osd.4
>> 5 1.0 1.0 112 osd.5   | 5 1.0 1.0 127 osd.5
>> 6 1.0 1.0 114 osd.6   | 6 1.0 1.0 128 osd.6
>> 7 1.0 1.0 124 osd.7   | 7 1.0 1.0 136 osd.7
>> 8 1.0 1.0 95 osd.8   | 8 1.0 1.0 113 osd.8
>> 9 1.0 1.0 112 osd.9   | 9 1.0 1.0 119 osd.9
>> TOTAL 3073T 197G | TOTAL 3065T 197G
>> MIN/MAX VAR: 0.84/26.56 | MIN/MAX VAR: 0.84/26.52
>>
>>
>> some config info: (detail configs see:
>> https://gist.github.com/hnuzhoulin/575883dbbcb04dff448eea3b9384c125)
>> jewel 10.2.11  filestore+rocksdb
>>
>> ceph osd erasure-code-profile get ISA-4-2
>> k=4
>> m=2
>> plugin=isa
>> ruleset-failure-domain=ctnr
>> ruleset-root=site1-sata
>> technique=reed_sol_van
>>
>> part of ceph.conf is:
>>
>> [global]
>> fsid = 1CAB340D-E551-474F-B21A-399AC0F10900
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>> pid file = /home/ceph/var/run/$name.pid
>> log file = /home/ceph/log/$cluster-$name.log
>> mon osd nearfull ratio = 0.85
>> mon osd full ratio = 0.95
>> admin socket = /home/ceph/var/run/$cluster-$name.asok
>> osd pool default size = 3
>> osd pool default min size = 1
>> osd objectstore = filestore
>> filestore merge threshold = -10
>>
>> [mon]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> mon data = /home/ceph/var/lib/$type/$cluster-$id
>> mon cluster log file = /home/ceph/log/$cluster.log
>> [osd]
>> keyring = /home/ceph/var/lib/$type/$cluster-$id/keyring
>> osd data = /home/ceph/var/lib/$type/$cluster-$id
>> osd journal = /home/ceph/var/lib/$type/$cluster-$id/journal
>> osd journal size = 1
>> osd mkfs type = xfs
>> osd mount options xfs = rw,noatime,nodiratime,inode64,logbsize=256k
>> osd backfill full ratio = 0.92
>> osd failsafe full ratio = 0.95
>> osd failsafe nearfull ratio = 0.85
>> osd max backfills = 1
>> osd crush update on start = false
>> osd op thread timeout = 60
>> filestore split multiple = 8
>> filestore max sync interval = 15
>> filestore min sync interval = 5
>> [osd.0]
>> host = cld-osd1-56
>> addr = X
>> user = ceph
>> devs = /disk/link/osd-0/data
>> osd journal = /disk/link/osd-0/journal
>> …….
>> [osd.503]
>> host = cld-osd42-56
>> addr = 10.108.87.52
>> user = ceph
>> devs = /disk/link/osd-503/data
>> osd journal = /disk/link/osd-503/journal
>>
>>
>> crushmap is below:
>>
>> # begin crush map
>> 

Re: [ceph-users] Problems with osd creation in Ubuntu 18.04, ceph 13.2.4-1bionic

2019-02-15 Thread David Turner
I have found that running a zap before all prepare/create commands with
ceph-volume helps things run more smoothly.  Zap is specifically there to clear
everything on a disk away, to make the disk ready to be used as an OSD.
Your wipefs command is still fine, but then I would lvm zap the disk before
continuing.  I would run the commands like [1] below.  I also prefer the
single command lvm create as opposed to lvm prepare and lvm activate.  Try
that out and see if you still run into the problems creating the BlueStore
filesystem.

[1] ceph-volume lvm zap /dev/sdg
ceph-volume lvm prepare --bluestore --data /dev/sdg
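Or, as the single-command variant mentioned above (same device assumed):

ceph-volume lvm zap /dev/sdg
ceph-volume lvm create --bluestore --data /dev/sdg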

On Thu, Feb 14, 2019 at 10:25 AM Rainer Krienke 
wrote:

> Hi,
>
> I am quite new to Ceph and am just trying to set up a cluster. Initially
> I used ceph-deploy for this, but when I tried to create a BlueStore OSD,
> ceph-deploy failed. Next I tried the direct way on one of the OSD nodes,
> using ceph-volume to create the OSD, but this also fails. Below you can
> see what ceph-volume says.
>
> I ensured that there were no leftover LVM VGs and LVs on the disk sdg
> before I started the OSD creation for this disk. The very same error
> also happens on other disks, not just /dev/sdg. All the disks are 4 TB
> in size, the Linux system is Ubuntu 18.04, and Ceph is installed in
> version 13.2.4-1bionic from this repo:
> https://download.ceph.com/debian-mimic.
>
> There is a VG with two LVs on the system for the Ubuntu system itself,
> which is installed on two separate disks configured as software RAID 1
> with LVM on top of the RAID. But I cannot imagine that this might do any
> harm to Ceph's OSD creation.
>
> Does anyone have an idea what might be wrong?
>
> Thanks for hints
> Rainer
>
> root@ceph1:~# wipefs -fa /dev/sdg
> root@ceph1:~# ceph-volume lvm prepare --bluestore --data /dev/sdg
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> -i - osd new 14d041d6-0beb-4056-8df2-3920e2febce0
> Running command: /sbin/vgcreate --force --yes
> ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b /dev/sdg
>  stdout: Physical volume "/dev/sdg" successfully created.
>  stdout: Volume group "ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b"
> successfully created
> Running command: /sbin/lvcreate --yes -l 100%FREE -n
> osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b
>  stdout: Logical volume "osd-block-14d041d6-0beb-4056-8df2-3920e2febce0"
> created.
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
> --> Absolute path not found for executable: restorecon
> --> Ensure $PATH environment variable contains common executable locations
> Running command: /bin/chown -h ceph:ceph
>
> /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> Running command: /bin/chown -R ceph:ceph /dev/dm-8
> Running command: /bin/ln -s
>
> /dev/ceph-1433ffd0-0a80-481a-91f5-d7a47b78e17b/osd-block-14d041d6-0beb-4056-8df2-3920e2febce0
> /var/lib/ceph/osd/ceph-0/block
> Running command: /usr/bin/ceph --cluster ceph --name
> client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring
> mon getmap -o /var/lib/ceph/osd/ceph-0/activate.monmap
>  stderr: got monmap epoch 1
> Running command: /usr/bin/ceph-authtool /var/lib/ceph/osd/ceph-0/keyring
> --create-keyring --name osd.0 --add-key
> AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ==
>  stdout: creating /var/lib/ceph/osd/ceph-0/keyring
> added entity osd.0 auth auth(auid = 18446744073709551615
> key=AQAAY2VcU968HxAAvYWMaJZmriUc4H9bCCp8XQ== with 0 caps)
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
> Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
> Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore
> bluestore --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap
> --keyfile - --osd-data /var/lib/ceph/osd/ceph-0/ --osd-uuid
> 14d041d6-0beb-4056-8df2-3920e2febce0 --setuser ceph --setgroup ceph
>  stderr: 2019-02-14 13:45:54.788 7f3fcecb3240 -1
> bluestore(/var/lib/ceph/osd/ceph-0/) _read_fsid unparsable uuid
>  stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: In
> function 'virtual int KernelDevice::read(uint64_t, uint64_t,
> ceph::bufferlist*, IOContext*, bool)' thread 7f3fcecb3240 time
> 2019-02-14 13:45:54.841130
>  stderr: /build/ceph-13.2.4/src/os/bluestore/KernelDevice.cc: 821:
> FAILED assert((uint64_t)r == len)
>  stderr: ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e)
> mimic (stable)
>  stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int,
> char const*)+0x102) [0x7f3fc60d33e2]
>  stderr: 2: (()+0x26d5a7) [0x7f3fc60d35a7]
>  stderr: 3: (KernelDevice::read(unsigned long, unsigned long,
> ceph::buffer::list*, IOContext*, bool)+0x4a7) [0x561371346817]
>  stderr: 4: 

[ceph-users] Second radosgw install

2019-02-15 Thread Adrian Nicolae

Hi,

I want to install a second radosgw to my existing ceph cluster (mimic) 
on another server. Should I create it like the first one, with 
'ceph-deploy rgw create' ?


I don't want to mess with the existing rgw system pools.

Thanks.
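Not an authoritative answer, but for a second gateway in the same (default) zone the usual pattern is simply to repeat the deploy step against the new host; all gateways in a zone share the existing RGW pools, so nothing gets recreated. A sketch, with "rgw2" as a placeholder hostname:

# install the radosgw package on the new host first, then from the admin node:
ceph-deploy rgw create rgw2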




Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Wido den Hollander


On 2/15/19 2:54 PM, Alexandre DERUMIER wrote:
>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>> OSDs as well. Over time their latency increased until we started to 
>>> notice I/O-wait inside VMs. 
> 
> I also notice it in the VMs. BTW, what is your NVMe disk size?

Samsung PM983 3.84TB SSDs in both clusters.

> 
> 
>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>> these OSDs as the memory would allow it. 
> 
> I have set the memory target to 6 GB this morning, with 2 OSDs of 3 TB each per 6 TB NVMe. 
> (My last test was 8 GB with 1 OSD of 6 TB, but that didn't help.)

There are 10 OSDs in these systems with 96GB of memory in total. We are
running with the memory target at 6G right now to make sure there is no
leakage. If this runs fine for a longer period we will go to 8GB per OSD,
so it will max out at 80GB, leaving 16GB as spare.

As these OSDs were all restarted earlier this week I can't tell how it
will hold up over a longer period. Monitoring (Zabbix) shows the latency
is fine at the moment.

Wido
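As an aside, for anyone who wants to watch this without a full monitoring stack, the same latencies can be polled directly (a sketch; osd.0 is just an example):

ceph osd perf                      # per-OSD commit/apply latency as seen by the mons
ceph daemon osd.0 perf dump | python -c '
import json, sys
osd = json.load(sys.stdin)["osd"]
for k in ("op_r_latency", "op_w_latency", "subop_w_latency"):
    c = osd[k]
    print("%s: %.6f s" % (k, c["sum"] / max(c["avgcount"], 1)))
'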

> 
> 
> - Mail original -
> De: "Wido den Hollander" 
> À: "Alexandre Derumier" , "Igor Fedotov" 
> 
> Cc: "ceph-users" , "ceph-devel" 
> 
> Envoyé: Vendredi 15 Février 2019 14:50:34
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart
> 
> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>> Thanks Igor. 
>>
>> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
>> different. 
>>
>> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
>> see this latency problem. 
>>
>>
> 
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
> OSDs as well. Over time their latency increased until we started to 
> notice I/O-wait inside VMs. 
> 
> A restart fixed it. We also increased memory target from 4G to 6G on 
> these OSDs as the memory would allow it. 
> 
> But we noticed this on two different 12.2.10/11 clusters. 
> 
> A restart made the latency drop. Not only the numbers, but the 
> real-world latency as experienced by a VM as well. 
> 
> Wido 
> 
>>
>>
>>
>>
>>
>> - Mail original - 
>> De: "Igor Fedotov"  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Envoyé: Vendredi 15 Février 2019 13:47:57 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
>> restart 
>>
>> Hi Alexander, 
>>
>> I've read through your reports, nothing obvious so far. 
>>
>> I can only see several times average latency increase for OSD write ops 
>> (in seconds) 
>> 0.002040060 (first hour) vs. 
>>
>> 0.002483516 (last 24 hours) vs. 
>> 0.008382087 (last hour) 
>>
>> subop_w_latency: 
>> 0.000478934 (first hour) vs. 
>> 0.000537956 (last 24 hours) vs. 
>> 0.003073475 (last hour) 
>>
>> and OSD read ops, osd_r_latency: 
>>
>> 0.000408595 (first hour) 
>> 0.000709031 (24 hours) 
>> 0.004979540 (last hour) 
>>
>> What's interesting is that such latency differences aren't observed at 
>> neither BlueStore level (any _lat params under "bluestore" section) nor 
>> rocksdb one. 
>>
>> Which probably means that the issue is rather somewhere above BlueStore. 
>>
>> Suggest to proceed with perf dumps collection to see if the picture 
>> stays the same. 
>>
>> W.r.t. memory usage you observed I see nothing suspicious so far - No 
>> decrease in RSS report is a known artifact that seems to be safe. 
>>
>> Thanks, 
>> Igor 
>>
>> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>>> Hi Igor, 
>>>
>>> Thanks again for helping ! 
>>>
>>>
>>>
>>> I have upgrade to last mimic this weekend, and with new autotune memory, 
>>> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
>>>
>>>
>>> I have done a lot of perf dump and mempool dump and ps of process to 
>> see rss memory at different hours, 
>>> here the reports for osd.0: 
>>>
>>> http://odisoweb1.odiso.net/perfanalysis/ 
>>>
>>>
>>> osd has been started the 12-02-2019 at 08:00 
>>>
>>> first report after 1h running 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>>
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>>  
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>>
>>>
>>>
>>> report after 24 before counter resets 
>>>
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>>
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>>  
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>>
>>> report 1h after counter reset 
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>>
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>>  
>>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>>
>>>
>>>
>>>
>>> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
>> around 12-02-2019 at 14:00 
>>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>>> Then after that, slowly 

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Alexandre DERUMIER
>>Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>OSDs as well. Over time their latency increased until we started to 
>>notice I/O-wait inside VMs. 

I also notice it in the VMs. BTW, what is your NVMe disk size?


>>A restart fixed it. We also increased memory target from 4G to 6G on 
>>these OSDs as the memory would allow it. 

I have set the memory target to 6 GB this morning, with 2 OSDs of 3 TB each per 6 TB NVMe. 
(My last test was 8 GB with 1 OSD of 6 TB, but that didn't help.)
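For reference, a memory target like that can be applied at runtime and persisted roughly as follows (a sketch; 6 GiB expressed in bytes):

ceph tell osd.* injectargs '--osd_memory_target 6442450944'
# and in ceph.conf under [osd], to survive restarts:
#   osd memory target = 6442450944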


- Mail original -
De: "Wido den Hollander" 
À: "Alexandre Derumier" , "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 15 Février 2019 14:50:34
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
> Thanks Igor. 
> 
> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
> different. 
> 
> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
> see this latency problem. 
> 
> 

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
OSDs as well. Over time their latency increased until we started to 
notice I/O-wait inside VMs. 

A restart fixed it. We also increased memory target from 4G to 6G on 
these OSDs as the memory would allow it. 

But we noticed this on two different 12.2.10/11 clusters. 

A restart made the latency drop. Not only the numbers, but the 
real-world latency as experienced by a VM as well. 

Wido 

> 
> 
> 
> 
> 
> - Mail original - 
> De: "Igor Fedotov"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Vendredi 15 Février 2019 13:47:57 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Hi Alexander, 
> 
> I've read through your reports, nothing obvious so far. 
> 
> I can only see several times average latency increase for OSD write ops 
> (in seconds) 
> 0.002040060 (first hour) vs. 
> 
> 0.002483516 (last 24 hours) vs. 
> 0.008382087 (last hour) 
> 
> subop_w_latency: 
> 0.000478934 (first hour) vs. 
> 0.000537956 (last 24 hours) vs. 
> 0.003073475 (last hour) 
> 
> and OSD read ops, osd_r_latency: 
> 
> 0.000408595 (first hour) 
> 0.000709031 (24 hours) 
> 0.004979540 (last hour) 
> 
> What's interesting is that such latency differences aren't observed at 
> neither BlueStore level (any _lat params under "bluestore" section) nor 
> rocksdb one. 
> 
> Which probably means that the issue is rather somewhere above BlueStore. 
> 
> Suggest to proceed with perf dumps collection to see if the picture 
> stays the same. 
> 
> W.r.t. memory usage you observed I see nothing suspicious so far - No 
> decrease in RSS report is a known artifact that seems to be safe. 
> 
> Thanks, 
> Igor 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>> 
>> Thanks again for helping ! 
>> 
>> 
>> 
>> I have upgrade to last mimic this weekend, and with new autotune memory, 
>> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
>> 
>> 
>> I have done a lot of perf dump and mempool dump and ps of process to 
> see rss memory at different hours, 
>> here the reports for osd.0: 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/ 
>> 
>> 
>> osd has been started the 12-02-2019 at 08:00 
>> 
>> first report after 1h running 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>> 
>> 
>> 
>> report after 24 before counter resets 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>> 
>> report 1h after counter reset 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>> 
>> 
>> 
>> 
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
> around 12-02-2019 at 14:00 
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>> Then after that, slowly decreasing. 
>> 
>> 
>> Another strange thing, 
>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>  
>> Then is decreasing over time (around 3,7G this morning), but RSS is 
> still at 8G 
>> 
>> 
>> I'm graphing mempools counters too since yesterday, so I'll able to 
> track them over time. 
>> 
>> - Mail original - 
>> De: "Igor Fedotov"  
>> À: "Alexandre Derumier"  
>> Cc: "Sage Weil" , "ceph-users" 
> , "ceph-devel"  
>> Envoyé: Lundi 11 Février 2019 12:03:17 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
> until restart 
>> 
>> On 2/8/2019 6:57 

[ceph-users] mount.ceph replacement in Python

2019-02-15 Thread Jonas Jelten
Hi!

I've created a mount.ceph.c replacement in Python which also utilizes the 
kernel keyring and does name resolution.
You can mount a CephFS without installing Ceph that way (and without using the 
legacy secret= mount option).

https://github.com/SFTtech/ceph-mount

When you place the script (or a symlink) in /sbin/mount.ceph, you can mount 
CephFS with systemd .mount units.
I hope it's useful for somebody here someday :)
Currently it's not optimized for proper packaging (no setup.py yet).

If things don't work or you wanna change something, just open bugs or pull 
requests please.


   -- Jonas
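For anyone wanting to try it, a systemd mount unit for CephFS looks roughly like this (a sketch; monitor address, mount point and credentials are placeholders, and the unit file name has to match the mount point):

# /etc/systemd/system/mnt-cephfs.mount
[Unit]
Description=CephFS mount
Wants=network-online.target
After=network-online.target

[Mount]
What=mon1.example.com:6789:/
Where=/mnt/cephfs
Type=ceph
Options=name=admin,noatime

[Install]
WantedBy=multi-user.target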



Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph

2019-02-15 Thread Gesiel Galvão Bernardes
Bingo! Changing the disk to SCSI and the adapter to virtio works perfectly.

Thank you Mark!

Regards,
Gesiel

Em sex, 15 de fev de 2019 às 10:21, Marc Roos 
escreveu:

>
> Use scsi disk and virtio adapter? I think that is recommended also for
> use with ceph rbd.
>
>
>
> -Original Message-
> From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com]
> Sent: 15 February 2019 13:16
> To: Marc Roos
> Cc: ceph-users
> Subject: Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph
>
> Hi Marc,
>
> I tried this and the problem continues :-(
>
>
> Em sex, 15 de fev de 2019 às 10:04, Marc Roos 
> escreveu:
>
>
>
>
> And then in the windows vm
> cmd
> diskpart
> Rescan
>
> Linux vm
> echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda)
> echo 1 >  /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd)
>
>
>
> I have this to, have to do this to:
>
> virsh qemu-monitor-command vps-test2 --hmp "info block"
> virsh qemu-monitor-command vps-test2 --hmp "block_resize
> drive-scsi0-0-0-0 12G"
>
>
>
>
>
> -Original Message-
> From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com]
> Sent: 15 February 2019 12:59
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph
>
> Hi,
>
> I'm setting up an environment for VMs with qemu/kvm and Ceph using RBD,
> and I have the following problem: the guest VM does not recognize a disk
> resize (increase). The scenario is:
>
> Host:
> Centos 7.6
> Libvirt 4.5
> Ceph 13.2.4
>
> I follow these steps to increase the disk (e.g. from 10 GB to 20 GB):
>
>
> # rbd resize --size 20480 mypool/vm_test
> # virsh blockresize --domain vm_test --path vda --size 20G
>
> But after these steps, the disk in the VM keeps its original size.
> To apply the change, it is necessary to reboot the VM.
> If I use a local datastore instead of Ceph, the VM recognizes the new size
> immediately.
>
> Does anyone else see this?  Is this expected?
>
>


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Wido den Hollander
On 2/15/19 2:31 PM, Alexandre DERUMIER wrote:
> Thanks Igor.
> 
> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
> different.
> 
> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
> see this latency problem.
> 
> 

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
OSDs as well. Over time their latency increased until we started to
notice I/O-wait inside VMs.

A restart fixed it. We also increased memory target from 4G to 6G on
these OSDs as the memory would allow it.

But we noticed this on two different 12.2.10/11 clusters.

A restart made the latency drop. Not only the numbers, but the
real-world latency as experienced by a VM as well.

Wido

> 
> 
> 
> 
> 
> - Mail original -
> De: "Igor Fedotov" 
> Cc: "ceph-users" , "ceph-devel" 
> 
> Envoyé: Vendredi 15 Février 2019 13:47:57
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart
> 
> Hi Alexander, 
> 
> I've read through your reports, nothing obvious so far. 
> 
> I can only see several times average latency increase for OSD write ops 
> (in seconds) 
> 0.002040060 (first hour) vs. 
> 
> 0.002483516 (last 24 hours) vs. 
> 0.008382087 (last hour) 
> 
> subop_w_latency: 
> 0.000478934 (first hour) vs. 
> 0.000537956 (last 24 hours) vs. 
> 0.003073475 (last hour) 
> 
> and OSD read ops, osd_r_latency: 
> 
> 0.000408595 (first hour) 
> 0.000709031 (24 hours) 
> 0.004979540 (last hour) 
> 
> What's interesting is that such latency differences aren't observed at 
> neither BlueStore level (any _lat params under "bluestore" section) nor 
> rocksdb one. 
> 
> Which probably means that the issue is rather somewhere above BlueStore. 
> 
> Suggest to proceed with perf dumps collection to see if the picture 
> stays the same. 
> 
> W.r.t. memory usage you observed I see nothing suspicious so far - No 
> decrease in RSS report is a known artifact that seems to be safe. 
> 
> Thanks, 
> Igor 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>>
>> Thanks again for helping ! 
>>
>>
>>
>> I have upgrade to last mimic this weekend, and with new autotune memory, 
>> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
>>
>>
>> I have done a lot of perf dump and mempool dump and ps of process to 
> see rss memory at different hours, 
>> here the reports for osd.0: 
>>
>> http://odisoweb1.odiso.net/perfanalysis/ 
>>
>>
>> osd has been started the 12-02-2019 at 08:00 
>>
>> first report after 1h running 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>>
>>
>>
>> report after 24 before counter resets 
>>
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>>
>> report 1h after counter reset 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>>
>>
>>
>>
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
> around 12-02-2019 at 14:00 
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>> Then after that, slowly decreasing. 
>>
>>
>> Another strange thing, 
>> I'm seeing total bytes at 5G at 12-02-2018.13:30 
>>
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
>  
>> Then is decreasing over time (around 3,7G this morning), but RSS is 
> still at 8G 
>>
>>
>> I'm graphing mempools counters too since yesterday, so I'll able to 
> track them over time. 
>>
>> - Mail original - 
>> De: "Igor Fedotov"  
>> À: "Alexandre Derumier"  
>> Cc: "Sage Weil" , "ceph-users" 
> , "ceph-devel"  
>> Envoyé: Lundi 11 Février 2019 12:03:17 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
> until restart 
>>
>> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>>> another mempool dump after 1h run. (latency ok) 
>>>
>>> Biggest difference: 
>>>
>>> before restart 
>>> - 
>>> "bluestore_cache_other": { 
>>> "items": 48661920, 
>>> "bytes": 1539544228 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 54, 
>>> "bytes": 643072 
>>> }, 
>>> (other caches seem to be quite low too, like bluestore_cache_other 
> take all the memory) 
>>>
>>>
>>> After restart 
>>> - 
>>> "bluestore_cache_other": { 
>>> "items": 12432298, 
>>> "bytes": 500834899 
>>> }, 
>>> "bluestore_cache_data": { 
>>> "items": 40084, 
>>> "bytes": 1056235520 
>>> }, 
>>>
>> This is fine as cache is warming after restart and some rebalancing 
>> between data and metadata might occur. 
>>
>> What relates to allocator and most 

[ceph-users] Files in CephFS data pool

2019-02-15 Thread Ragan, Tj (Dr.)
Is there anyway to find out which files are stored in a CephFS data pool?  I 
know you can reference the extended attributes, but those are only relevant for 
files created after ceph.dir.layout.pool or ceph.file.layout.pool attributes 
are set - I need to know about all the files in a pool.
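Not authoritative, but the usual trick is that CephFS data objects are named after the file's inode number in hex, so the pool contents can be mapped back to paths on a mounted filesystem. A sketch (pool name, mount point and the example inode are assumptions):

rados -p cephfs_data ls | cut -d. -f1 | sort -u > /tmp/inodes.hex   # one entry per file with objects in this pool
find /mnt/cephfs -inum $((16#10000000000))                          # map one hex inode back to its path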

Thanks!

-TJ Ragan




Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Alexandre DERUMIER
Thanks Igor.

I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
different.

I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't see 
this latency problem.







- Mail original -
De: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 15 Février 2019 13:47:57
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexander, 

I've read through your reports, nothing obvious so far. 

I can only see several times average latency increase for OSD write ops 
(in seconds) 
0.002040060 (first hour) vs. 

0.002483516 (last 24 hours) vs. 
0.008382087 (last hour) 

subop_w_latency: 
0.000478934 (first hour) vs. 
0.000537956 (last 24 hours) vs. 
0.003073475 (last hour) 

and OSD read ops, osd_r_latency: 

0.000408595 (first hour) 
0.000709031 (24 hours) 
0.004979540 (last hour) 

What's interesting is that such latency differences aren't observed at 
neither BlueStore level (any _lat params under "bluestore" section) nor 
rocksdb one. 

Which probably means that the issue is rather somewhere above BlueStore. 

Suggest to proceed with perf dumps collection to see if the picture 
stays the same. 

W.r.t. memory usage you observed I see nothing suspicious so far - No 
decrease in RSS report is a known artifact that seems to be safe. 

Thanks, 
Igor 

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
> Hi Igor, 
> 
> Thanks again for helping ! 
> 
> 
> 
> I have upgrade to last mimic this weekend, and with new autotune memory, 
> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
> 
> 
> I have done a lot of perf dump and mempool dump and ps of process to 
see rss memory at different hours, 
> here the reports for osd.0: 
> 
> http://odisoweb1.odiso.net/perfanalysis/ 
> 
> 
> osd has been started the 12-02-2019 at 08:00 
> 
> first report after 1h running 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
> 
> 
> 
> report after 24 before counter resets 
> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
> 
> report 1h after counter reset 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
> 
> 
> 
> 
> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
around 12-02-2019 at 14:00 
> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
> Then after that, slowly decreasing. 
> 
> 
> Another strange thing, 
> I'm seeing total bytes at 5G at 12-02-2018.13:30 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
 
> Then is decreasing over time (around 3,7G this morning), but RSS is 
still at 8G 
> 
> 
> I'm graphing mempools counters too since yesterday, so I'll able to 
track them over time. 
> 
> - Mail original - 
> De: "Igor Fedotov"  
> À: "Alexandre Derumier"  
> Cc: "Sage Weil" , "ceph-users" 
, "ceph-devel"  
> Envoyé: Lundi 11 Février 2019 12:03:17 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart 
> 
> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>> another mempool dump after 1h run. (latency ok) 
>> 
>> Biggest difference: 
>> 
>> before restart 
>> - 
>> "bluestore_cache_other": { 
>> "items": 48661920, 
>> "bytes": 1539544228 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 54, 
>> "bytes": 643072 
>> }, 
>> (other caches seem to be quite low too, like bluestore_cache_other 
take all the memory) 
>> 
>> 
>> After restart 
>> - 
>> "bluestore_cache_other": { 
>> "items": 12432298, 
>> "bytes": 500834899 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 40084, 
>> "bytes": 1056235520 
>> }, 
>> 
> This is fine as cache is warming after restart and some rebalancing 
> between data and metadata might occur. 
> 
> What relates to allocator and most probably to fragmentation growth is : 
> 
> "bluestore_alloc": { 
> "items": 165053952, 
> "bytes": 165053952 
> }, 
> 
> which had been higher before the reset (if I got these dumps' order 
> properly) 
> 
> "bluestore_alloc": { 
> "items": 210243456, 
> "bytes": 210243456 
> }, 
> 
> But as I mentioned - I'm not 100% sure this might cause such a huge 
> latency increase... 
> 
> Do you have perf counters dump after the restart? 
> 
> Could you collect some more dumps - for both mempool and perf counters? 
> 
> So ideally I'd like to have: 
> 
> 1) mempool/perf counters dumps after the restart (1hour is OK) 
> 
> 2) mempool/perf counters dumps in 24+ hours after restart 
> 
> 3) reset perf counters 

Re: [ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Igor Fedotov

Yeah.

I've been monitoring such issue reports for a while and it looks like 
something is definitely wrong with response times under certain 
circumstances. Not sure if all these reports have the same root cause 
though.


Scrubbing seems to be one of the triggers.

Perhaps we need more low-level detection/warning for high response times 
from HW and/or DB.


Planning to look shortly into how feasible such warning means would be.


Thanks,
Igor
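
(For readers hitting the same kind of stall: the OSD admin socket can show 
what the daemon is actually stuck on. A sketch, assuming osd.417 is the 
implicated daemon; the exact set of commands varies a bit between releases:)

ceph daemon osd.417 dump_ops_in_flight      # ops currently being processed
ceph daemon osd.417 dump_blocked_ops        # ops blocked past the complaint threshold
ceph daemon osd.417 dump_historic_ops       # recently completed ops with per-event timestamps
ceph daemon osd.417 dump_historic_slow_ops  # recent ops that exceeded the slow threshold

The per-event timestamps in the historic dumps usually show which stage 
(queued_for_pg, waiting for subops, commit, etc.) the time was spent in.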

On 2/15/2019 3:24 PM, Denny Kreische wrote:

Hi Igor,

Thanks for your reply.
I can verify, discard is disabled in our cluster:

10:03 root@node106b [fra]:~# ceph daemon osd.417 config show | grep discard
 "bdev_async_discard": "false",
 "bdev_enable_discard": "false",
[...]

So there must be something else causing the problems.

Thanks,
Denny



Am 15.02.2019 um 12:41 schrieb Igor Fedotov :

Hi Denny,

Do not remember exactly when discards appeared in BlueStore but they are 
disabled by default:

See bdev_enable_discard option.


Thanks,

Igor

On 2/15/2019 2:12 PM, Denny Kreische wrote:

Hi,

two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to 
mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs.
somehow we see strange behaviour since then. Single OSDs seem to block for 
around 5 minutes and this causes the whole cluster and connected applications 
to hang. This happened 5 times during the last 10 days at irregular times, it 
didn't happen before the upgrade.

OSD log shows something like this (more log here: 
https://pastebin.com/6BYam5r4):

[...]
2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
[...]

In this example osd.417 seems to have a problem. I can see same log line in 
other osd logs with placement groups related to osd.417.
I assume that all placement groups related to osd.417 are hanging or blocked 
when osd.417 is blocked.

How can I see in detail what might cause a certain OSD to stop working?

The cluster consists of 3 different SSD vendors (micron, samsung, intel), but 
only micron disks are affected until now. we earlier had problems with micron 
SSDs with filestore (xfs), it was fstrim to cause single OSDs to block for 
several minutes. we migrated to bluestore about a year ago. just in case, is 
there any kind of ssd trim/discard happening in bluestore since mimic?

Thanks,
Denny

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Igor Fedotov

Hi Alexander,

I've read through your reports, nothing obvious so far.

I can only see that average latency for OSD write ops increased several-fold 
(values in seconds):

0.002040060 (first hour) vs.
0.002483516 (last 24 hours) vs.
0.008382087 (last hour)

subop_w_latency:
0.000478934 (first hour) vs.
0.000537956 (last 24 hours) vs.
0.003073475 (last hour)

and OSD read ops, osd_r_latency:

0.000408595 (first hour)
0.000709031 (24 hours)
0.004979540 (last hour)
  
What's interesting is that such latency differences aren't observed at either the BlueStore level (any _lat params under the "bluestore" section) or the RocksDB one.


Which probably means that the issue is rather somewhere above BlueStore.

I suggest proceeding with perf dump collection to see if the picture stays the 
same.

W.r.t. the memory usage you observed, I see nothing suspicious so far - the absence 
of a decrease in reported RSS is a known artifact that seems to be safe.

Thanks,
Igor

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote:

Hi Igor,

Thanks again for helping !



I have upgraded to the latest mimic this weekend, and with the new memory autotuning,
I have set osd_memory_target to 8G.  (my NVMe drives are 6TB)


I have done a lot of perf dumps, mempool dumps and ps captures of the process to see RSS 
memory at different hours,
here the reports for osd.0:

http://odisoweb1.odiso.net/perfanalysis/


osd has been started the 12-02-2019 at 08:00

first report after 1h running
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt



report after 24 hours, before the counter reset

http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt

report 1h after counter reset
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt




I'm seeing the bluestore buffer bytes memory increasing up to 4G  around 
12-02-2019 at 14:00
http://odisoweb1.odiso.net/perfanalysis/graphs2.png
Then after that, slowly decreasing.


Another strange thing,
I'm seeing total bytes at 5G at 12-02-2018.13:30
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
Then it is decreasing over time (around 3.7G this morning), but RSS is still at 8G


I'm graphing mempool counters too since yesterday, so I'll be able to track them 
over time.

- Mail original -
De: "Igor Fedotov" 
À: "Alexandre Derumier" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 11 Février 2019 12:03:17
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote:

another mempool dump after 1h run. (latency ok)

Biggest difference:

before restart
-
"bluestore_cache_other": {
"items": 48661920,
"bytes": 1539544228
},
"bluestore_cache_data": {
"items": 54,
"bytes": 643072
},
(other caches seem to be quite low too, like bluestore_cache_other take all the 
memory)


After restart
-
"bluestore_cache_other": {
"items": 12432298,
"bytes": 500834899
},
"bluestore_cache_data": {
"items": 40084,
"bytes": 1056235520
},


This is fine as cache is warming after restart and some rebalancing
between data and metadata might occur.

What relates to allocator and most probably to fragmentation growth is :

"bluestore_alloc": {
"items": 165053952,
"bytes": 165053952
},

which had been higher before the reset (if I got these dumps' order
properly)

"bluestore_alloc": {
"items": 210243456,
"bytes": 210243456
},

But as I mentioned - I'm not 100% sure this might cause such a huge
latency increase...

Do you have perf counters dump after the restart?

Could you collect some more dumps - for both mempool and perf counters?

So ideally I'd like to have:

1) mempool/perf counters dumps after the restart (1hour is OK)

2) mempool/perf counters dumps in 24+ hours after restart

3) reset perf counters after 2), wait for 1 hour (and without OSD
restart) and dump mempool/perf counters again.

So we'll be able to learn both allocator mem usage growth and operation
latency distribution for the following periods:

a) 1st hour after restart

b) 25th hour.


Thanks,

Igor



full mempool dump after restart
---

{
"mempool": {
"by_pool": {
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 165053952,
"bytes": 165053952
},
"bluestore_cache_data": {
"items": 40084,
"bytes": 1056235520
},
"bluestore_cache_onode": {
"items": 5,
"bytes": 14935200
},
"bluestore_cache_other": {
"items": 12432298,
"bytes": 500834899
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 11,
"bytes": 8184
},

Re: [ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Denny Kreische
Hi Igor,

Thanks for your reply.
I can verify, discard is disabled in our cluster:

10:03 root@node106b [fra]:~# ceph daemon osd.417 config show | grep discard
"bdev_async_discard": "false",
"bdev_enable_discard": "false",
[...]

So there must be something else causing the problems.

Thanks,
Denny


> Am 15.02.2019 um 12:41 schrieb Igor Fedotov :
> 
> Hi Denny,
> 
> Do not remember exactly when discards appeared in BlueStore but they are 
> disabled by default:
> 
> See bdev_enable_discard option.
> 
> 
> Thanks,
> 
> Igor
> 
> On 2/15/2019 2:12 PM, Denny Kreische wrote:
>> Hi,
>> 
>> two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to 
>> mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs.
>> somehow we see strange behaviour since then. Single OSDs seem to block for 
>> around 5 minutes and this causes the whole cluster and connected 
>> applications to hang. This happened 5 times during the last 10 days at 
>> irregular times, it didn't happen before the upgrade.
>> 
>> OSD log shows something like this (more log here: 
>> https://pastebin.com/6BYam5r4):
>> 
>> [...]
>> 2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics 
>> reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
>> 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
>> 2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics 
>> reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
>> 0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
>> [...]
>> 
>> In this example osd.417 seems to have a problem. I can see same log line in 
>> other osd logs with placement groups related to osd.417.
>> I assume that all placement groups related to osd.417 are hanging or blocked 
>> when osd.417 is blocked.
>> 
>> How can I see in detail what might cause a certain OSD to stop working?
>> 
>> The cluster consists of 3 different SSD vendors (micron, samsung, intel), 
>> but only micron disks are affected until now. we earlier had problems with 
>> micron SSDs with filestore (xfs), it was fstrim to cause single OSDs to 
>> block for several minutes. we migrated to bluestore about a year ago. just 
>> in case, is there any kind of ssd trim/discard happening in bluestore since 
>> mimic?
>> 
>> Thanks,
>> Denny
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Denny Kreische
IT System Ingenieur und Consultant

Am Teichdamm 20
04680 Colditz

Telefon: 034381 55125
Mobil: 0176 2115 1457

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph

2019-02-15 Thread Marc Roos
 
Use a SCSI disk with the virtio-scsi adapter? I think that is also recommended 
for use with Ceph RBD.
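
(For illustration, a minimal libvirt disk definition along those lines - a 
sketch only; the pool/image name, monitor host and target device are 
placeholders, and the cephx auth/secret section is omitted and has to be 
added on an authenticated cluster:)

    <controller type='scsi' model='virtio-scsi'/>
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' discard='unmap'/>
      <source protocol='rbd' name='mypool/vm_test'>
        <host name='mon-host' port='6789'/>
      </source>
      <target dev='sda' bus='scsi'/>
    </disk>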



-Original Message-
From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] 
Sent: 15 February 2019 13:16
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph

Hi Marc,

I tried this and the problem continues :-(


Em sex, 15 de fev de 2019 às 10:04, Marc Roos  
escreveu:


 

And then in the windows vm
cmd
diskpart
Rescan

Linux vm
echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda)
echo 1 >  /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd)



I have this to, have to do this to:

virsh qemu-monitor-command vps-test2 --hmp "info block"
virsh qemu-monitor-command vps-test2 --hmp "block_resize 
drive-scsi0-0-0-0 12G"





-Original Message-
From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] 
Sent: 15 February 2019 12:59
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph

Hi,

I'm making a environment for VMs with qemu/kvm and Ceph using RBD, 
and 
I'm with the follow problem: The guest VM not recognizes disk 
resize 
(increase). The cenario is:

Host:
Centos 7.6
Libvirt 4.5
Ceph 13.2.4

I follow the following steps to increase the disk (ex: disk 10Gb  
to 
20Gb):


# rbd resize --size 20480 mypool/vm_test 
# virsh blockresize --domain vm_test --path vda --size 20G

But after this steps, the disk in VM continue with original size. 
For 
apply the change, is necessary reboot VM. 
If I use local datastore instead Ceph, the VM recognize new size 
imediatally.

Does anyone have this?  Is this expected?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph

2019-02-15 Thread Gesiel Galvão Bernardes
Hi Marc,

I tried this and the problem continues :-(


Em sex, 15 de fev de 2019 às 10:04, Marc Roos 
escreveu:

>
>
> And then in the windows vm
> cmd
> diskpart
> Rescan
>
> Linux vm
> echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda)
> echo 1 >  /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd)
>
>
>
> I have this to, have to do this to:
>
> virsh qemu-monitor-command vps-test2 --hmp "info block"
> virsh qemu-monitor-command vps-test2 --hmp "block_resize
> drive-scsi0-0-0-0 12G"
>
>
>
>
>
> -Original Message-
> From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com]
> Sent: 15 February 2019 12:59
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph
>
> Hi,
>
> I'm making a environment for VMs with qemu/kvm and Ceph using RBD, and
> I'm with the follow problem: The guest VM not recognizes disk resize
> (increase). The cenario is:
>
> Host:
> Centos 7.6
> Libvirt 4.5
> Ceph 13.2.4
>
> I follow the following steps to increase the disk (ex: disk 10Gb  to
> 20Gb):
>
>
> # rbd resize --size 20480 mypool/vm_test
> # virsh blockresize --domain vm_test --path vda --size 20G
>
> But after this steps, the disk in VM continue with original size. For
> apply the change, is necessary reboot VM.
> If I use local datastore instead Ceph, the VM recognize new size
> imediatally.
>
> Does anyone have this?  Is this expected?
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph

2019-02-15 Thread Marc Roos
 

And then in the windows vm
cmd
diskpart
Rescan

Linux vm
echo 1 >  /sys/class/scsi_device/2\:0\:0\:0/device/rescan (sda)
echo 1 >  /sys/class/scsi_device/2\:0\:3\:0/device/rescan (sdd)

 

I have this to, have to do this to:

virsh qemu-monitor-command vps-test2 --hmp "info block"
virsh qemu-monitor-command vps-test2 --hmp "block_resize 
drive-scsi0-0-0-0 12G"
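
Putting those pieces together, the whole online-resize sequence looks roughly 
like this - a sketch using the names from this thread (mypool/vm_test, vm_test, 
vda); the drive alias passed to block_resize has to be taken from the 
"info block" output:

# grow the RBD image (size is in MB here, i.e. 20G)
rbd resize --size 20480 mypool/vm_test
# tell libvirt/qemu about the new size
virsh blockresize --domain vm_test --path vda --size 20G
# if the guest still shows the old size, check/resize at the qemu layer
virsh qemu-monitor-command vm_test --hmp "info block"
virsh qemu-monitor-command vm_test --hmp "block_resize <drive-alias-from-info-block> 20G"
# inside a Linux guest attached via virtio-scsi, a rescan picks up the new size
echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan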





-Original Message-
From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] 
Sent: 15 February 2019 12:59
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph

Hi,

I'm making a environment for VMs with qemu/kvm and Ceph using RBD, and 
I'm with the follow problem: The guest VM not recognizes disk resize 
(increase). The cenario is:

Host:
Centos 7.6
Libvirt 4.5
Ceph 13.2.4

I follow the following steps to increase the disk (ex: disk 10Gb  to 
20Gb):


# rbd resize --size 20480 mypool/vm_test 
# virsh blockresize --domain vm_test --path vda --size 20G

But after this steps, the disk in VM continue with original size. For 
apply the change, is necessary reboot VM. 
If I use local datastore instead Ceph, the VM recognize new size 
imediatally.

Does anyone have this?  Is this expected?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Online disk resize with Qemu/KVM and Ceph

2019-02-15 Thread Marc Roos
 

I have this to, have to do this to:

virsh qemu-monitor-command vps-test2 --hmp "info block"
virsh qemu-monitor-command vps-test2 --hmp "block_resize 
drive-scsi0-0-0-0 12G"





-Original Message-
From: Gesiel Galvão Bernardes [mailto:gesiel.bernar...@gmail.com] 
Sent: 15 February 2019 12:59
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Online disk resize with Qemu/KVM and Ceph

Hi,

I'm making a environment for VMs with qemu/kvm and Ceph using RBD, and 
I'm with the follow problem: The guest VM not recognizes disk resize 
(increase). The cenario is:

Host:
Centos 7.6
Libvirt 4.5
Ceph 13.2.4

I follow the following steps to increase the disk (ex: disk 10Gb  to 
20Gb):


# rbd resize --size 20480 mypool/vm_test 
# virsh blockresize --domain vm_test --path vda --size 20G

But after this steps, the disk in VM continue with original size. For 
apply the change, is necessary reboot VM. 
If I use local datastore instead Ceph, the VM recognize new size 
imediatally.

Does anyone have this?  Is this expected?


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Online disk resize with Qemu/KVM and Ceph

2019-02-15 Thread Gesiel Galvão Bernardes
Hi,

I'm building an environment for VMs with qemu/kvm and Ceph using RBD, and I have
the following problem: the guest VM does not recognize a disk resize
(increase). The scenario is:

Host:
Centos 7.6
Libvirt 4.5
Ceph 13.2.4

I follow these steps to increase the disk (e.g. from a 10GB disk to 20GB):

# rbd resize --size 20480 mypool/vm_test
# virsh blockresize --domain vm_test --path vda --size 20G

But after these steps, the disk in the VM keeps its original size. To apply
the change, it is necessary to reboot the VM.
If I use a local datastore instead of Ceph, the VM recognizes the new size
immediately.

Has anyone else seen this?  Is this expected?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Igor Fedotov

Hi Denny,

Do not remember exactly when discards appeared in BlueStore but they are 
disabled by default:


See bdev_enable_discard option.


Thanks,

Igor
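
(For reference, checking it - and, only as an experiment, enabling it - looks 
roughly like this; a sketch assuming mimic's admin socket and centralized 
config, and whether a change takes effect at runtime or only after an OSD 
restart should be verified first:)

ceph daemon osd.417 config get bdev_enable_discard
ceph daemon osd.417 config get bdev_async_discard
# only for experimentation; both options are disabled by default
ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true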

On 2/15/2019 2:12 PM, Denny Kreische wrote:

Hi,

two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to 
mimic 13.2.4, cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs.
somehow we see strange behaviour since then. Single OSDs seem to block for 
around 5 minutes and this causes the whole cluster and connected applications 
to hang. This happened 5 times during the last 10 days at irregular times, it 
didn't happen before the upgrade.

OSD log shows something like this (more log here: 
https://pastebin.com/6BYam5r4):

[...]
2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
[...]

In this example osd.417 seems to have a problem. I can see same log line in 
other osd logs with placement groups related to osd.417.
I assume that all placement groups related to osd.417 are hanging or blocked 
when osd.417 is blocked.

How can I see in detail what might cause a certain OSD to stop working?

The cluster consists of 3 different SSD vendors (micron, samsung, intel), but 
only micron disks are affected until now. we earlier had problems with micron 
SSDs with filestore (xfs), it was fstrim to cause single OSDs to block for 
several minutes. we migrated to bluestore about a year ago. just in case, is 
there any kind of ssd trim/discard happening in bluestore since mimic?

Thanks,
Denny

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] single OSDs cause cluster hickups

2019-02-15 Thread Denny Kreische
Hi,

Two weeks ago we upgraded one of our ceph clusters from luminous 12.2.8 to 
mimic 13.2.4; the cluster is SSD-only, bluestore-only, 68 nodes, 408 OSDs.
Somehow we have been seeing strange behaviour since then. Single OSDs seem to block for 
around 5 minutes, and this causes the whole cluster and connected applications 
to hang. This happened 5 times during the last 10 days at irregular times; it 
didn't happen before the upgrade.

OSD log shows something like this (more log here: 
https://pastebin.com/6BYam5r4):

[...]
2019-02-14 23:53:39.754 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 3 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
2019-02-14 23:53:40.706 7f379a368700 -1 osd.417 340516 get_health_metrics 
reporting 7 slow ops, oldest is osd_op(client.84226977.0:5112539976 0.dff 
0.1d783dff (undecoded) ondisk+read+known_if_redirected e340516)
[...]

In this example osd.417 seems to have a problem. I can see the same log line in 
other OSD logs for placement groups related to osd.417.
I assume that all placement groups related to osd.417 are hanging or blocked 
when osd.417 is blocked.

How can I see in detail what might cause a certain OSD to stop working?

The cluster consists of SSDs from 3 different vendors (Micron, Samsung, Intel), but 
only the Micron disks have been affected so far. We had problems earlier with Micron 
SSDs under filestore (xfs): it was fstrim that caused single OSDs to block for 
several minutes. We migrated to bluestore about a year ago. Just in case, is 
there any kind of SSD trim/discard happening in bluestore since mimic?

Thanks,
Denny

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Dan van der Ster
On Fri, Feb 15, 2019 at 12:01 PM Willem Jan Withagen  wrote:
>
> On 15/02/2019 11:56, Dan van der Ster wrote:
> > On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  
> > wrote:
> >>
> >> On 15/02/2019 10:39, Ilya Dryomov wrote:
> >>> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
> 
>  Hi Marc,
> 
>  You can see previous designs on the Ceph store:
> 
>  https://www.proforma.com/sdscommunitystore
> >>>
> >>> Hi Mike,
> >>>
> >>> This site stopped working during DevConf and hasn't been working since.
> >>> I think Greg has contacted some folks about this, but it would be great
> >>> if you could follow up because it's been a couple of weeks now...
> >>
> >> Ilya,
> >>
> >> The site is working for me.
> >> It only does not contain the Nautilus shirts (yet)
> >
> > I found in the past that the http redirection for www.proforma.com
> > doesn't work from over here in Europe.
> > If someone can post the redirection target then we can access it directly.
>
> Like:
>
> https://proformaprostores.com/Category
>
>
> at least, that is where I get directed to.

Exactly! That URL works here at CERN... www.proforma.com is stuck forever.

-- dan


>
> --WjW
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Willem Jan Withagen

On 15/02/2019 11:56, Dan van der Ster wrote:

On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  wrote:


On 15/02/2019 10:39, Ilya Dryomov wrote:

On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:


Hi Marc,

You can see previous designs on the Ceph store:

https://www.proforma.com/sdscommunitystore


Hi Mike,

This site stopped working during DevConf and hasn't been working since.
I think Greg has contacted some folks about this, but it would be great
if you could follow up because it's been a couple of weeks now...


Ilya,

The site is working for me.
It only does not contain the Nautilus shirts (yet)


I found in the past that the http redirection for www.proforma.com
doesn't work from over here in Europe.
If someone can post the redirection target then we can access it directly.


Like:

https://proformaprostores.com/Category


at least, that is where I get directed to.

--WjW



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Eugen Block

I have no issues opening that site from Germany.


Zitat von Dan van der Ster :


On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  wrote:


On 15/02/2019 10:39, Ilya Dryomov wrote:
> On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
>>
>> Hi Marc,
>>
>> You can see previous designs on the Ceph store:
>>
>> https://www.proforma.com/sdscommunitystore
>
> Hi Mike,
>
> This site stopped working during DevConf and hasn't been working since.
> I think Greg has contacted some folks about this, but it would be great
> if you could follow up because it's been a couple of weeks now...

Ilya,

The site is working for me.
It only does not contain the Nautilus shirts (yet)


I found in the past that the http redirection for www.proforma.com
doesn't work from over here in Europe.
If someone can post the redirection target then we can access it directly.

-- dan




--WjW



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Dan van der Ster
On Fri, Feb 15, 2019 at 11:40 AM Willem Jan Withagen  wrote:
>
> On 15/02/2019 10:39, Ilya Dryomov wrote:
> > On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
> >>
> >> Hi Marc,
> >>
> >> You can see previous designs on the Ceph store:
> >>
> >> https://www.proforma.com/sdscommunitystore
> >
> > Hi Mike,
> >
> > This site stopped working during DevConf and hasn't been working since.
> > I think Greg has contacted some folks about this, but it would be great
> > if you could follow up because it's been a couple of weeks now...
>
> Ilya,
>
> The site is working for me.
> It only does not contain the Nautilus shirts (yet)

I found in the past that the http redirection for www.proforma.com
doesn't work from over here in Europe.
If someone can post the redirection target then we can access it directly.

-- dan


>
> --WjW
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Willem Jan Withagen

On 15/02/2019 10:39, Ilya Dryomov wrote:

On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:


Hi Marc,

You can see previous designs on the Ceph store:

https://www.proforma.com/sdscommunitystore


Hi Mike,

This site stopped working during DevConf and hasn't been working since.
I think Greg has contacted some folks about this, but it would be great
if you could follow up because it's been a couple of weeks now...


Ilya,

The site is working for me.
It only does not contain the Nautilus shirts (yet)

--WjW


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mon_data_size_warn limits for large cluster

2019-02-15 Thread M Ranga Swami Reddy
Today I hit the warning again, with the limit at 30G as well...
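
(What the check-and-restart cycle described further down in this thread 
typically looks like on a systemd deployment - a sketch; the mon data path 
assumes the default cluster name "ceph" and that the mon id matches the short 
hostname:)

# store size on each monitor host
du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db
# find the current leader so it can be restarted last
ceph quorum_status -f json-pretty | grep quorum_leader_name
# restart the peons first, then the leader; trimming should start a few minutes later
systemctl restart ceph-mon@$(hostname -s)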

On Thu, Feb 14, 2019 at 7:39 PM Sage Weil  wrote:
>
> On Thu, 7 Feb 2019, Dan van der Ster wrote:
> > On Thu, Feb 7, 2019 at 12:17 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Hi Dan,
> > > >During backfilling scenarios, the mons keep old maps and grow quite
> > > >quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > >awhile, the mon stores will eventually trigger that 15GB alarm.
> > > >But the intended behavior is that once the PGs are all active+clean,
> > > >the old maps should be trimmed and the disk space freed.
> > >
> > > old maps not trimmed after cluster reached to "all+clean" state for all 
> > > PGs.
> > > Is there (known) bug here?
> > > As the size of dB showing > 15G, do I need to run the compact commands
> > > to do the trimming?
> >
> > Compaction isn't necessary -- you should only need to restart all
> > peon's then the leader. A few minutes later the db's should start
> > trimming.
>
> The next time someone sees this behavior, can you please
>
> - enable debug_mon = 20 on all mons (*before* restarting)
>ceph tell mon.* injectargs '--debug-mon 20'
> - wait for 10 minutes or so to generate some logs
> - add 'debug mon = 20' to ceph.conf (on mons only)
> - restart the monitors
> - wait for them to start trimming
> - remove 'debug mon = 20' from ceph.conf (on mons only)
> - tar up the log files, ceph-post-file them, and share them with ticket
> http://tracker.ceph.com/issues/38322
>
> Thanks!
> sage
>
>
>
>
> > -- dan
> >
> >
> > >
> > > Thanks
> > > Swami
> > >
> > > On Wed, Feb 6, 2019 at 6:24 PM Dan van der Ster  
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > With HEALTH_OK a mon data dir should be under 2GB for even such a large 
> > > > cluster.
> > > >
> > > > During backfilling scenarios, the mons keep old maps and grow quite
> > > > quickly. So if you have balancing, pg splitting, etc. ongoing for
> > > > awhile, the mon stores will eventually trigger that 15GB alarm.
> > > > But the intended behavior is that once the PGs are all active+clean,
> > > > the old maps should be trimmed and the disk space freed.
> > > >
> > > > However, several people have noted that (at least in luminous
> > > > releases) the old maps are not trimmed until after HEALTH_OK *and* all
> > > > mons are restarted. This ticket seems related:
> > > > http://tracker.ceph.com/issues/37875
> > > >
> > > > (Over here we're restarting mons every ~2-3 weeks, resulting in the
> > > > mon stores dropping from >15GB to ~700MB each time).
> > > >
> > > > -- Dan
> > > >
> > > >
> > > > On Wed, Feb 6, 2019 at 1:26 PM Sage Weil  wrote:
> > > > >
> > > > > Hi Swami
> > > > >
> > > > > The limit is somewhat arbitrary, based on cluster sizes we had seen 
> > > > > when
> > > > > we picked it.  In your case it should be perfectly safe to increase 
> > > > > it.
> > > > >
> > > > > sage
> > > > >
> > > > >
> > > > > On Wed, 6 Feb 2019, M Ranga Swami Reddy wrote:
> > > > >
> > > > > > Hello -  Are the any limits for mon_data_size for cluster with 2PB
> > > > > > (with 2000+ OSDs)?
> > > > > >
> > > > > > Currently it set as 15G. What is logic behind this? Can we increase
> > > > > > when we get the mon_data_size_warn messages?
> > > > > >
> > > > > > I am getting the mon_data_size_warn message even though there a 
> > > > > > ample
> > > > > > of free space on the disk (around 300G free disk)
> > > > > >
> > > > > > Earlier thread on the same discusion:
> > > > > > https://www.spinics.net/lists/ceph-users/msg42456.html
> > > > > >
> > > > > > Thanks
> > > > > > Swami
> > > > > >
> > > > > >
> > > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Nautilus Release T-shirt Design

2019-02-15 Thread Ilya Dryomov
On Fri, Feb 15, 2019 at 12:05 AM Mike Perez  wrote:
>
> Hi Marc,
>
> You can see previous designs on the Ceph store:
>
> https://www.proforma.com/sdscommunitystore

Hi Mike,

This site stopped working during DevConf and hasn't been working since.
I think Greg has contacted some folks about this, but it would be great
if you could follow up because it's been a couple of weeks now...

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-15 Thread Jeff Layton
On Fri, 2019-02-15 at 15:34 +0800, Marvin Zhang wrote:
> Thanks Jeff.
> If I set Attr_Expiration_Time to zero in the conf, does it mean the timeout
> is zero? If so, every client will see the change immediately. Will it
> hurt performance badly?
> It seems that the GlusterFS FSAL uses UPCALL to invalidate the cache. How
> about the CephFS FSAL?
> 

We mostly suggest ganesha's attribute cache be disabled when exporting
FSAL_CEPH. libcephfs caches attributes too, and it knows the status of
those attributes better than ganesha can.

A call into libcephfs from ganesha to retrieve cached attributes is
mostly just in-memory copies within the same process, so any performance
overhead there is pretty minimal. If we need to go to the network to get
the attributes, then that was a case where the cache should have been
invalidated anyway, and we avoid having to check the validity of the
cache.
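
(For concreteness, that is usually expressed in ganesha.conf roughly like 
this - a sketch only; the export values are placeholders and the exact 
block/parameter names should be checked against the ganesha version in use:)

EXPORT {
    Export_ID = 1;
    Path = "/";
    Pseudo = "/cephfs";
    Access_Type = RW;
    # let libcephfs, not ganesha, decide when attributes are stale
    Attr_Expiration_Time = 0;
    FSAL {
        Name = CEPH;
    }
}

CACHEINODE {
    # commonly suggested alongside FSAL_CEPH to keep ganesha's own caching minimal
    Dir_Chunk = 0;
    NParts = 1;
    Cache_Size = 1;
}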


> On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  wrote:
> > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > Will the client query the 'change' attribute every time before reading, to know
> > > whether the data has been changed?
> > > 
> > >   +-+++-+---+
> > >   | Name| ID | Data Type  | Acc | Defined in|
> > >   +-+++-+---+
> > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   |
> > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
> > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   |
> > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4   |
> > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5   |
> > >   | link_support| 5  | bool   | R   | Section 5.8.1.6   |
> > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   |
> > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   |
> > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   |
> > >   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  |
> > >   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
> > >   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  |
> > >   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  |
> > >   +-+++-+---+
> > > 
> > 
> > Not every time -- only when the cache needs revalidation.
> > 
> > In the absence of a delegation, that happens on a timeout (see the
> > acregmin/acregmax settings in nfs(5)), though things like opens and file
> > locking events also affect when the client revalidates.
> > 
> > When the v4 client does revalidate the cache, it relies heavily on NFSv4
> > change attribute. Cephfs's change attribute is cluster-coherent too, so
> > if the client does revalidate it should see changes made on other
> > servers.
> > 
> > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton  
> > > wrote:
> > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > > > Hi Jeff,
> > > > > Another question is about Client Caching when disabling delegation.
> > > > > I set a breakpoint on nfs4_op_read, which is the OP_READ processing function in
> > > > > nfs-ganesha. Then I read a file, and found that it is hit only once, on
> > > > > the first read, which means later read operations on this file will
> > > > > not trigger OP_READ. They read the data from the client-side cache. Is
> > > > > it right?
> > > > 
> > > > Yes. In the absence of a delegation, the client will periodically query
> > > > for the inode attributes, and will serve reads from the cache if it
> > > > looks like the file hasn't changed.
> > > > 
> > > > > I also checked the nfs client code in the linux kernel. Only when
> > > > > cache_validity is NFS_INO_INVALID_DATA will it send OP_READ again,
> > > > > like this:
> > > > > if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> > > > > ret = nfs_invalidate_mapping(inode, mapping);
> > > > > }
> > > > > Think about this scenario: client1 connects to ganesha1 and client2 connects
> > > > > to ganesha2. I read /1.txt on client1 and client1 will cache the data.
> > > > > Then I modify this file on client2. At that point, how does client1 know the
> > > > > file has been modified, and how will it add NFS_INO_INVALID_DATA to
> > > > > cache_validity?
> > > > 
> > > > Once you modify the code on client2, ganesha2 will request the necessary
> > > > caps from the ceph MDS, and client1 will have its caps revoked. It'll
> > > > then make the change.
> > > > 
> > > > When client1 reads again it will issue a GETATTR against the file [1].
> > > > ganesha1 will then request caps to do the getattr, which will end up
> > > > revoking ganesha2's caps. client1 will then see the change in attributes
> > > > (the change attribute and mtime, most likely) and will invalidate the
> > > > mapping,