Re: [ceph-users] Optimizing terrible RBD performance

2019-10-04 Thread Alexandre DERUMIER
Hi,

>>dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s -

You are testing with a single thread (iodepth=1), sequentially, here.
So only one disk/OSD is written to at a time, and you pay the network round-trip latency on every request.

rados bench does 16 concurrent writes.


Try testing with fio, for example, with a bigger iodepth, small and big block sizes, and sequential vs. random patterns.
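
For example (a sketch - adjust the device and runtime to your setup, and note that writing to /dev/rbd0 is destructive):

fio --name=randwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based
fio --name=seqwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 --rw=write --bs=4M --iodepth=16 --runtime=60 --time_based

With a higher iodepth the writes are spread across many OSDs in parallel, which is much closer to what rados bench measures.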



- Original Message -
From: "Petr Bena" 
To: "ceph-users" 
Sent: Friday, 4 October 2019 17:06:48
Subject: [ceph-users] Optimizing terrible RBD performance

Hello, 

If this is too long for you, TL;DR; section on the bottom 

I created a Ceph cluster made of 3 SuperMicro servers, each with 2 OSDs 
(WD RED spinning drives), and I would like to optimize the performance of 
RBD, which I believe is being held back by some wrong Ceph configuration, 
because from my observation all resources (CPU, RAM, network, disks) are 
basically unused / idling even when I put load on the RBD. 

Each drive should do about 50MB/s read / write. When I run the RADOS 
benchmark, I see values that are somewhat acceptable; the interesting part 
is that during the RADOS benchmark I can see all disks read / write at 
their limits, heavy network utilization and even some CPU utilization. On 
the other hand, when I put any load on the RBD device, performance is 
terrible: reading is very slow (20MB/s), writing as well (5 - 20MB/s), and 
dd if=/dev/zero of=/dev/rbd0 writes at 5MB/s. The weirdest part is that 
resources are almost unused - no CPU usage, no network traffic, minimal 
disk activity. 

It looks to me as if Ceph isn't even trying very hard as long as the 
access is via RBD. Has anyone ever seen this kind of issue? Is there any 
way to track down why it is so slow? Here are some outputs: 

[root@ceph1 cephadm]# ceph --version 
ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
(stable) 
[root@ceph1 cephadm]# ceph health 
HEALTH_OK 

I would expect the write speed to be at least the 50MB/s we get when 
writing to the disks directly; rados bench reaches this speed (sometimes 
even more): 

[root@ceph1 cephadm]# rados bench -p testbench 10 write --no-cleanup 
hints = 1 
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 
4194304 for up to 10 seconds or 0 objects 
Object prefix: benchmark_data_ceph1.lan.insw.cz_60873 
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 
0 0 0 0 0 0 - 0 
1 16 22 6 23.9966 24 0.966194 0.565671 
2 16 37 21 41.9945 60 1.86665 0.720606 
3 16 54 38 50.6597 68 1.07856 0.797677 
4 16 70 54 53.9928 64 1.58914 0.86644 
5 16 83 67 53.5924 52 0.208535 0.884525 
6 16 97 81 53.9923 56 2.22661 0.932738 
7 16 111 95 54.2781 56 1.0294 0.964574 
8 16 133 117 58.4921 88 0.883543 1.03648 
9 16 143 127 56.4369 40 0.352169 1.00382 
10 16 154 138 55.1916 44 0.227044 1.04071 

Read speed is even higher as it's probably reading from multiple devices 
at once: 

[root@ceph1 cephadm]# rados bench -p testbench 100 seq 
hints = 1 
sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s) 
0 0 0 0 0 0 - 0 
1 16 96 80 319.934 320 0.811192 0.174081 
2 13 161 148 295.952 272 0.606672 0.181417 


Running rbd bench shows writes at 50MB/s (which is OK) and reads at 
20MB/s (not so OK), but the REAL performance is much worse - when I 
actually access the block device and try to write or read anything, it's 
sometimes extremely low, as in only 5MB/s or 20MB/s. 

Why is that? What can I do to debug / trace / optimize this issue? I 
don't know if there is any point in upgrading the hardware if, according 
to monitoring, the current HW is basically not being utilized at all. 


TL;DR; 

I created a Ceph cluster from 6 OSDs (dedicated 1G net, 6 x 4TB spinning 
drives). The RADOS performance benchmark shows acceptable performance, 
but RBD performance is absolutely terrible (very slow reads and very slow 
writes). When I put any kind of load on the cluster, almost all resources 
are unused / idling, which makes this feel like a software configuration issue. 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] CephFS snapshot for backup & disaster recovery

2019-08-08 Thread Alexandre DERUMIER
Hi,

>>I'm running a single-host Ceph cluster for CephFS and I'd like to keep 
>>backups in Amazon S3 for disaster recovery. Is there a simple way to extract 
>>a CephFS snapshot as a single file and/or to create a file that represents 
>>the incremental difference between two snapshots?

I think it's on the roadmap for the next Ceph release.
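
In the meantime, a crude workaround (my own sketch, not a built-in feature) is to archive the snapshot directory and stream it to S3, e.g. with the aws cli; the mount point, snapshot name and bucket below are placeholders, and this only gives full archives, not incremental diffs:

tar czf - /mnt/cephfs/.snap/backup-20190808 | aws s3 cp - s3://my-backup-bucket/cephfs-backup-20190808.tar.gz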


- Original Message -
From: "Eitan Mosenkis" 
To: "Vitaliy Filippov" 
Cc: "ceph-users" 
Sent: Monday, 5 August 2019 18:43:00
Subject: Re: [ceph-users] CephFS snapshot for backup & disaster recovery

I'm using it for a NAS to make backups from the other machines on my home 
network. Since everything is in one location, I want to keep a copy offsite for 
disaster recovery. Running Ceph across the internet is not recommended and is 
also very expensive compared to just storing snapshots. 

On Sun, Aug 4, 2019 at 3:08 PM Vitaliy Filippov <vita...@yourcmc.ru> wrote: 



Afaik no. What's the idea of running a single-host cephfs cluster? 

On 4 August 2019 13:27:00 GMT+03:00, Eitan Mosenkis <ei...@mosenkis.net> wrote: 

I'm running a single-host Ceph cluster for CephFS and I'd like to keep backups 
in Amazon S3 for disaster recovery. Is there a simple way to extract a CephFS 
snapshot as a single file and/or to create a file that represents the 
incremental difference between two snapshots? 




-- 
With best regards, 
Vitaliy Filippov 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-07-10 Thread Alexandre DERUMIER
> Can't say anything about latency.

>>Anybody else? Wido?

I've been running it on Mimic for a month, no problems so far, and it 
definitely fixes the latency increasing over time (i.e. no more need to 
restart each OSD every week).

Memory usage is almost the same as before.
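
For reference, enabling it looks roughly like this in ceph.conf (a sketch - the OSDs need a restart to pick it up):

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap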


- Original Message -
From: "Konstantin Shalygin" 
To: "Marc Roos" , "Wido den Hollander" 
Cc: "ceph-users" 
Sent: Wednesday, 10 July 2019 05:56:35
Subject: Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

On 5/28/19 5:16 PM, Marc Roos wrote: 
> I switched first of may, and did not notice to much difference in memory 
> usage. After the restart of the osd's on the node I see the memory 
> consumption gradually getting back to as before. 
> Can't say anything about latency. 


Anybody else? Wido? 

I see many patches from Igor coming to Luminous. Also, the bitmap 
allocator (the default in Nautilus) has been trying to kill Brett 
Chancellor's cluster for a week [1] 



[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-July/035726.html 

k 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] Changing the release cadence

2019-06-05 Thread Alexandre DERUMIER
Hi,


>>- November: If we release Octopus 9 months from the Nautilus release 
>>(planned for Feb, released in Mar) then we'd target this November. We 
>>could shift to a 12 months candence after that. 

For the last two Debian releases, the freeze was around January-February,
so November seems to be a good time for a Ceph release.

- Original Message -
From: "Sage Weil" 
To: "ceph-users" , "ceph-devel" , d...@ceph.io
Sent: Wednesday, 5 June 2019 17:57:52
Subject: Changing the release cadence

Hi everyone, 

Since luminous, we have had the follow release cadence and policy: 
- release every 9 months 
- maintain backports for the last two releases 
- enable upgrades to move either 1 or 2 releases heads 
(e.g., luminous -> mimic or nautilus; mimic -> nautilus or octopus; ...) 

This has mostly worked out well, except that the mimic release received 
less attention than we wanted due to the fact that multiple downstream 
Ceph products (from Red Hat and SUSE) decided to base their next release 
on nautilus. Even though upstream every release is an "LTS" release, as a 
practical matter mimic got less attention than luminous or nautilus. 

We've had several requests/proposals to shift to a 12 month cadence. This 
has several advantages: 

- Stable/conservative clusters only have to be upgraded every 2 years 
(instead of every 18 months) 
- Yearly releases are more likely to intersect with downstream 
distribution release (e.g., Debian). In the past there have been 
problems where the Ceph releases included in consecutive releases of a 
distro weren't easily upgradeable. 
- Vendors that make downstream Ceph distributions/products tend to 
release yearly. Aligning with those vendors means they are more likely 
to productize *every* Ceph release. This will help make every Ceph 
release an "LTS" release (not just in name but also in terms of 
maintenance attention). 

So far the balance of opinion seems to favor a shift to a 12 month 
cycle[1], especially among developers, so it seems pretty likely we'll 
make that shift. (If you do have strong concerns about such a move, now 
is the time to raise them.) 

That brings us to an important decision: what time of year should we 
release? Once we pick the timing, we'll be releasing at that time *every 
year* for each release (barring another schedule shift, which we want to 
avoid), so let's choose carefully! 

A few options: 

- November: If we release Octopus 9 months from the Nautilus release 
(planned for Feb, released in Mar) then we'd target this November. We 
could shift to a 12 month cadence after that. 
- February: That's 12 months from the Nautilus target. 
- March: That's 12 months from when Nautilus was *actually* released. 

November is nice in the sense that we'd wrap things up before the 
holidays. It's less good in that users may not be inclined to install the 
new release when many developers will be less available in December. 

February kind of sucked in that the scramble to get the last few things 
done happened during the holidays. OTOH, we should be doing what we can 
to avoid such scrambles, so that might not be something we should factor 
in. March may be a bit more balanced, with a solid 3 months before when 
people are productive, and 3 months after before they disappear on holiday 
to address any post-release issues. 

People tend to be somewhat less available over the summer months due to 
holidays etc, so an early or late summer release might also be less than 
ideal. 

Thoughts? If we can narrow it down to a few options maybe we could do a 
poll to gauge user preferences. 

Thanks! 
sage 


[1] https://twitter.com/larsmb/status/1130010208971952129 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Alexandre DERUMIER
Many thanks for the analysis !


I'm going to test with 4K on a heavy MSSQL database to see whether I get an improvement in IOPS/latency.
I'll report the results in this thread.
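
(My assumption is that "testing with 4K" means advertising a 4K physical block size to the guest so that Windows issues 4K-aligned writes; in libvirt that would be something like the blockio element below - a sketch, not necessarily the exact fix from the original thread.)

<disk type='network' device='disk'>
  ...
  <blockio logical_block_size='512' physical_block_size='4096'/>
  ...
</disk>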


- Original Message -
From: "Trent Lloyd" 
To: "ceph-users" 
Sent: Friday, 10 May 2019 09:59:39
Subject: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably sized 
OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with NVMe 
Journals. The primary workload is Windows guests backed by Cinder RBD volumes. 
This specific deployment is Ceph Jewel (FileStore + SimpleMessenger); while 
that is EOL, the issue is reproducible on current versions and also on 
BlueStore, although for different reasons than on FileStore. 

Generally the Ceph cluster was suffering from very poor outlier performance, 
the numbers change a little bit depending on the exact situation but roughly 
80% of I/O was happening in a "reasonable" time of 0-200ms but 5-20% of I/O 
operations were taking excessively long anywhere from 500ms through to 10-20+ 
seconds. However the normal metrics for commit and apply latency were normal, 
and in fact, this latency was hard to spot in the performance metrics available 
in jewel. 

Previously I more simply considered FileStore to have the "commit" (to journal) 
stage where it was written to the journal and it is OK to return to the client 
and then the "apply" (to disk) stage where it was flushed to disk and confirmed 
so that the data could be purged from the journal. However there is really a 
third stage in the middle where FileStore submits the I/O to the operating 
system and this is done before the lock on the object is released. Until that 
succeeds another operation cannot write to the same object (generally being a 
4MB area of the disk). 

I found that the fstore_op threads would get stuck for hundreds of MS or more 
inside of pwritev() which was blocking inside of the kernel. Normally we expect 
pwritev() to be buffered I/O into the page cache and return quite fast however 
in this case the kernel was in a few percent of cases blocking with the stack 
trace included at the end of the e-mail [1]. My finding from that stack is that 
inside __block_write_begin_int we see a call to out_of_line_wait_on_bit call 
which is really an inlined call for wait_on_buffer which occurs in 
linux/fs/buffer.c in the section around line 2000-2024 with the comment "If we 
issued read requests - let them complete." (https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002) 

My interpretation of that code is that for Linux to store a write in the page 
cache, it has to have the entire 4K page as that is the granularity of which it 
tracks the dirty state and it needs the entire 4K page to later submit back to 
the disk. Since we wrote a part of the page, and the page wasn't already in the 
cache, it has to fetch the remainder of the page from the disk. When this 
happens, it blocks waiting for this read to complete before returning from the 
pwritev() call - hence our normally buffered write blocks. This holds up the 
tp_fstore_op thread, of which there are (by default) only 2-4 such threads 
trying to process several hundred operations per second. Additionally the size 
of the osd_op_queue is bounded, and operations do not clear out of this queue 
until the tp_fstore_op thread is done. Which ultimately means that not only are 
these partial writes delayed but it knocks on to delay other writes behind them 
because of the constrained thread pools. 

What was further confusing to this, is that I could easily reproduce this in a 
test deployment using an rbd benchmark that was only writing to a total disk 
size of 256MB which I would easily have expected to fit in the page cache: 
rbd create -p rbd --size=256M bench2 
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M 
--io-pattern rand 

This is explained by the fact that on secondary OSDs (at least, there was some 
refactoring of fadvise which I have not fully understood as of yet), FileStore 
is using fadvise FADVISE_DONTNEED on the objects after write which causes the 
kernel to immediately discard them from the page cache without any regard to 
their statistics of being recently/frequently used. The motivation for this 
addition appears to be that on a secondary OSD we don't service reads (only 
writes) and so therefore we can optimize memory usage by throwing away this 
object and in theory leaving more room in the page cache for objects which we 
are primary for and expect to actually service reads from a client for. 
Unfortunately this behavior does not take into account partial writes, where we 
now pathologically throw away the cached copy instantly such that a write 

Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Alexandre DERUMIER
Since I switched to chrony instead of ntpd/openntpd, I don't get clock skew anymore.
 
(chrony is much faster to resync.)
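
For reference, a minimal chrony.conf sketch (server names are placeholders; makestep lets chrony step the clock right after boot instead of slewing for minutes):

server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
makestep 1.0 3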

- Original Message -
From: "Jan Kasprzak" 
To: "ceph-users" 
Sent: Wednesday, 15 May 2019 13:47:57
Subject: [ceph-users] How do you deal with "clock skew detected"?

Hello, Ceph users, 

how do you deal with the "clock skew detected" HEALTH_WARN message? 

I think the internal RTC in most x86 servers only has 1-second resolution, 
but the Ceph skew limit is much smaller than that. So every time I reboot 
one of my mons (for a kernel upgrade or something), I have to wait several 
minutes for the system clock to synchronize over NTP, even though ntpd 
had been running before the reboot and was started again during system boot. 

Thanks, 

-Yenya 

-- 
| Jan "Yenya" Kasprzak  | 
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 | 
sir_clive> I hope you don't mind if I steal some of your ideas? 
laryross> As far as stealing... we call it sharing here. --from rcgroups 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] VM management setup

2019-04-24 Thread Alexandre DERUMIER
+1 for Proxmox (I'm a contributor, and I can say that the Ceph support is very good).

- Original Message -
From: jes...@krogh.cc
To: "ceph-users" 
Sent: Friday, 5 April 2019 21:34:02
Subject: [ceph-users] VM management setup

Hi. Knowing this is a bit off-topic but seeking recommendations 
and advise anyway. 

We're seeking a "management" solution for VMs - we currently have 40-50 
VMs - and would like better tooling for managing them: potentially 
migrating them across multiple hosts, setting up block devices, etc. 

This is only to be used internally in a department where a bunch of 
engineering people will manage it; no customers or that kind of thing. 

Up until now we have been using virt-manager with kvm - and have been 
quite satisfied when we were in the "few vms", but it seems like the 
time to move on. 

Thus we're looking for something "simple" that can help manage a ceph+kvm 
based setup - the simpler and more to the point the better. 

Any recommendations? 

.. found a lot of names already .. 
OpenStack 
CloudStack 
Proxmox 
.. 

But recommendations are truly welcome. 

Thanks. 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] Intel D3-S4610 performance

2019-03-14 Thread Alexandre DERUMIER
Hi,

I'm running DC P4610 6TB (NVMe) drives, with no performance problems.

Not sure what the difference with the D3-S4610 is.



- Original Message -
From: "Kai Wembacher" 
To: "ceph-users" 
Sent: Tuesday, 12 March 2019 09:13:44
Subject: [ceph-users] Intel D3-S4610 performance



Hi everyone, 



I have an Intel D3-S4610 SSD with 1.92 TB here for testing and get some pretty 
bad numbers when running the fio benchmark suggested by Sébastien Han 
(http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/): 



Intel D3-S4610 1.92 TB 

--numjobs=1 write: IOPS=3860, BW=15.1MiB/s (15.8MB/s)(905MiB/60001msec) 

--numjobs=2 write: IOPS=7138, BW=27.9MiB/s (29.2MB/s)(1673MiB/60001msec) 

--numjobs=4 write: IOPS=12.5k, BW=48.7MiB/s (51.0MB/s)(2919MiB/60002msec) 
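
(For reference, the fio invocation from that article is, if I remember correctly, a sync 4k write test along these lines - destructive on the target device, with --numjobs raised to 2 and 4 for the other rows:)

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test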



Compared to our current Samsung SM863 SSDs the Intel one is about 6x slower. 



Has someone here tested this SSD and can give me some values for comparison? 



Many thanks in advance, 



Kai 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] rbd cache limiting IOPS

2019-03-08 Thread Alexandre DERUMIER
>>(I think I see a PR about this on performance meeting pad some months ago) 

https://github.com/ceph/ceph/pull/25713


- Original Message -
From: "aderumier" 
To: "Engelmann Florian" 
Cc: "ceph-users" 
Sent: Friday, 8 March 2019 15:03:23
Subject: Re: [ceph-users] rbd cache limiting IOPS

>>Which options do we have to increase IOPS while writeback cache is used? 

If I remember correctly, there is some kind of global lock/mutex in the rbd cache code, 

and I think there is some work currently underway to improve it. 

(I think I saw a PR about this on the performance meeting pad some months ago) 

- Mail original - 
De: "Engelmann Florian"  
À: "ceph-users"  
Envoyé: Jeudi 7 Mars 2019 11:41:41 
Objet: [ceph-users] rbd cache limiting IOPS 

Hi, 

we are running an Openstack environment with Ceph block storage. There 
are six nodes in the current Ceph cluster (12.2.10) with NVMe SSDs and a 
P4800X Optane for rocksdb and WAL. 
The decision was made to use rbd writeback cache with KVM/QEMU. The 
write latency is incredible good (~85 µs) and the read latency is still 
good (~0.6ms). But we are limited to ~23.000 IOPS in a KVM machine. So 
we did the same FIO benchmark after we disabled the rbd cache and got 
65.000 IOPS but of course the write latency (QD1) was increased to ~ 0.6ms. 
We tried to tune: 

rbd cache size -> 256MB 
rbd cache max dirty -> 192MB 
rbd cache target dirty -> 128MB 

but still we are locked at ~23.000 IOPS with enabled writeback cache. 

Right now we are not sure if the tuned settings have been honoured by 
libvirt. 

Which options do we have to increase IOPS while writeback cache is used? 

All the best, 
Florian 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] rbd cache limiting IOPS

2019-03-08 Thread Alexandre DERUMIER
>>Which options do we have to increase IOPS while writeback cache is used?

If I remember correctly, there is some kind of global lock/mutex in the rbd cache code,

and I think there is some work currently underway to improve it.

(I think I saw a PR about this on the performance meeting pad some months ago)

- Original Message -
From: "Engelmann Florian" 
To: "ceph-users" 
Sent: Thursday, 7 March 2019 11:41:41
Subject: [ceph-users] rbd cache limiting IOPS

Hi, 

we are running an Openstack environment with Ceph block storage. There 
are six nodes in the current Ceph cluster (12.2.10) with NVMe SSDs and a 
P4800X Optane for rocksdb and WAL. 
The decision was made to use the rbd writeback cache with KVM/QEMU. The 
write latency is incredibly good (~85 µs) and the read latency is still 
good (~0.6 ms). But we are limited to ~23,000 IOPS in a KVM machine. So 
we ran the same fio benchmark after disabling the rbd cache and got 
65,000 IOPS, but of course the write latency (QD1) increased to ~0.6 ms. 
We tried to tune: 

rbd cache size -> 256MB 
rbd cache max dirty -> 192MB 
rbd cache target dirty -> 128MB 
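
(For reference, on the client/hypervisor side these map to ceph.conf options roughly as follows - a sketch using the standard option names, values in bytes:)

[client]
rbd cache = true
rbd cache size = 268435456
rbd cache max dirty = 201326592
rbd cache target dirty = 134217728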

but we are still stuck at ~23,000 IOPS with the writeback cache enabled. 

Right now we are not sure if the tuned settings have been honoured by 
libvirt. 

Which options do we have to increase IOPS while writeback cache is used? 

All the best, 
Florian 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-03-01 Thread Alexandre DERUMIER
Hi,

Some news: it seems it has finally been stable for me for a week (around 0.7 ms average commit latency).
 
http://odisoweb1.odiso.net/osdstable.png

The biggest change was on 18/02, when I finished rebuilding all my OSDs, with 2 OSDs of 3TB per 6TB NVMe.

(Previously I had only done it on one node, so maybe with replication I didn't see the benefit.)

I have also pushed bluestore_cache_kv_max to 1G, kept osd_memory_target at the default, and disabled THP.
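
For reference, roughly what that looks like (a sketch - the ceph.conf value is in bytes, and the THP setting does not persist across reboots):

# ceph.conf, [osd] section
bluestore_cache_kv_max = 1073741824

# disable transparent hugepages until the next reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled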

The different buffers seem to be more constant too.



But clearly, 2 smaller 3TB OSDs with a 3G osd_memory_target each vs. 1 big 6TB OSD with a 6G osd_memory_target behave differently (maybe fragmentation, maybe RocksDB, maybe the number of objects in cache - I really don't know).





- Original Message -
From: "Stefan Kooman" 
To: "Wido den Hollander" 
Cc: "aderumier" , "Igor Fedotov" , "ceph-users" , "ceph-devel" 
Sent: Thursday, 28 February 2019 21:57:05
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Quoting Wido den Hollander (w...@42on.com): 

> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
> OSDs as well. Over time their latency increased until we started to 
> notice I/O-wait inside VMs. 

On a Luminous 12.2.8 cluster with only SSDs we also hit this issue I 
guess. After restarting the OSD servers the latency would drop to normal 
values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj 

Reboots were finished at ~ 19:00. 

Gr. Stefan 

-- 
| BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 
| GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph and TCP States

2019-02-25 Thread Alexandre DERUMIER
Hi,

Sorry to bump this old thread, but I hit this problem recently with a Linux firewall between a CephFS client and the cluster.

The problem was easy to reproduce with:

# the firewall is enabled with

iptables -A FORWARD -m conntrack --ctstate INVALID -j DROP
iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

(conntrack is tracking the cephfs connections)


then flush the rules:
#iptables -F
#iptables -X

(still working, conntrack still has the connections)


then re-enable the rules:
iptables -A FORWARD -m conntrack --ctstate INVALID -j DROP
iptables -A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT


Now the cephfs MDS connection hangs for 15 minutes (the monitor connection is re-established correctly in less than 1s).

The reason is that conntrack flags the packets as invalid because their sequence numbers are out of window (conntrack stopped tracking them when the rules were flushed).

A simple workaround:
net.netfilter.nf_conntrack_tcp_be_liberal=1
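
To apply it immediately and keep it across reboots (a sketch):

sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
echo 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > /etc/sysctl.d/99-conntrack.conf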


Hope this helps!








- Original Message -
From: "Nick Fisk" 
To: "ceph-users" 
Sent: Friday, 21 October 2016 16:19:03
Subject: [ceph-users] Ceph and TCP States

Hi, 

I'm just testing out using a Ceph client in a DMZ behind a FW, separated from 
the main Ceph cluster. One thing I have noticed is that if the state table on 
the FW is emptied - maybe by restarting it or just clearing the state table, 
etc. - then the Ceph client will hang for a long time, as the TCP session can 
no longer pass through the FW and just gets blocked instead. 

I believe this behaviour can be adjusted by the "ms tcp read timeout" setting 
to limit its impact, but wondering if anybody has any 
other ideas. I'm also thinking of experimenting with either stateless FW rules 
for Ceph or getting the FW to send back RST packets 
instead of silently dropping packets. 

Thanks, 
Nick 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-20 Thread Alexandre DERUMIER
On osd.8, at 01:20 when the latency began to increase, I had a scrub running:

2019-02-20 01:16:08.851 7f84d24d9700  0 log_channel(cluster) log [DBG] : 5.52 
scrub starts
2019-02-20 01:17:18.019 7f84ce4d1700  0 log_channel(cluster) log [DBG] : 5.52 
scrub ok
2019-02-20 01:20:31.944 7f84f036e700  0 -- 10.5.0.106:6820/2900 >> 
10.5.0.79:0/2442367265 conn(0x7e120300 :6820 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1)
2019-02-20 01:28:35.421 7f84d34db700  0 log_channel(cluster) log [DBG] : 5.c8 
scrub starts
2019-02-20 01:29:45.553 7f84cf4d3700  0 log_channel(cluster) log [DBG] : 5.c8 
scrub ok
2019-02-20 01:32:45.737 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 
scrub starts
2019-02-20 01:33:56.137 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 
scrub ok


I'll try testing with scrubbing disabled (currently it runs at night, between 01:00 and 05:00).
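
(For anyone wanting to try the same, scrubbing can be paused cluster-wide with the noscrub flags - remember to unset them afterwards:)

ceph osd set noscrub
ceph osd set nodeep-scrub
# later: ceph osd unset noscrub ; ceph osd unset nodeep-scrub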

- Original Message -
From: "aderumier" 
To: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 
Sent: Wednesday, 20 February 2019 12:09:08
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Something interesting, 

when I have restarted osd.8 at 11:20, 

I'm seeing another osd.1 where latency is decreasing exactly at the same time. 
(without restart of this osd). 

http://odisoweb1.odiso.net/osd1.png 

onodes and cache_other are also going down for osd.1 at this time. 




- Mail original - 
De: "aderumier"  
À: "Igor Fedotov"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Mercredi 20 Février 2019 11:39:34 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi, 

I have hit the bug again, but this time only on 1 osd 

here some graphs: 
http://odisoweb1.odiso.net/osd8.png 

latency was good until 01:00 

Then I'm seeing nodes miss, bluestore onodes number is increasing (seem to be 
normal), 
after that latency is slowing increasing from 1ms to 3-5ms 

after osd restart, I'm between 0.7-1ms 


- Mail original - 
De: "aderumier"  
À: "Igor Fedotov"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Mardi 19 Février 2019 17:03:58 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

Seem to be the case, I have compared with sub_op_latency. 

I have changed my graph, to clearly identify the osd where the latency is high. 


I have done some changes in my setup: 
- 2 osd by nvme (2x3TB by osd), with 6GB memory. (instead 1osd of 6TB with 12G 
memory). 
- disabling transparent hugepage 

Since 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free), is lower than before (48GB 
(8osd x 6GB) vs 56GB (4osd x 12GB). 

I'll send more stats tomorrow. 

Alexandre 


- Mail original - 
De: "Igor Fedotov"  
À: "Alexandre Derumier" , "Wido den Hollander" 
 
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Mardi 19 Février 2019 11:12:43 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igot 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> runnigh with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I send results monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also notice, that sometime when I have bad latency on an osd on node1 
> (restarted 12h ago for example). 
> (op_w_process_latency). 
> 
> If I restart osds on other nodes (last restart some days ago, so with bigger 
> latency), it's reducing latency on osd of node1 too. 
> 
> does "op_w_process_latency" counter include replication time ? 
> 
> - Mail original - 
> De: "Wido den Hollander"  
> À: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Envoyé: Vendredi 15 Février 2019 14:59:30 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time the

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-20 Thread Alexandre DERUMIER
Something interesting: when I restarted osd.8 at 11:20, I saw the latency on another OSD, osd.1, decrease at exactly the same time (without restarting that OSD).

http://odisoweb1.odiso.net/osd1.png 

onodes and cache_other are also going down for osd.1 at this time. 




- Original Message -
From: "aderumier" 
To: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 
Sent: Wednesday, 20 February 2019 11:39:34
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi, 

I have hit the bug again, but this time only on 1 osd 

here some graphs: 
http://odisoweb1.odiso.net/osd8.png 

latency was good until 01:00 

Then I'm seeing nodes miss, bluestore onodes number is increasing (seem to be 
normal), 
after that latency is slowing increasing from 1ms to 3-5ms 

after osd restart, I'm between 0.7-1ms 


- Mail original - 
De: "aderumier"  
À: "Igor Fedotov"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Mardi 19 Février 2019 17:03:58 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

Seem to be the case, I have compared with sub_op_latency. 

I have changed my graph, to clearly identify the osd where the latency is high. 


I have done some changes in my setup: 
- 2 osd by nvme (2x3TB by osd), with 6GB memory. (instead 1osd of 6TB with 12G 
memory). 
- disabling transparent hugepage 

Since 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free), is lower than before (48GB 
(8osd x 6GB) vs 56GB (4osd x 12GB). 

I'll send more stats tomorrow. 

Alexandre 


- Mail original - 
De: "Igor Fedotov"  
À: "Alexandre Derumier" , "Wido den Hollander" 
 
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Mardi 19 Février 2019 11:12:43 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igot 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> runnigh with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I send results monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also notice, that sometime when I have bad latency on an osd on node1 
> (restarted 12h ago for example). 
> (op_w_process_latency). 
> 
> If I restart osds on other nodes (last restart some days ago, so with bigger 
> latency), it's reducing latency on osd of node1 too. 
> 
> does "op_w_process_latency" counter include replication time ? 
> 
> - Mail original - 
> De: "Wido den Hollander"  
> À: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Envoyé: Vendredi 15 Février 2019 14:59:30 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I'm also notice it in the vms. BTW, what it your nvme disk size ? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> runnigh with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> - Mail original - 
>> De: "Wido den Hollander"  
>> À: "Alexandre Derumier" , "Igor Fedotov" 
>>  
>> Cc: "ceph-users"

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-20 Thread Alexandre DERUMIER
Hi,

I have hit the bug again, but this time only on one OSD.

Here are some graphs:
http://odisoweb1.odiso.net/osd8.png

Latency was good until 01:00.

Then I'm seeing onode misses and the bluestore onode count increasing (which seems to be normal); after that the latency slowly increases from 1 ms to 3-5 ms.

After an OSD restart, I'm back between 0.7-1 ms.


- Original Message -
From: "aderumier" 
To: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 
Sent: Tuesday, 19 February 2019 17:03:58
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

Seem to be the case, I have compared with sub_op_latency. 

I have changed my graph, to clearly identify the osd where the latency is high. 


I have done some changes in my setup: 
- 2 osd by nvme (2x3TB by osd), with 6GB memory. (instead 1osd of 6TB with 12G 
memory). 
- disabling transparent hugepage 

Since 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free), is lower than before (48GB 
(8osd x 6GB) vs 56GB (4osd x 12GB). 

I'll send more stats tomorrow. 

Alexandre 


- Mail original - 
De: "Igor Fedotov"  
À: "Alexandre Derumier" , "Wido den Hollander" 
 
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Mardi 19 Février 2019 11:12:43 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igot 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> runnigh with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I send results monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also notice, that sometime when I have bad latency on an osd on node1 
> (restarted 12h ago for example). 
> (op_w_process_latency). 
> 
> If I restart osds on other nodes (last restart some days ago, so with bigger 
> latency), it's reducing latency on osd of node1 too. 
> 
> does "op_w_process_latency" counter include replication time ? 
> 
> - Mail original - 
> De: "Wido den Hollander"  
> À: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Envoyé: Vendredi 15 Février 2019 14:59:30 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I'm also notice it in the vms. BTW, what it your nvme disk size ? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> runnigh with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> - Mail original - 
>> De: "Wido den Hollander"  
>> À: "Alexandre Derumier" , "Igor Fedotov" 
>>  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Envoyé: Vendredi 15 Février 2019 14:50:34 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
>> restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
>>> different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
>>> see this latency problem. 
>>

Re: [ceph-users] Replicating CephFS between clusters

2019-02-19 Thread Alexandre DERUMIER
Hi,

I think CephFS snapshot mirroring is coming for Nautilus:

https://www.openstack.org/assets/presentation-media/2018.11.15-openstack-ceph-data-services.pdf
(slide 26)

But I don't know if it's already ready in master?



- Original Message -
From: "Vitaliy Filippov" 
To: "Marc Roos" , "Balazs.Soltesz" , "ceph-users" , "Wido den Hollander" 
Sent: Tuesday, 19 February 2019 23:24:44
Subject: Re: [ceph-users] Replicating CephFS between clusters

> Ah, yes, good question. I don't know if there is a true upper limit, but 
> leaving old snapshot around could hurt you when replaying journals and 
> such. 

Is it still so in Mimic? 

Should I live in fear if I keep old snapshots all the time (because I'm 
using them as "checkpoints")? :) 

-- 
With best regards, 
Vitaliy Filippov 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

2019-02-19 Thread Alexandre DERUMIER
I'm running some S4610s (SSDPE2KE064T8) with firmware VDV10140.

I haven't had any problems with them in 6 months.

But I remember that around September 2017 Supermicro warned me about a firmware bug on the S4600 (I don't know which firmware version).
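
(For reference, the running firmware revision can be checked with nvme-cli; the "FW Rev" column of the listing shows it per drive:)

nvme list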



- Original Message -
From: "David Turner" 
To: "ceph-users" 
Sent: Monday, 18 February 2019 16:44:18
Subject: [ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk 
(partitioned), 3 disks per node, 5 nodes per cluster. The clusters are 12.2.4 
running CephFS and RBDs. So in total we have 15 NVMe's per cluster and 30 
NVMe's in total. They were all built at the same time and were running firmware 
version QDV10130. On this firmware version we early on had 2 disks failures, a 
few months later we had 1 more, and then a month after that (just a few weeks 
ago) we had 7 disk failures in 1 week. 

The failures are such that the disk is no longer visible to the OS. This holds 
true beyond server reboots as well as placing the failed disks into a new 
server. With a firmware upgrade tool we got an error that pretty much said 
there's no way to get data back and to RMA the disk. We upgraded all of our 
remaining disks' firmware to QDV101D1 and haven't had any problems since then. 
Most of our failures happened while rebalancing the cluster after replacing 
dead disks and we tested rigorously around that use case after upgrading the 
firmware. This firmware version seems to have resolved whatever the problem 
was. 

We have about 100 more of these scattered among database servers and other 
servers that have never had this problem while running the QDV10130 firmware as 
well as firmwares between this one and the one we upgraded to. Bluestore on 
Ceph is the only use case we've had so far with this sort of failure. 

Has anyone else come across this issue before? Our current theory is that 
Bluestore is accessing the disk in a way that is triggering a bug in the older 
firmware version that isn't triggered by more traditional filesystems. We have 
a scheduled call with Intel to discuss this, but their preliminary searches 
into the bugfixes and known problems between firmware versions didn't indicate 
the bug that we triggered. It would be good to have some more information about 
what those differences for disk accessing might be to hopefully get a better 
answer from them as to what the problem is. 


[1] https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-19 Thread Alexandre DERUMIER
>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>>
>>So restarting other nodes might affect latencies at this specific OSD. 

That seems to be the case; I have compared with sub_op_latency.

I have changed my graph to clearly identify the OSD where the latency is high. 


I have made some changes to my setup:
- 2 OSDs per NVMe (2x3TB per OSD), with 6GB memory each (instead of 1 OSD of 6TB with 12G memory)
- disabled transparent hugepages

For the last 24h, latencies have stayed low (between 0.7-1.2 ms).

I'm also seeing that the total memory used (#free) is lower than before: 48GB (8 OSDs x 6GB) vs 56GB (4 OSDs x 12GB).

I'll send more stats tomorrow.

Alexandre


- Original Message -
From: "Igor Fedotov" 
To: "Alexandre Derumier" , "Wido den Hollander" 
Cc: "ceph-users" , "ceph-devel" 
Sent: Tuesday, 19 February 2019 11:12:43
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> runnigh with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I send results monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also notice, that sometime when I have bad latency on an osd on node1 
> (restarted 12h ago for example). 
> (op_w_process_latency). 
> 
> If I restart osds on other nodes (last restart some days ago, so with bigger 
> latency), it's reducing latency on osd of node1 too. 
> 
> does "op_w_process_latency" counter include replication time ? 
> 
> - Mail original - 
> De: "Wido den Hollander"  
> À: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Envoyé: Vendredi 15 Février 2019 14:59:30 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>>> OSDs as well. Over time their latency increased until we started to 
>>>> notice I/O-wait inside VMs. 
>> I'm also notice it in the vms. BTW, what it your nvme disk size ? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
>>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>>> these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> runnigh with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> - Mail original - 
>> De: "Wido den Hollander"  
>> À: "Alexandre Derumier" , "Igor Fedotov" 
>>  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Envoyé: Vendredi 15 Février 2019 14:50:34 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
>> restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
>>> different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
>>> see this latency problem. 
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-16 Thread Alexandre DERUMIER
>>There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>runnigh with memory target on 6G right now to make sure there is no 
>>leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>so it will max out on 80GB leaving 16GB as spare. 

Thanks Wido. I'll send results on Monday with my increased memory.



@Igor:

I have also noticed that sometimes, when I have bad latency (op_w_process_latency) on an OSD on node1 (restarted 12h ago, for example), restarting OSDs on other nodes (last restarted some days ago, so with higher latency) reduces the latency on the node1 OSD too.

Does the "op_w_process_latency" counter include replication time?

- Original Message -
From: "Wido den Hollander" 
To: "aderumier" 
Cc: "Igor Fedotov" , "ceph-users" , "ceph-devel" 
Sent: Friday, 15 February 2019 14:59:30
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
>>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>> OSDs as well. Over time their latency increased until we started to 
>>> notice I/O-wait inside VMs. 
> 
> I'm also notice it in the vms. BTW, what it your nvme disk size ? 

Samsung PM983 3.84TB SSDs in both clusters. 

> 
> 
>>> A restart fixed it. We also increased memory target from 4G to 6G on 
>>> these OSDs as the memory would allow it. 
> 
> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
> (my last test was 8gb with 1osd of 6TB, but that didn't help) 

There are 10 OSDs in these systems with 96GB of memory in total. We are 
running with memory target on 6G right now to make sure there is no 
leakage. If this runs fine for a longer period we will go to 8GB per OSD 
so it will max out on 80GB leaving 16GB as spare. 

As these OSDs were all restarted earlier this week I can't tell how it 
will hold up over a longer period. Monitoring (Zabbix) shows the latency 
is fine at the moment. 

Wido 

> 
> 
> - Mail original - 
> De: "Wido den Hollander"  
> À: "Alexandre Derumier" , "Igor Fedotov" 
>  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Vendredi 15 Février 2019 14:50:34 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>> Thanks Igor. 
>> 
>> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
>> different. 
>> 
>> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
>> see this latency problem. 
>> 
>> 
> 
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
> OSDs as well. Over time their latency increased until we started to 
> notice I/O-wait inside VMs. 
> 
> A restart fixed it. We also increased memory target from 4G to 6G on 
> these OSDs as the memory would allow it. 
> 
> But we noticed this on two different 12.2.10/11 clusters. 
> 
> A restart made the latency drop. Not only the numbers, but the 
> real-world latency as experienced by a VM as well. 
> 
> Wido 
> 
>> 
>> 
>> 
>> 
>> 
>> - Mail original - 
>> De: "Igor Fedotov"  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Envoyé: Vendredi 15 Février 2019 13:47:57 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
>> restart 
>> 
>> Hi Alexander, 
>> 
>> I've read through your reports, nothing obvious so far. 
>> 
>> I can only see several times average latency increase for OSD write ops 
>> (in seconds) 
>> 0.002040060 (first hour) vs. 
>> 
>> 0.002483516 (last 24 hours) vs. 
>> 0.008382087 (last hour) 
>> 
>> subop_w_latency: 
>> 0.000478934 (first hour) vs. 
>> 0.000537956 (last 24 hours) vs. 
>> 0.003073475 (last hour) 
>> 
>> and OSD read ops, osd_r_latency: 
>> 
>> 0.000408595 (first hour) 
>> 0.000709031 (24 hours) 
>> 0.004979540 (last hour) 
>> 
>> What's interesting is that such latency differences aren't observed at 
>> neither BlueStore level (any _lat params under "bluestore" section) nor 
>> rocksdb one. 
>> 
>> Which probably means that the issue is rather somewhere above BlueStore. 
>> 
>> Suggest to proceed with perf dumps collection to see if the picture 
>> stays the same. 
>> 
>> W.r.t. memory usage you observed I see nothing suspicious so far - No 
>> decrease in RSS report is a known artifact that seems to be safe. 
>&g

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Alexandre DERUMIER
>>Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>>OSDs as well. Over time their latency increased until we started to 
>>notice I/O-wait inside VMs. 

I'm also noticing it in the VMs. BTW, what is your NVMe disk size?


>>A restart fixed it. We also increased memory target from 4G to 6G on 
>>these OSDs as the memory would allow it. 

I have set the memory target to 6GB this morning, with 2 OSDs of 3TB per 6TB NVMe (my last test was 8GB with 1 OSD of 6TB, but that didn't help).
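
(For reference, a sketch of how the target is set - either in ceph.conf under [osd] or at runtime with the centralized config in Mimic; the value is in bytes:)

osd_memory_target = 6442450944
# or at runtime:
ceph config set osd osd_memory_target 6442450944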


- Original Message -
From: "Wido den Hollander" 
To: "Alexandre Derumier" , "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 
Sent: Friday, 15 February 2019 14:50:34
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until restart

On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
> Thanks Igor. 
> 
> I'll try to create multiple osds by nvme disk (6TB) to see if behaviour is 
> different. 
> 
> I have other clusters (same ceph.conf), but with 1,6TB drives, and I don't 
> see this latency problem. 
> 
> 

Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
OSDs as well. Over time their latency increased until we started to 
notice I/O-wait inside VMs. 

A restart fixed it. We also increased memory target from 4G to 6G on 
these OSDs as the memory would allow it. 

But we noticed this on two different 12.2.10/11 clusters. 

A restart made the latency drop. Not only the numbers, but the 
real-world latency as experienced by a VM as well. 

Wido 

> 
> 
> 
> 
> 
> - Mail original - 
> De: "Igor Fedotov"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Vendredi 15 Février 2019 13:47:57 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Hi Alexander, 
> 
> I've read through your reports, nothing obvious so far. 
> 
> I can only see several times average latency increase for OSD write ops 
> (in seconds) 
> 0.002040060 (first hour) vs. 
> 
> 0.002483516 (last 24 hours) vs. 
> 0.008382087 (last hour) 
> 
> subop_w_latency: 
> 0.000478934 (first hour) vs. 
> 0.000537956 (last 24 hours) vs. 
> 0.003073475 (last hour) 
> 
> and OSD read ops, osd_r_latency: 
> 
> 0.000408595 (first hour) 
> 0.000709031 (24 hours) 
> 0.004979540 (last hour) 
> 
> What's interesting is that such latency differences aren't observed at 
> neither BlueStore level (any _lat params under "bluestore" section) nor 
> rocksdb one. 
> 
> Which probably means that the issue is rather somewhere above BlueStore. 
> 
> Suggest to proceed with perf dumps collection to see if the picture 
> stays the same. 
> 
> W.r.t. memory usage you observed I see nothing suspicious so far - No 
> decrease in RSS report is a known artifact that seems to be safe. 
> 
> Thanks, 
> Igor 
> 
> On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
>> Hi Igor, 
>> 
>> Thanks again for helping ! 
>> 
>> 
>> 
>> I have upgrade to last mimic this weekend, and with new autotune memory, 
>> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
>> 
>> 
>> I have done a lot of perf dump and mempool dump and ps of process to 
> see rss memory at different hours, 
>> here the reports for osd.0: 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/ 
>> 
>> 
>> osd has been started the 12-02-2019 at 08:00 
>> 
>> first report after 1h running 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
>> 
>> 
>> 
>> report after 24 before counter resets 
>> 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
>> 
>> report 1h after counter reset 
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
>> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
>  
>> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
>> 
>> 
>> 
>> 
>> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
> around 12-02-2019 at 14:00 
>> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
>> Then after that, slowly decreasing. 
>> 
>> 
>> Another strange thing, 
>> I'm seeing total bytes at 5G 

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-15 Thread Alexandre DERUMIER
Thanks Igor.

I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour is 
different.

I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't see 
this latency problem.
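
As a sketch of how such a split can be deployed (the device path and OSD count are 
examples; double-check the ceph-volume flags on your release, and note this destroys 
all data on that device):

ceph-volume lvm zap /dev/nvme0n1 --destroy
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1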







- Mail original -
De: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 15 Février 2019 13:47:57
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexander, 

I've read through your reports, nothing obvious so far. 

I can only see several times average latency increase for OSD write ops 
(in seconds) 
0.002040060 (first hour) vs. 

0.002483516 (last 24 hours) vs. 
0.008382087 (last hour) 

subop_w_latency: 
0.000478934 (first hour) vs. 
0.000537956 (last 24 hours) vs. 
0.003073475 (last hour) 

and OSD read ops, osd_r_latency: 

0.000408595 (first hour) 
0.000709031 (24 hours) 
0.004979540 (last hour) 

What's interesting is that such latency differences aren't observed at 
neither BlueStore level (any _lat params under "bluestore" section) nor 
rocksdb one. 

Which probably means that the issue is rather somewhere above BlueStore. 

Suggest to proceed with perf dumps collection to see if the picture 
stays the same. 

W.r.t. memory usage you observed I see nothing suspicious so far - No 
decrease in RSS report is a known artifact that seems to be safe. 

Thanks, 
Igor 

On 2/13/2019 11:42 AM, Alexandre DERUMIER wrote: 
> Hi Igor, 
> 
> Thanks again for helping ! 
> 
> 
> 
> I have upgrade to last mimic this weekend, and with new autotune memory, 
> I have setup osd_memory_target to 8G. (my nvme are 6TB) 
> 
> 
> I have done a lot of perf dump and mempool dump and ps of process to 
see rss memory at different hours, 
> here the reports for osd.0: 
> 
> http://odisoweb1.odiso.net/perfanalysis/ 
> 
> 
> osd has been started the 12-02-2019 at 08:00 
> 
> first report after 1h running 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt 
> 
> 
> 
> report after 24 before counter resets 
> 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt 
> 
> report 1h after counter reset 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
 
> http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt 
> 
> 
> 
> 
> I'm seeing the bluestore buffer bytes memory increasing up to 4G 
around 12-02-2019 at 14:00 
> http://odisoweb1.odiso.net/perfanalysis/graphs2.png 
> Then after that, slowly decreasing. 
> 
> 
> Another strange thing, 
> I'm seeing total bytes at 5G at 12-02-2018.13:30 
> 
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
 
> Then is decreasing over time (around 3,7G this morning), but RSS is 
still at 8G 
> 
> 
> I'm graphing mempools counters too since yesterday, so I'll able to 
track them over time. 
> 
> - Mail original - 
> De: "Igor Fedotov"  
> À: "Alexandre Derumier"  
> Cc: "Sage Weil" , "ceph-users" 
, "ceph-devel"  
> Envoyé: Lundi 11 Février 2019 12:03:17 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart 
> 
> On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
>> another mempool dump after 1h run. (latency ok) 
>> 
>> Biggest difference: 
>> 
>> before restart 
>> - 
>> "bluestore_cache_other": { 
>> "items": 48661920, 
>> "bytes": 1539544228 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 54, 
>> "bytes": 643072 
>> }, 
>> (other caches seem to be quite low too, like bluestore_cache_other 
take all the memory) 
>> 
>> 
>> After restart 
>> - 
>> "bluestore_cache_other": { 
>> "items": 12432298, 
>> "bytes": 500834899 
>> }, 
>> "bluestore_cache_data": { 
>> "items": 40084, 
>> "bytes": 1056235520 
>> }, 
>> 
> This is fine as cache is warming after restart and some rebalancing 
> between data and metadata might occur. 
> 
> What relates to allocator and most probably to fragmentation growth is : 
> 
> "bluestore_alloc": { 
> "items": 165053952, 
> "bytes&quo

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-13 Thread Alexandre DERUMIER
Hi Igor, 

Thanks again for helping !



I have upgraded to the latest Mimic this weekend, and with the new memory autotuning 
I have set osd_memory_target to 8G. (My NVMe drives are 6TB.)


I have taken a lot of perf dumps, mempool dumps and ps snapshots of the process to see RSS 
memory at different hours;
here are the reports for osd.0:

http://odisoweb1.odiso.net/perfanalysis/


osd has been started the 12-02-2019 at 08:00

first report after 1h running
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.09:30.ps.txt



report after 24h, before the counter reset

http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.08:00.ps.txt

report 1h after counter reset
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.perf.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.dump_mempools.txt
http://odisoweb1.odiso.net/perfanalysis/osd.0.13-02-2018.09:00.ps.txt




I'm seeing the bluestore buffer bytes memory increase up to 4G around 
12-02-2019 at 14:00:
http://odisoweb1.odiso.net/perfanalysis/graphs2.png
Then after that it slowly decreases.


Another strange thing:
I'm seeing total bytes at 5G at 12-02-2018.13:30
http://odisoweb1.odiso.net/perfanalysis/osd.0.12-02-2018.13:30.dump_mempools.txt
Then it decreases over time (around 3.7G this morning), but RSS is still at 8G.


I've also been graphing the mempool counters since yesterday, so I'll be able to track 
them over time.
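
A minimal sketch of that kind of collection loop (osd.0, the 10-minute interval and 
the output paths are only examples; assumes jq and access to the admin socket):

while true; do
    ts=$(date +%F.%H:%M)
    ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool' > osd.0.$ts.dump_mempools.json
    ceph daemon osd.0 perf dump > osd.0.$ts.perf.json
    ps -C ceph-osd -o pid=,rss= > osd.0.$ts.ps.txt    # RSS of all ceph-osd processes on this host
    sleep 600
done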

- Mail original -
De: "Igor Fedotov" 
À: "Alexandre Derumier" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 11 Février 2019 12:03:17
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/8/2019 6:57 PM, Alexandre DERUMIER wrote: 
> another mempool dump after 1h run. (latency ok) 
> 
> Biggest difference: 
> 
> before restart 
> - 
> "bluestore_cache_other": { 
> "items": 48661920, 
> "bytes": 1539544228 
> }, 
> "bluestore_cache_data": { 
> "items": 54, 
> "bytes": 643072 
> }, 
> (other caches seem to be quite low too, like bluestore_cache_other take all 
> the memory) 
> 
> 
> After restart 
> - 
> "bluestore_cache_other": { 
> "items": 12432298, 
> "bytes": 500834899 
> }, 
> "bluestore_cache_data": { 
> "items": 40084, 
> "bytes": 1056235520 
> }, 
> 
This is fine as cache is warming after restart and some rebalancing 
between data and metadata might occur. 

What relates to allocator and most probably to fragmentation growth is : 

"bluestore_alloc": { 
"items": 165053952, 
"bytes": 165053952 
}, 

which had been higher before the reset (if I got these dumps' order 
properly) 

"bluestore_alloc": { 
"items": 210243456, 
"bytes": 210243456 
}, 

But as I mentioned - I'm not 100% sure this might cause such a huge 
latency increase... 

Do you have perf counters dump after the restart? 

Could you collect some more dumps - for both mempool and perf counters? 

So ideally I'd like to have: 

1) mempool/perf counters dumps after the restart (1hour is OK) 

2) mempool/perf counters dumps in 24+ hours after restart 

3) reset perf counters after 2), wait for 1 hour (and without OSD 
restart) and dump mempool/perf counters again. 

So we'll be able to learn both allocator mem usage growth and operation 
latency distribution for the following periods: 

a) 1st hour after restart 

b) 25th hour. 
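
Spelled out as commands, that plan is roughly the following (osd.0 and the default 
admin socket are only examples; "perf reset all" is assumed to be available on the 
admin socket):

# 1) shortly after restart
ceph daemon osd.0 dump_mempools > mempool.1h.json
ceph daemon osd.0 perf dump > perf.1h.json

# 2) 24+ hours later
ceph daemon osd.0 dump_mempools > mempool.24h.json
ceph daemon osd.0 perf dump > perf.24h.json

# 3) reset counters, wait one hour without restarting the OSD, then dump again
ceph daemon osd.0 perf reset all
sleep 3600
ceph daemon osd.0 dump_mempools > mempool.25h.json
ceph daemon osd.0 perf dump > perf.25h.json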


Thanks, 

Igor 


> full mempool dump after restart 
> --- 
> 
> { 
> "mempool": { 
> "by_pool": { 
> "bloom_filter": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_alloc": { 
> "items": 165053952, 
> "bytes": 165053952 
> }, 
> "bluestore_cache_data": { 
> "items": 40084, 
> "bytes": 1056235520 
> }, 
> "bluestore_cache_onode": { 
> "items": 5, 
> "bytes": 14935200 
> }, 
> "bluestore_cache_other": { 
> "items": 12432298, 
> "bytes": 500834899 
> }, 
> "bluestore_fsck": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_txc": { 
> "items": 11, 
> "bytes": 8184 
> }, 
> "bluestore_writing_deferred": { 
> "items": 5047, 
> "bytes": 22673736 
> }, 
> &

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-08 Thread Alexandre DERUMIER
"op_w_prepare_latency": { 
"avgcount": 10548652, 
"sum": 2499.688609173, 
"avgtime": 0.000236967 
}, 
"op_rw": 23500, 
"op_rw_in_bytes": 64491092, 
"op_rw_out_bytes": 0, 
"op_rw_latency": { 
"avgcount": 23500, 
"sum": 574.395885734, 
"avgtime": 0.024442378 
}, 
"op_rw_process_latency": { 
"avgcount": 23500, 
"sum": 33.841218228, 
"avgtime": 0.001440051 
}, 
"op_rw_prepare_latency": { 
"avgcount": 24071, 
"sum": 7.301280372, 
"avgtime": 0.000303322 
}, 
"op_before_queue_op_lat": { 
"avgcount": 57892986, 
"sum": 1502.117718889, 
"avgtime": 0.25946 
}, 
"op_before_dequeue_op_lat": { 
"avgcount": 58091683, 
"sum": 45194.453254037, 
"avgtime": 0.000777984 
}, 
"subop": 19784758, 
"subop_in_bytes": 547174969754, 
"subop_latency": { 
"avgcount": 19784758, 
"sum": 13019.714424060, 
"avgtime": 0.000658067 
}, 
"subop_w": 19784758, 
"subop_w_in_bytes": 547174969754, 
"subop_w_latency": { 
"avgcount": 19784758, 
"sum": 13019.714424060, 
"avgtime": 0.000658067 
}, 
"subop_pull": 0, 
"subop_pull_latency": { 
"avgcount": 0, 
"sum": 0.0, 
"avgtime": 0.0 
}, 
"subop_push": 0, 
"subop_push_in_bytes": 0, 
"subop_push_latency": { 
"avgcount": 0, 
"sum": 0.0, 
"avgtime": 0.0 
}, 
"pull": 0, 
"push": 2003, 
"push_out_bytes": 5560009728, 
"recovery_ops": 1940, 
"loadavg": 118, 
"buffer_bytes": 0, 
"history_alloc_Mbytes": 0, 
"history_alloc_num": 0, 
"cached_crc": 0, 
"cached_crc_adjusted": 0, 
"missed_crc": 0, 
"numpg": 243, 
"numpg_primary": 82, 
"numpg_replica": 161, 
"numpg_stray": 0, 
"numpg_removing": 0, 
"heartbeat_to_peers": 10, 
"map_messages": 7013, 
"map_message_epochs": 7143, 
"map_message_epoch_dups": 6315, 
"messages_delayed_for_map": 0, 
"osd_map_cache_hit": 203309, 
"osd_map_cache_miss": 33, 
"osd_map_cache_miss_low": 0, 
"osd_map_cache_miss_low_avg": { 
"avgcount": 0, 
"sum": 0 
}, 
"osd_map_bl_cache_hit": 47012, 
"osd_map_bl_cache_miss": 1681, 
"stat_bytes": 6401248198656, 
"stat_bytes_used": 3777979072512, 
"stat_bytes_avail": 2623269126144, 
"copyfrom": 0, 
"tier_promote": 0, 
"tier_flush": 0, 
"tier_flush_fail": 0, 
"tier_try_flush": 0, 
"tier_try_flush_fail": 0, 
"tier_evict": 0, 
"tier_whiteout": 1631, 
"tier_dirty": 22360, 
"tier_clean": 0, 
"tier_delay": 0, 
"tier_proxy_read": 0, 
"tier_proxy_write": 0, 
"agent_wake": 0, 
"agent_skip": 0, 
"agent_flush": 0, 
"agent_evict": 0, 
"object_ctx_cache_hit": 16311156, 
"object_ctx_cache_total": 17426393, 
"op_cache_hit": 0, 
"osd_tier_flush_lat": { 
"avgcount": 0, 
"sum": 0.0, 
"avgtime": 0.0 
}, 
"osd_tier_promote_lat": { 
"avgcount": 0, 
"sum": 0.0, 
"avgtime": 0.0 
}, 
"osd_tier_r_lat": { 
"avgcount": 0, 
"sum": 0.0, 
"avgtime": 0.0 
}, 
"osd_pg_info": 30483113, 
"osd_pg_fastinfo": 29619885, 
"osd_pg_biginfo": 81703 
}, 
"recoverystate_perf": { 
"initial_latency": { 
"avgcount": 243, 
"sum": 6.869296500, 
"avgtime": 0.028268709 
}, 
"started_latency": { 
"avgcount": 1125, 
"sum": 13551384.917335850, 
"avgtime": 12045.675482076 
}, 
"reset_latency": { 
"avgcount": 1368, 
"sum": 1101.727799040, 
"avgtime": 0.805356578 
}, 
"start_latency": { 
"avgcount": 1368, 
"sum": 0.002014799, 
"avgtime": 0.01472 
}, 
"primary_latency": { 
"avgcount": 507, 
"sum": 4575560.638823428, 
"avgtime": 9024.774435549 
}, 
"peering_latency": { 
"avgcount": 550, 
"sum": 499.372283616, 
"avgtime": 0.907949606 
}, 
"backfilling_latency": { 
"avgcount": 0, 
"sum": 0.0, 
"avgtime": 0.0 
}, 
"waitremotebackfillreserved_latency": { 
"avgcount&

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-08 Thread Alexandre DERUMIER
"op_cache_hit": 0,
"osd_tier_flush_lat": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"osd_tier_promote_lat": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"osd_tier_r_lat": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"osd_pg_info": 30483113,
"osd_pg_fastinfo": 29619885,
"osd_pg_biginfo": 81703
},
"recoverystate_perf": {
"initial_latency": {
"avgcount": 243,
"sum": 6.869296500,
"avgtime": 0.028268709
},
"started_latency": {
"avgcount": 1125,
"sum": 13551384.917335850,
    "avgtime": 12045.675482076
},
"reset_latency": {
"avgcount": 1368,
"sum": 1101.727799040,
"avgtime": 0.805356578
},
"start_latency": {
"avgcount": 1368,
"sum": 0.002014799,
"avgtime": 0.01472
},
"primary_latency": {
"avgcount": 507,
"sum": 4575560.638823428,
"avgtime": 9024.774435549
},
"peering_latency": {
"avgcount": 550,
"sum": 499.372283616,
"avgtime": 0.907949606
},
"backfilling_latency": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"waitremotebackfillreserved_latency": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"waitlocalbackfillreserved_latency": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"notbackfilling_latency": {
    "avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"repnotrecovering_latency": {
"avgcount": 1009,
"sum": 8975301.082274411,
"avgtime": 8895.243887288
},
"repwaitrecoveryreserved_latency": {
"avgcount": 420,
"sum": 99.846056520,
"avgtime": 0.237728706
},
"repwaitbackfillreserved_latency": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
},
"reprecovering_latency": {
"avgcount": 420,
"sum": 241.682764382,
"avgtime": 0.575435153
},
"activating_latency": {
"avgcount": 507,
"sum": 16.893347339,
"avgtime": 0.033320211
},
"waitlocalrecoveryreserved_latency": {
"avgcount": 199,
"sum": 672.335512769,
"avgtime": 3.378570415
},
"waitremoterecoveryreserved_latency": {
"avgcount": 199,
"sum": 213.536439363,
    "avgtime": 1.073047433
},
"recovering_latency": {
"avgcount": 199,
"sum": 79.007696479,
"avgtime": 0.397023600
},
"recovered_latency": {
"avgcount": 507,
"sum": 14.000732748,
"avgtime": 0.027614857
},
"clean_latency": {
"avgcount": 395,
"sum": 4574325.900371083,
"avgtime": 11580.571899673
},
"active_latency": {
"avgcount": 425,
"sum": 4575107.630123680,
"avgtime": 10764.959129702
},
"replicaactive_latency": {
"avgcount": 589,
"sum": 8975184.499049954,
"avgtime": 15238.004242869
},
"stray_latency": {
"avgcount": 818,
"sum": 800.729455666,
"avgtime": 0.978886865
},
 

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-08 Thread Alexandre DERUMIER
>>hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
>>it? 
yes
>>The same for other OSDs? 
yes



>>Wondering if you have OSD mempool monitoring (dump_mempools command 
>>output on admin socket) reports? Do you have any historic data? 

Not currently (I only have perf dump); I'll add them to my monitoring stats.


>>If not may I have current output and say a couple more samples with 
>>8-12 hours interval? 

I'll do it next week.

Thanks again for helping.


- Mail original -
De: "Igor Fedotov" 
À: "aderumier" 
Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" 
, "Sage Weil" , "ceph-users" 
, "ceph-devel" 
Envoyé: Mardi 5 Février 2019 18:56:51
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 2/4/2019 6:40 PM, Alexandre DERUMIER wrote: 
>>> but I don't see l_bluestore_fragmentation counter. 
>>> (but I have bluestore_fragmentation_micros) 
> ok, this is the same 
> 
> b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros", 
> "How fragmented bluestore free space is (free extents / max possible number 
> of free extents) * 1000"); 
> 
> 
> Here a graph on last month, with bluestore_fragmentation_micros and latency, 
> 
> http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png 

hmm, so fragmentation grows eventually and drops on OSD restarts, isn't 
it? The same for other OSDs? 

This proves some issue with the allocator - generally fragmentation 
might grow but it shouldn't reset on restart. Looks like some intervals 
aren't properly merged in run-time. 

On the other side I'm not completely sure that latency degradation is 
caused by that - fragmentation growth is relatively small - I don't see 
how this might impact performance that high. 

Wondering if you have OSD mempool monitoring (dump_mempools command 
output on admin socket) reports? Do you have any historic data? 

If not may I have current output and say a couple more samples with 
8-12 hours interval? 


Wrt to backporting bitmap allocator to mimic - we haven't had such plans 
before that but I'll discuss this at BlueStore meeting shortly. 


Thanks, 

Igor 

> - Mail original - 
> De: "Alexandre Derumier"  
> À: "Igor Fedotov"  
> Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" 
> , "Sage Weil" , "ceph-users" 
> , "ceph-devel"  
> Envoyé: Lundi 4 Février 2019 16:04:38 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Thanks Igor, 
> 
>>> Could you please collect BlueStore performance counters right after OSD 
>>> startup and once you get high latency. 
>>> 
>>> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> I'm already monitoring with 
> "ceph daemon osd.x perf dump ", (I have 2months history will all counters) 
> 
> but I don't see l_bluestore_fragmentation counter. 
> 
> (but I have bluestore_fragmentation_micros) 
> 
> 
>>> Also if you're able to rebuild the code I can probably make a simple 
>>> patch to track latency and some other internal allocator's paramter to 
>>> make sure it's degraded and learn more details. 
> Sorry, It's a critical production cluster, I can't test on it :( 
> But I have a test cluster, maybe I can try to put some load on it, and try to 
> reproduce. 
> 
> 
> 
>>> More vigorous fix would be to backport bitmap allocator from Nautilus 
>>> and try the difference... 
> Any plan to backport it to mimic ? (But I can wait for Nautilus) 
> perf results of new bitmap allocator seem very promising from what I've seen 
> in PR. 
> 
> 
> 
> - Mail original - 
> De: "Igor Fedotov"  
> À: "Alexandre Derumier" , "Stefan Priebe, Profihost AG" 
> , "Mark Nelson"  
> Cc: "Sage Weil" , "ceph-users" 
> , "ceph-devel"  
> Envoyé: Lundi 4 Février 2019 15:51:30 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Hi Alexandre, 
> 
> looks like a bug in StupidAllocator. 
> 
> Could you please collect BlueStore performance counters right after OSD 
> startup and once you get high latency. 
> 
> Specifically 'l_bluestore_fragmentation' parameter is of interest. 
> 
> Also if you're able to rebuild the code I can probably make a simple 
> patch to track latency and some other internal allocator's paramter to 
> make sure it's degraded and learn more details. 
> 
> 
> More vigorous fix would be to backport bitmap allocator from Nautilus 
> and try the d

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
>>but I don't see l_bluestore_fragmentation counter.
>>(but I have bluestore_fragmentation_micros)

OK, this is the same counter:

  b.add_u64(l_bluestore_fragmentation, "bluestore_fragmentation_micros",
"How fragmented bluestore free space is (free extents / max 
possible number of free extents) * 1000");


Here is a graph over the last month, with bluestore_fragmentation_micros and latency:

http://odisoweb1.odiso.net/latency_vs_fragmentation_micros.png
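
For anyone graphing the same thing, the counter can be read straight off the admin 
socket (osd.0 is an example; the counter is assumed to live under the "bluestore" 
section of perf dump, and jq is assumed to be installed):

ceph daemon osd.0 perf dump | jq '.bluestore.bluestore_fragmentation_micros'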

- Mail original -
De: "Alexandre Derumier" 
À: "Igor Fedotov" 
Cc: "Stefan Priebe, Profihost AG" , "Mark Nelson" 
, "Sage Weil" , "ceph-users" 
, "ceph-devel" 
Envoyé: Lundi 4 Février 2019 16:04:38
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Thanks Igor, 

>>Could you please collect BlueStore performance counters right after OSD 
>>startup and once you get high latency. 
>> 
>>Specifically 'l_bluestore_fragmentation' parameter is of interest. 

I'm already monitoring with 
"ceph daemon osd.x perf dump ", (I have 2months history will all counters) 

but I don't see l_bluestore_fragmentation counter. 

(but I have bluestore_fragmentation_micros) 


>>Also if you're able to rebuild the code I can probably make a simple 
>>patch to track latency and some other internal allocator's paramter to 
>>make sure it's degraded and learn more details. 

Sorry, It's a critical production cluster, I can't test on it :( 
But I have a test cluster, maybe I can try to put some load on it, and try to 
reproduce. 



>>More vigorous fix would be to backport bitmap allocator from Nautilus 
>>and try the difference... 

Any plan to backport it to mimic ? (But I can wait for Nautilus) 
perf results of new bitmap allocator seem very promising from what I've seen in 
PR. 



- Mail original - 
De: "Igor Fedotov"  
À: "Alexandre Derumier" , "Stefan Priebe, Profihost AG" 
, "Mark Nelson"  
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel"  
Envoyé: Lundi 4 Février 2019 15:51:30 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexandre, 

looks like a bug in StupidAllocator. 

Could you please collect BlueStore performance counters right after OSD 
startup and once you get high latency. 

Specifically 'l_bluestore_fragmentation' parameter is of interest. 

Also if you're able to rebuild the code I can probably make a simple 
patch to track latency and some other internal allocator's paramter to 
make sure it's degraded and learn more details. 


More vigorous fix would be to backport bitmap allocator from Nautilus 
and try the difference... 


Thanks, 

Igor 


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
> Hi again, 
> 
> I speak too fast, the problem has occured again, so it's not tcmalloc cache 
> size related. 
> 
> 
> I have notice something using a simple "perf top", 
> 
> each time I have this problem (I have seen exactly 4 times the same 
> behaviour), 
> 
> when latency is bad, perf top give me : 
> 
> StupidAllocator::_aligned_len 
> and 
> btree::btree_iterator long, unsigned long, std::less, mempoo 
> l::pool_allocator<(mempool::pool_index_t)1, std::pair unsigned long> >, 256> >, std::pair&, 
> std::pair const, unsigned long>*>::increment_slow() 
> 
> (around 10-20% time for both) 
> 
> 
> when latency is good, I don't see them at all. 
> 
> 
> I have used the Mark wallclock profiler, here the results: 
> 
> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
> 
> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
> 
> 
> here an extract of the thread with btree::btree_iterator && 
> StupidAllocator::_aligned_len 
> 
> 
> + 100.00% clone 
> + 100.00% start_thread 
> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
> boost::intrusive_ptr, ThreadPool::TPHandle&) 
> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr) 
> | | + 68.00% 
> ReplicatedBackend::_handle_message(boost::intrusive_ptr) 
> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr) 
> | | + 67.00% non-virtual thunk to 
> PrimaryLogPG::queue_transactions(std::vector std::allocator >&, boost::intrusive_ptr) 
> | | | + 67.00% 
> BlueStore::queue_transactions(boost::intrusi

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
Thanks Igor,

>>Could you please collect BlueStore performance counters right after OSD 
>>startup and once you get high latency. 
>>
>>Specifically 'l_bluestore_fragmentation' parameter is of interest. 

I'm already monitoring with
"ceph daemon osd.x perf dump" (I have 2 months of history with all counters),

but I don't see l_bluestore_fragmentation counter.

(but I have bluestore_fragmentation_micros)


>>Also if you're able to rebuild the code I can probably make a simple 
>>patch to track latency and some other internal allocator's paramter to 
>>make sure it's degraded and learn more details. 

Sorry, it's a critical production cluster, I can't test on it :(
But I have a test cluster; maybe I can put some load on it and try to 
reproduce.



>>More vigorous fix would be to backport bitmap allocator from Nautilus 
>>and try the difference... 

Any plan to backport it to Mimic? (But I can wait for Nautilus.)
The perf results of the new bitmap allocator seem very promising from what I've seen 
in the PR.
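
Once running a release that ships the bitmap allocator, switching is a one-line 
config change plus an OSD restart (sketch only; verify the option and the backport 
status on your version first):

ceph config set osd bluestore_allocator bitmap
# the allocator is selected at startup, so restart each OSD afterwards
systemctl restart ceph-osd@0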



- Mail original -
De: "Igor Fedotov" 
À: "Alexandre Derumier" , "Stefan Priebe, Profihost AG" 
, "Mark Nelson" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 4 Février 2019 15:51:30
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi Alexandre, 

looks like a bug in StupidAllocator. 

Could you please collect BlueStore performance counters right after OSD 
startup and once you get high latency. 

Specifically 'l_bluestore_fragmentation' parameter is of interest. 

Also if you're able to rebuild the code I can probably make a simple 
patch to track latency and some other internal allocator's paramter to 
make sure it's degraded and learn more details. 


More vigorous fix would be to backport bitmap allocator from Nautilus 
and try the difference... 


Thanks, 

Igor 


On 2/4/2019 5:17 PM, Alexandre DERUMIER wrote: 
> Hi again, 
> 
> I speak too fast, the problem has occured again, so it's not tcmalloc cache 
> size related. 
> 
> 
> I have notice something using a simple "perf top", 
> 
> each time I have this problem (I have seen exactly 4 times the same 
> behaviour), 
> 
> when latency is bad, perf top give me : 
> 
> StupidAllocator::_aligned_len 
> and 
> btree::btree_iterator long, unsigned long, std::less, mempoo 
> l::pool_allocator<(mempool::pool_index_t)1, std::pair unsigned long> >, 256> >, std::pair&, 
> std::pair const, unsigned long>*>::increment_slow() 
> 
> (around 10-20% time for both) 
> 
> 
> when latency is good, I don't see them at all. 
> 
> 
> I have used the Mark wallclock profiler, here the results: 
> 
> http://odisoweb1.odiso.net/gdbpmp-ok.txt 
> 
> http://odisoweb1.odiso.net/gdbpmp-bad.txt 
> 
> 
> here an extract of the thread with btree::btree_iterator && 
> StupidAllocator::_aligned_len 
> 
> 
> + 100.00% clone 
> + 100.00% start_thread 
> + 100.00% ShardedThreadPool::WorkThreadSharded::entry() 
> + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int) 
> + 100.00% OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*) 
> + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
> boost::intrusive_ptr, ThreadPool::TPHandle&) 
> | + 70.00% PrimaryLogPG::do_request(boost::intrusive_ptr&, 
> ThreadPool::TPHandle&) 
> | + 68.00% PGBackend::handle_message(boost::intrusive_ptr) 
> | | + 68.00% 
> ReplicatedBackend::_handle_message(boost::intrusive_ptr) 
> | | + 68.00% ReplicatedBackend::do_repop(boost::intrusive_ptr) 
> | | + 67.00% non-virtual thunk to 
> PrimaryLogPG::queue_transactions(std::vector std::allocator >&, boost::intrusive_ptr) 
> | | | + 67.00% 
> BlueStore::queue_transactions(boost::intrusive_ptr&,
>  std::vector std::allocator >&, boost::intrusive_ptr, 
> ThreadPool::TPHandle*) 
> | | | + 66.00% BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*) 
> | | | | + 66.00% BlueStore::_write(BlueStore::TransContext*, 
> boost::intrusive_ptr&, 
> boost::intrusive_ptr&, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
> | | | | + 66.00% BlueStore::_do_write(BlueStore::TransContext*, 
> boost::intrusive_ptr&, 
> boost::intrusive_ptr, unsigned long, unsigned long, 
> ceph::buffer::list&, unsigned int) 
> | | | | + 65.00% BlueStore::_do_alloc_write(BlueStore::TransContext*, 
> boost::intrusive_ptr, 
> boost::intrusive_ptr, BlueStore::WriteContext*) 
> | | | | | + 64.00% StupidAllocator::allocate(unsigned long, unsign

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
Hi again,

I spoke too fast; the problem has occurred again, so it's not tcmalloc cache 
size related.


I have noticed something using a simple "perf top":

each time I have this problem (I have seen exactly the same behaviour 4 times),

when latency is bad, perf top gives me:

StupidAllocator::_aligned_len
and
btree::btree_iterator, mempoo
l::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, 
std::pair*>::increment_slow()

(around 10-20% time for both)


when latency is good, I don't see them at all.
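
For reference, a generic way to capture such a sample from a single OSD with the 
standard perf tooling (the pid lookup is only illustrative; adjust it to the OSD you 
want to watch):

OSD_PID=$(pgrep -of ceph-osd)
perf top -p "$OSD_PID"                       # live view of the hottest symbols
perf record -g -p "$OSD_PID" -- sleep 30     # 30s call-graph sample
perf report --stdio | head -50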


I have used Mark's wallclock profiler; here are the results:

http://odisoweb1.odiso.net/gdbpmp-ok.txt

http://odisoweb1.odiso.net/gdbpmp-bad.txt


here an extract of the thread with btree::btree_iterator && 
StupidAllocator::_aligned_len


+ 100.00% clone
  + 100.00% start_thread
+ 100.00% ShardedThreadPool::WorkThreadSharded::entry()
  + 100.00% ShardedThreadPool::shardedthreadpool_worker(unsigned int)
+ 100.00% OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)
  + 70.00% PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
  | + 70.00% OSD::dequeue_op(boost::intrusive_ptr, 
boost::intrusive_ptr, ThreadPool::TPHandle&)
  |   + 70.00% 
PrimaryLogPG::do_request(boost::intrusive_ptr&, 
ThreadPool::TPHandle&)
  | + 68.00% 
PGBackend::handle_message(boost::intrusive_ptr)
  | | + 68.00% 
ReplicatedBackend::_handle_message(boost::intrusive_ptr)
  | |   + 68.00% 
ReplicatedBackend::do_repop(boost::intrusive_ptr)
  | | + 67.00% non-virtual thunk to 
PrimaryLogPG::queue_transactions(std::vector >&, boost::intrusive_ptr)
  | | | + 67.00% 
BlueStore::queue_transactions(boost::intrusive_ptr&,
 std::vector 
>&, boost::intrusive_ptr, ThreadPool::TPHandle*)
  | | |   + 66.00% 
BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)
  | | |   | + 66.00% 
BlueStore::_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
  | | |   |   + 66.00% 
BlueStore::_do_write(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long, unsigned long, 
ceph::buffer::list&, unsigned int)
  | | |   | + 65.00% 
BlueStore::_do_alloc_write(BlueStore::TransContext*, 
boost::intrusive_ptr, 
boost::intrusive_ptr, BlueStore::WriteContext*)
  | | |   | | + 64.00% StupidAllocator::allocate(unsigned 
long, unsigned long, unsigned long, long, std::vector >*)
  | | |   | | | + 64.00% 
StupidAllocator::allocate_int(unsigned long, unsigned long, long, unsigned 
long*, unsigned int*)
  | | |   | | |   + 34.00% 
btree::btree_iterator, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >, std::pair&, std::pair*>::increment_slow()
  | | |   | | |   + 26.00% 
StupidAllocator::_aligned_len(interval_set, 
mempool::pool_allocator<(mempool::pool_index_t)1, std::pair >, 256> >::iterator, unsigned long)



- Mail original -
De: "Alexandre Derumier" 
À: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Lundi 4 Février 2019 09:38:11
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi, 

some news: 

I have tried with different transparent hugepage values (madvise, never) : no 
change 

I have tried to increase bluestore_cache_size_ssd to 8G: no change 

I have tried to increase TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256mb : it 
seem to help, after 24h I'm still around 1,5ms. (need to wait some more days to 
be sure) 


Note that this behaviour seem to happen really faster (< 2 days) on my big nvme 
drives (6TB), 
my others clusters user 1,6TB ssd. 

Currently I'm using only 1 osd by nvme (I don't have more than 5000iops by 
osd), but I'll try this week with 2osd by nvme, to see if it's helping. 


BTW, does somebody have already tested ceph without tcmalloc, with glibc >= 
2.26 (which have also thread cache) ? 


Regards, 

Alexandre 


- Mail original - 
De: "aderumier"  
À: "Stefan Priebe, Profihost AG"  
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel"  
Envoyé: Mercredi 30 Janvier 2019 19:58:15 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

>>Thanks. Is there any reason you monitor op_w_latency but not 
>>op_r_latency but instead op_latency? 
>> 
>>Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

I monitor read too. (I have all metrics for osd sockets, and a lot o

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-04 Thread Alexandre DERUMIER
Hi,

some news:

I have tried different transparent hugepage values (madvise, never): no 
change.

I have tried to increase bluestore_cache_size_ssd to 8G: no change

I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES to 256MB: it 
seems to help; after 24h I'm still around 1.5ms. (I need to wait a few more days to 
be sure.)
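
One way to apply that setting persistently is a systemd drop-in for the OSD units 
(a sketch only; on some distributions /etc/sysconfig/ceph or /etc/default/ceph is the 
usual place for this environment variable instead):

mkdir -p /etc/systemd/system/ceph-osd@.service.d
cat > /etc/systemd/system/ceph-osd@.service.d/tcmalloc.conf <<'EOF'
[Service]
Environment=TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456
EOF
systemctl daemon-reload
systemctl restart ceph-osd@0        # repeat per OSD, e.g. during a maintenance window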


Note that this behaviour seems to happen much faster (< 2 days) on my big NVMe 
drives (6TB);
my other clusters use 1.6TB SSDs.

Currently I'm using only 1 OSD per NVMe (I don't have more than 5000 IOPS per 
OSD), but I'll try 2 OSDs per NVMe this week to see if it helps.


BTW, has anybody already tested Ceph without tcmalloc, with glibc >= 
2.26 (which also has a thread cache)?


Regards,

Alexandre


- Mail original -
De: "aderumier" 
À: "Stefan Priebe, Profihost AG" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Mercredi 30 Janvier 2019 19:58:15
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

>>Thanks. Is there any reason you monitor op_w_latency but not 
>>op_r_latency but instead op_latency? 
>> 
>>Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

I monitor read too. (I have all metrics for osd sockets, and a lot of graphs). 

I just don't see latency difference on reads. (or they are very very small vs 
the write latency increase) 



- Mail original - 
De: "Stefan Priebe, Profihost AG"  
À: "aderumier"  
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel"  
Envoyé: Mercredi 30 Janvier 2019 19:50:20 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi, 

Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: 
> Hi Stefan, 
> 
>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>> like suggested. This report makes me a little nervous about my change. 
> Well,I'm really not sure that it's a tcmalloc bug. 
> maybe bluestore related (don't have filestore anymore to compare) 
> I need to compare with bigger latencies 
> 
> here an example, when all osd at 20-50ms before restart, then after restart 
> (at 21:15), 1ms 
> http://odisoweb1.odiso.net/latencybad.png 
> 
> I observe the latency in my guest vm too, on disks iowait. 
> 
> http://odisoweb1.odiso.net/latencybadvm.png 
> 
>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>> exact values out of the daemon do you use for bluestore? 
> 
> here my influxdb queries: 
> 
> It take op_latency.sum/op_latency.avgcount on last second. 
> 
> 
> SELECT non_negative_derivative(first("op_latency.sum"), 
> 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" 
> WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter 
> GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" 
> WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ 
> AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM 
> "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ 
> /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
> fill(previous) 

Thanks. Is there any reason you monitor op_w_latency but not 
op_r_latency but instead op_latency? 

Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

greets, 
Stefan 

> 
> 
> 
> 
> 
> - Mail original - 
> De: "Stefan Priebe, Profihost AG"  
> À: "aderumier" , "Sage Weil"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Mercredi 30 Janvier 2019 08:45:33 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Hi, 
> 
> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: 
>> Hi, 
>> 
>> here some new results, 
>> different osd/ different cluster 
>> 
>> before osd restart latency was between 2-5ms 
>> after osd restart is around 1-1.5ms 
>> 
>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>> 
>> Fro

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-30 Thread Alexandre DERUMIER
>>Thanks. Is there any reason you monitor op_w_latency but not 
>>op_r_latency but instead op_latency? 
>>
>>Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

I monitor reads too (I have all metrics from the OSD sockets, and a lot of graphs).

I just don't see a latency difference on reads (or it is very, very small compared to 
the write latency increase).



- Mail original -
De: "Stefan Priebe, Profihost AG" 
À: "aderumier" 
Cc: "Sage Weil" , "ceph-users" , 
"ceph-devel" 
Envoyé: Mercredi 30 Janvier 2019 19:50:20
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi, 

Am 30.01.19 um 14:59 schrieb Alexandre DERUMIER: 
> Hi Stefan, 
> 
>>> currently i'm in the process of switching back from jemalloc to tcmalloc 
>>> like suggested. This report makes me a little nervous about my change. 
> Well,I'm really not sure that it's a tcmalloc bug. 
> maybe bluestore related (don't have filestore anymore to compare) 
> I need to compare with bigger latencies 
> 
> here an example, when all osd at 20-50ms before restart, then after restart 
> (at 21:15), 1ms 
> http://odisoweb1.odiso.net/latencybad.png 
> 
> I observe the latency in my guest vm too, on disks iowait. 
> 
> http://odisoweb1.odiso.net/latencybadvm.png 
> 
>>> Also i'm currently only monitoring latency for filestore osds. Which 
>>> exact values out of the daemon do you use for bluestore? 
> 
> here my influxdb queries: 
> 
> It take op_latency.sum/op_latency.avgcount on last second. 
> 
> 
> SELECT non_negative_derivative(first("op_latency.sum"), 
> 1s)/non_negative_derivative(first("op_latency.avgcount"),1s) FROM "ceph" 
> WHERE "host" =~ /^([[host]])$/ AND "id" =~ /^([[osd]])$/ AND $timeFilter 
> GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s) FROM "ceph" 
> WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ /^([[osd]])$/ 
> AND $timeFilter GROUP BY time($interval), "host", "id" fill(previous) 
> 
> 
> SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
> 1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s) FROM 
> "ceph" WHERE "host" =~ /^([[host]])$/ AND collection='osd' AND "id" =~ 
> /^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
> fill(previous) 

Thanks. Is there any reason you monitor op_w_latency but not 
op_r_latency but instead op_latency? 

Also why do you monitor op_w_process_latency? but not op_r_process_latency? 

greets, 
Stefan 

> 
> 
> 
> 
> 
> - Mail original - 
> De: "Stefan Priebe, Profihost AG"  
> À: "aderumier" , "Sage Weil"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Mercredi 30 Janvier 2019 08:45:33 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> Hi, 
> 
> Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: 
>> Hi, 
>> 
>> here some new results, 
>> different osd/ different cluster 
>> 
>> before osd restart latency was between 2-5ms 
>> after osd restart is around 1-1.5ms 
>> 
>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
>> 
>> From what I see in diff, the biggest difference is in tcmalloc, but maybe 
>> I'm wrong. 
>> (I'm using tcmalloc 2.5-2.2) 
> 
> currently i'm in the process of switching back from jemalloc to tcmalloc 
> like suggested. This report makes me a little nervous about my change. 
> 
> Also i'm currently only monitoring latency for filestore osds. Which 
> exact values out of the daemon do you use for bluestore? 
> 
> I would like to check if i see the same behaviour. 
> 
> Greets, 
> Stefan 
> 
>> 
>> - Mail original - 
>> De: "Sage Weil"  
>> À: "aderumier"  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Envoyé: Vendredi 25 Janvier 2019 10:49:02 
>> Objet: Re: ceph osd commit latency increase over time, until restart 
>> 
>> Can you capture a perf top or perf record to see where teh CPU time is 
>> going on one of the OSDs wth a high latency? 
>> 
>> Thanks! 
>> sage 
>> 
>> 
>>

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-30 Thread Alexandre DERUMIER
>>If it does, probably only by accident. :) The autotuner in master is 
>>pretty dumb and mostly just grows/shrinks the caches based on the 
>>default ratios but accounts for the memory needed for rocksdb 
>>indexes/filters. It will try to keep the total OSD memory consumption 
>>below the specified limit. It doesn't do anything smart like monitor 
>>whether or not large caches may introduce more latency than small 
>>caches. It actually adds a small amount of additional overhead in the 
>>mempool thread to perform the calculations. If you had a static 
>>workload and tuned the bluestore cache size and ratios perfectly it 
>>would only add extra (albeit fairly minimal with the default settings) 
>>computational cost.

OK, thanks for the explanation!



>>If perf isn't showing anything conclusive, you might try my wallclock 
>>profiler: http://github.com/markhpc/gdbpmp 

I'll try, thanks


>>Some other things to watch out for are CPUs switching C states 

For the CPU, C-states are disabled and the CPU always runs at max frequency
(intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1).


and the 
>>effect of having transparent huge pages enabled (though I'd be more 
>>concerned about this in terms of memory usage). 

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never


(Also, the server has only 1 socket, so no NUMA problem.)
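
A quick way to double-check those settings at runtime (cpupower and numactl come from 
separate packages, so availability varies; the commands are only a sketch):

cpupower idle-info | head                      # confirm C-states are limited/disabled
cpupower frequency-info | grep 'current CPU frequency'
cat /sys/kernel/mm/transparent_hugepage/enabled
numactl --hardware | head -3                   # confirm a single NUMA node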

- Mail original -
De: "Mark Nelson" 
À: "ceph-users" 
Envoyé: Mercredi 30 Janvier 2019 18:08:08
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

On 1/30/19 7:45 AM, Alexandre DERUMIER wrote: 
>>> I don't see any smoking gun here... :/ 
> I need to test to compare when latency are going very high, but I need to 
> wait more days/weeks. 
> 
> 
>>> The main difference between a warm OSD and a cold one is that on startup 
>>> the bluestore cache is empty. You might try setting the bluestore cache 
>>> size to something much smaller and see if that has an effect on the CPU 
>>> utilization? 
> I will try to test. I also wonder if the new auto memory tuning from Mark 
> could help too ? 
> (I'm still on mimic 13.2.1, planning to update to 13.2.5 next month) 
> 
> also, could check some bluestore related counters ? (onodes, 
> rocksdb,bluestore cache) 


If it does, probably only by accident. :) The autotuner in master is 
pretty dumb and mostly just grows/shrinks the caches based on the 
default ratios but accounts for the memory needed for rocksdb 
indexes/filters. It will try to keep the total OSD memory consumption 
below the specified limit. It doesn't do anything smart like monitor 
whether or not large caches may introduce more latency than small 
caches. It actually adds a small amount of additional overhead in the 
mempool thread to perform the calculations. If you had a static 
workload and tuned the bluestore cache size and ratios perfectly it 
would only add extra (albeit fairly minimal with the default settings) 
computational cost. 


If perf isn't showing anything conclusive, you might try my wallclock 
profiler: http://github.com/markhpc/gdbpmp 


Some other things to watch out for are CPUs switching C states and the 
effect of having transparent huge pages enabled (though I'd be more 
concerned about this in terms of memory usage). 


Mark 


> 
>>> Note that this doesn't necessarily mean that's what you want. Maybe the 
>>> reason why the CPU utilization is higher is because the cache is warm and 
>>> the OSD is serving more requests per second... 
> Well, currently, the server is really quiet 
> 
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await 
> w_await svctm %util 
> nvme0n1 2,00 515,00 48,00 1182,00 304,00 11216,00 18,73 0,01 0,00 0,00 0,00 
> 0,01 1,20 
> 
> %Cpu(s): 1,5 us, 1,0 sy, 0,0 ni, 97,2 id, 0,2 wa, 0,0 hi, 0,1 si, 0,0 st 
> 
> And this is only with writes, not reads 
> 
> 
> 
> ----- Mail original - 
> De: "Sage Weil"  
> À: "aderumier"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Mercredi 30 Janvier 2019 14:33:23 
> Objet: Re: ceph osd commit latency increase over time, until restart 
> 
> On Wed, 30 Jan 2019, Alexandre DERUMIER wrote: 
>> Hi, 
>> 
>> here some new results, 
>> different osd/ different cluster 
>> 
>> before osd restart latency was between 2-5ms 
>> after osd restart is around 1-1.5ms 
>> 
>> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
>> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
>> http://odisoweb1.odiso.net/cephperf2/diff.txt 
> I don't see any smoking gun here... :/ 
> 
> The main difference between a warm OSD

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-30 Thread Alexandre DERUMIER
Hi Stefan,

>>currently i'm in the process of switching back from jemalloc to tcmalloc 
>>like suggested. This report makes me a little nervous about my change. 
Well, I'm really not sure that it's a tcmalloc bug;
maybe it's bluestore-related (I don't have filestore anymore to compare).
I need to compare with bigger latencies.

Here is an example: all OSDs at 20-50ms before restart, then 1ms after restart (at 
21:15):
http://odisoweb1.odiso.net/latencybad.png

I observe the latency in my guest VMs too, as disk iowait.

http://odisoweb1.odiso.net/latencybadvm.png

>>Also i'm currently only monitoring latency for filestore osds. Which
>>exact values out of the daemon do you use for bluestore?

Here are my InfluxDB queries:

They take op_latency.sum / op_latency.avgcount over the last second.


SELECT non_negative_derivative(first("op_latency.sum"), 
1s)/non_negative_derivative(first("op_latency.avgcount"),1s)   FROM "ceph" 
WHERE "host" =~  /^([[host]])$/  AND "id" =~ /^([[osd]])$/ AND $timeFilter 
GROUP BY time($interval), "host", "id" fill(previous)


SELECT non_negative_derivative(first("op_w_latency.sum"), 
1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s)   FROM "ceph" 
WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ 
/^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
fill(previous)


SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s)   FROM 
"ceph" WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ 
/^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
fill(previous)
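
The same ratio can also be spot-checked directly on the admin socket, without going 
through InfluxDB (osd.0 is an example; the counters are assumed to sit under the 
"osd" section of perf dump, and jq is assumed to be available). Note these are 
lifetime averages, whereas the queries above derive the per-interval value:

ceph daemon osd.0 perf dump | jq '.osd.op_w_latency.avgtime'
ceph daemon osd.0 perf dump | jq '.osd.op_w_process_latency.avgtime'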





- Mail original -
De: "Stefan Priebe, Profihost AG" 
À: "aderumier" , "Sage Weil" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Mercredi 30 Janvier 2019 08:45:33
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi, 

Am 30.01.19 um 08:33 schrieb Alexandre DERUMIER: 
> Hi, 
> 
> here some new results, 
> different osd/ different cluster 
> 
> before osd restart latency was between 2-5ms 
> after osd restart is around 1-1.5ms 
> 
> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
> http://odisoweb1.odiso.net/cephperf2/diff.txt 
> 
> From what I see in diff, the biggest difference is in tcmalloc, but maybe I'm 
> wrong. 
> (I'm using tcmalloc 2.5-2.2) 

currently i'm in the process of switching back from jemalloc to tcmalloc 
like suggested. This report makes me a little nervous about my change. 

Also i'm currently only monitoring latency for filestore osds. Which 
exact values out of the daemon do you use for bluestore? 

I would like to check if i see the same behaviour. 

Greets, 
Stefan 

> 
> - Mail original - 
> De: "Sage Weil"  
> À: "aderumier"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Vendredi 25 Janvier 2019 10:49:02 
> Objet: Re: ceph osd commit latency increase over time, until restart 
> 
> Can you capture a perf top or perf record to see where teh CPU time is 
> going on one of the OSDs wth a high latency? 
> 
> Thanks! 
> sage 
> 
> 
> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
> 
>> 
>> Hi, 
>> 
>> I have a strange behaviour of my osd, on multiple clusters, 
>> 
>> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
>> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd 
>> export-diff/snapshotdelete each day for backup 
>> 
>> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
>> 
>> But overtime, this latency increase slowly (maybe around 1ms by day), until 
>> reaching crazy 
>> values like 20-200ms. 
>> 
>> Some example graphs: 
>> 
>> http://odisoweb1.odiso.net/osdlatency1.png 
>> http://odisoweb1.odiso.net/osdlatency2.png 
>> 
>> All osds have this behaviour, in all clusters. 
>> 
>> The latency of physical disks is ok. (Clusters are far to be full loaded) 
>> 
>> And if I restart the osd, the latency come back to 0,5-1ms. 
>> 
>> That's remember me old tcmalloc bug, but maybe could it be a bluestore 
>> memory bug ? 
>> 
>> Any Hints for counters/logs to check ? 
>> 
>> 
>> Regards, 
>> 
>> Alexandre 
>> 
>> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-30 Thread Alexandre DERUMIER
>>I don't see any smoking gun here... :/ 

I need to run a comparison when latency gets very high, but I need to wait 
more days/weeks.


>>The main difference between a warm OSD and a cold one is that on startup 
>>the bluestore cache is empty. You might try setting the bluestore cache 
>>size to something much smaller and see if that has an effect on the CPU 
>>utilization? 

I will try to test it. I also wonder if the new auto memory tuning from Mark could 
help too?
(I'm still on Mimic 13.2.1, planning to update to 13.2.5 next month.)

Also, should I check some bluestore-related counters? (onodes, rocksdb, bluestore 
cache)
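
A sketch of that cache test and of pulling the related counters (osd.0, the 1 GiB 
value and the counter names are examples to verify; the cache size only takes effect 
after an OSD restart):

ceph config set osd.0 bluestore_cache_size_ssd 1073741824
systemctl restart ceph-osd@0
ceph daemon osd.0 perf dump | jq '.bluestore.bluestore_onodes, .bluestore.bluestore_buffer_bytes'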

>>Note that this doesn't necessarily mean that's what you want. Maybe the 
>>reason why the CPU utilization is higher is because the cache is warm and 
>>the OSD is serving more requests per second... 

Well, currently the server is really quiet:

Device:   rrqm/s  wrqm/s    r/s      w/s   rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
nvme0n1     2,00  515,00  48,00  1182,00  304,00  11216,00     18,73      0,01   0,00     0,00     0,00   0,01   1,20

%Cpu(s):  1,5 us,  1,0 sy,  0,0 ni, 97,2 id,  0,2 wa,  0,0 hi,  0,1 si,  0,0 st

And this is only with writes, not reads



- Mail original -
De: "Sage Weil" 
À: "aderumier" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Mercredi 30 Janvier 2019 14:33:23
Objet: Re: ceph osd commit latency increase over time, until restart

On Wed, 30 Jan 2019, Alexandre DERUMIER wrote: 
> Hi, 
> 
> here some new results, 
> different osd/ different cluster 
> 
> before osd restart latency was between 2-5ms 
> after osd restart is around 1-1.5ms 
> 
> http://odisoweb1.odiso.net/cephperf2/bad.txt (2-5ms) 
> http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms) 
> http://odisoweb1.odiso.net/cephperf2/diff.txt 

I don't see any smoking gun here... :/ 

The main difference between a warm OSD and a cold one is that on startup 
the bluestore cache is empty. You might try setting the bluestore cache 
size to something much smaller and see if that has an effect on the CPU 
utilization? 

Note that this doesn't necessarily mean that's what you want. Maybe the 
reason why the CPU utilization is higher is because the cache is warm and 
the OSD is serving more requests per second... 

sage 



> 
> >From what I see in diff, the biggest difference is in tcmalloc, but maybe 
> >I'm wrong. 
> 
> (I'm using tcmalloc 2.5-2.2) 
> 
> 
> - Mail original - 
> De: "Sage Weil"  
> À: "aderumier"  
> Cc: "ceph-users" , "ceph-devel" 
>  
> Envoyé: Vendredi 25 Janvier 2019 10:49:02 
> Objet: Re: ceph osd commit latency increase over time, until restart 
> 
> Can you capture a perf top or perf record to see where teh CPU time is 
> going on one of the OSDs wth a high latency? 
> 
> Thanks! 
> sage 
> 
> 
> On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 
> 
> > 
> > Hi, 
> > 
> > I have a strange behaviour of my osd, on multiple clusters, 
> > 
> > All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
> > workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd 
> > export-diff/snapshotdelete each day for backup 
> > 
> > When the osd are refreshly started, the commit latency is between 0,5-1ms. 
> > 
> > But overtime, this latency increase slowly (maybe around 1ms by day), until 
> > reaching crazy 
> > values like 20-200ms. 
> > 
> > Some example graphs: 
> > 
> > http://odisoweb1.odiso.net/osdlatency1.png 
> > http://odisoweb1.odiso.net/osdlatency2.png 
> > 
> > All osds have this behaviour, in all clusters. 
> > 
> > The latency of physical disks is ok. (Clusters are far to be full loaded) 
> > 
> > And if I restart the osd, the latency come back to 0,5-1ms. 
> > 
> > That's remember me old tcmalloc bug, but maybe could it be a bluestore 
> > memory bug ? 
> > 
> > Any Hints for counters/logs to check ? 
> > 
> > 
> > Regards, 
> > 
> > Alexandre 
> > 
> > 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-29 Thread Alexandre DERUMIER
Hi,

here are some new results, from a different osd / different cluster:

before the osd restart, latency was between 2-5ms;
after the osd restart, it is around 1-1.5ms.

http://odisoweb1.odiso.net/cephperf2/bad.txt  (2-5ms)
http://odisoweb1.odiso.net/cephperf2/ok.txt (1-1.5ms)
http://odisoweb1.odiso.net/cephperf2/diff.txt


From what I see in the diff, the biggest difference is in tcmalloc, but maybe I'm 
wrong.

(I'm using tcmalloc 2.5-2.2)


- Mail original -
De: "Sage Weil" 
À: "aderumier" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 25 Janvier 2019 10:49:02
Objet: Re: ceph osd commit latency increase over time, until restart

Can you capture a perf top or perf record to see where teh CPU time is 
going on one of the OSDs wth a high latency? 

Thanks! 
sage 


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 

> 
> Hi, 
> 
> I have a strange behaviour of my osd, on multiple clusters, 
> 
> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd 
> export-diff/snapshotdelete each day for backup 
> 
> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
> 
> But overtime, this latency increase slowly (maybe around 1ms by day), until 
> reaching crazy 
> values like 20-200ms. 
> 
> Some example graphs: 
> 
> http://odisoweb1.odiso.net/osdlatency1.png 
> http://odisoweb1.odiso.net/osdlatency2.png 
> 
> All osds have this behaviour, in all clusters. 
> 
> The latency of physical disks is ok. (Clusters are far to be full loaded) 
> 
> And if I restart the osd, the latency come back to 0,5-1ms. 
> 
> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory 
> bug ? 
> 
> Any Hints for counters/logs to check ? 
> 
> 
> Regards, 
> 
> Alexandre 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-27 Thread Alexandre DERUMIER
Hi,

Currently I'm using telegraf + influxdb for monitoring.


Note that this bug seems to occur only on writes; I don't see the latency 
increase on reads.

The counters are op_latency, op_w_latency and op_w_process_latency.



SELECT non_negative_derivative(first("op_latency.sum"), 
1s)/non_negative_derivative(first("op_latency.avgcount"),1s)   FROM "ceph" 
WHERE "host" =~  /^([[host]])$/  AND "id" =~ /^([[osd]])$/ AND $timeFilter 
GROUP BY time($interval), "host", "id" fill(previous)


SELECT non_negative_derivative(first("op_w_latency.sum"), 
1s)/non_negative_derivative(first("op_w_latency.avgcount"),1s)   FROM "ceph" 
WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ 
/^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
fill(previous)



SELECT non_negative_derivative(first("op_w_process_latency.sum"), 
1s)/non_negative_derivative(first("op_w_process_latency.avgcount"),1s)   FROM 
"ceph" WHERE "host" =~ /^([[host]])$/  AND collection='osd'  AND  "id" =~ 
/^([[osd]])$/ AND $timeFilter GROUP BY time($interval), "host", "id" 
fill(previous)


dashboard is here:

https://grafana.com/dashboards/7995
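
If you don't have the grafana stack at hand, the same counters can be read straight 
from an osd admin socket (they are avgcount/sum pairs; the mean is sum divided by 
avgcount) -- a rough sketch, with osd.0 as an example id:

ceph daemon osd.0 perf dump | grep -A4 -E '"op_latency"|"op_w_latency"|"op_w_process_latency"'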





- Mail original -
De: "Marc Roos" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Dimanche 27 Janvier 2019 12:11:42
Objet: RE: [ceph-users] ceph osd commit latency increase over time, until 
restart



Hi Alexandre, 

I was curious if I had a similar issue, what value are you monitoring? I 
have quite a lot to choose from. 


Bluestore.commitLat 
Bluestore.kvLat 
Bluestore.readLat 
Bluestore.readOnodeMetaLat 
Bluestore.readWaitAioLat 
Bluestore.stateAioWaitLat 
Bluestore.stateDoneLat 
Bluestore.stateIoDoneLat 
Bluestore.submitLat 
Bluestore.throttleLat 
Osd.opBeforeDequeueOpLat 
Osd.opRProcessLatency 
Osd.opWProcessLatency 
Osd.subopLatency 
Osd.subopWLatency 
Rocksdb.getLatency 
Rocksdb.submitLatency 
Rocksdb.submitSyncLatency 
RecoverystatePerf.repnotrecoveringLatency 
RecoverystatePerf.waitupthruLatency 
Osd.opRwPrepareLatency 
RecoverystatePerf.primaryLatency 
RecoverystatePerf.replicaactiveLatency 
RecoverystatePerf.startedLatency 
RecoverystatePerf.getlogLatency 
RecoverystatePerf.initialLatency 
RecoverystatePerf.recoveringLatency 
ThrottleBluestoreThrottleBytes.wait 
RecoverystatePerf.waitremoterecoveryreservedLatency 



-Original Message- 
From: Alexandre DERUMIER [mailto:aderum...@odiso.com] 
Sent: vrijdag 25 januari 2019 17:40 
To: Sage Weil 
Cc: ceph-users; ceph-devel 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart 

also, here the result of "perf diff 1mslatency.perfdata 
3mslatency.perfdata" 

http://odisoweb1.odiso.net/perf_diff_ok_vs_bad.txt 




- Mail original - 
De: "aderumier"  
À: "Sage Weil"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 17:32:02 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, 
until restart 

Hi again, 

I was able to perf it today, 

before restart, commit latency was between 3-5ms 

after restart at 17:11, latency is around 1ms 

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png 


here some perf reports: 

with 3ms latency: 
- 
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt 
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt 


with 1ms latency 
- 
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt 
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt 



I'll retry next week, trying to have bigger latency difference. 

Alexandre 

- Mail original - 
De: "aderumier"  
À: "Sage Weil"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 11:06:51 
Objet: Re: ceph osd commit latency increase over time, until restart 

>>Can you capture a perf top or perf record to see where teh CPU time is 

>>going on one of the OSDs wth a high latency? 

Yes, sure. I'll do it next week and send result to the mailing list. 

Thanks Sage ! 

- Mail original - 
De: "Sage Weil"  
À: "aderumier"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 10:49:02 
Objet: Re: ceph osd commit latency increase over time, until restart 

Can you capture a perf top or perf record to see where teh CPU time is 
going on one of the OSDs wth a high latency? 

Thanks! 
sage 


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 

> 
> Hi, 
> 
> I have a strange behaviour of my osd, on multiple clusters, 
> 
> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme 
> drivers, workload is rbd only, with qemu-kvm vms running with librbd + 

>

Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-25 Thread Alexandre DERUMIER
Also, here is the result of "perf diff 1mslatency.perfdata  3mslatency.perfdata":

http://odisoweb1.odiso.net/perf_diff_ok_vs_bad.txt




- Mail original -
De: "aderumier" 
À: "Sage Weil" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 25 Janvier 2019 17:32:02
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi again, 

I was able to perf it today, 

before restart, commit latency was between 3-5ms 

after restart at 17:11, latency is around 1ms 

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png 


here some perf reports: 

with 3ms latency: 
- 
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt 
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt 


with 1ms latency 
- 
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt 
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt 



I'll retry next week, trying to have bigger latency difference. 

Alexandre 

- Mail original - 
De: "aderumier"  
À: "Sage Weil"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 11:06:51 
Objet: Re: ceph osd commit latency increase over time, until restart 

>>Can you capture a perf top or perf record to see where teh CPU time is 
>>going on one of the OSDs wth a high latency? 

Yes, sure. I'll do it next week and send result to the mailing list. 

Thanks Sage ! 

- Mail original - 
De: "Sage Weil"  
À: "aderumier"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 10:49:02 
Objet: Re: ceph osd commit latency increase over time, until restart 

Can you capture a perf top or perf record to see where teh CPU time is 
going on one of the OSDs wth a high latency? 

Thanks! 
sage 


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 

> 
> Hi, 
> 
> I have a strange behaviour of my osd, on multiple clusters, 
> 
> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd 
> export-diff/snapshotdelete each day for backup 
> 
> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
> 
> But overtime, this latency increase slowly (maybe around 1ms by day), until 
> reaching crazy 
> values like 20-200ms. 
> 
> Some example graphs: 
> 
> http://odisoweb1.odiso.net/osdlatency1.png 
> http://odisoweb1.odiso.net/osdlatency2.png 
> 
> All osds have this behaviour, in all clusters. 
> 
> The latency of physical disks is ok. (Clusters are far to be full loaded) 
> 
> And if I restart the osd, the latency come back to 0,5-1ms. 
> 
> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory 
> bug ? 
> 
> Any Hints for counters/logs to check ? 
> 
> 
> Regards, 
> 
> Alexandre 
> 
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-25 Thread Alexandre DERUMIER
Hi again,

I was able to run perf on it today. 

Before the restart, commit latency was between 3-5ms; 

after the restart at 17:11, latency is around 1ms. 

http://odisoweb1.odiso.net/osd3_latency_3ms_vs_1ms.png


Here are some perf reports: 

with 3ms latency:
-
perf report by caller: http://odisoweb1.odiso.net/bad-caller.txt
perf report by callee: http://odisoweb1.odiso.net/bad-callee.txt


with 1ms latency
-
perf report by caller: http://odisoweb1.odiso.net/ok-caller.txt
perf report by callee: http://odisoweb1.odiso.net/ok-callee.txt



I'll retry next week, trying to get a bigger latency difference. 

Alexandre

- Mail original -
De: "aderumier" 
À: "Sage Weil" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 25 Janvier 2019 11:06:51
Objet: Re: ceph osd commit latency increase over time, until restart

>>Can you capture a perf top or perf record to see where teh CPU time is 
>>going on one of the OSDs wth a high latency? 

Yes, sure. I'll do it next week and send result to the mailing list. 

Thanks Sage ! 

- Mail original - 
De: "Sage Weil"  
À: "aderumier"  
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Vendredi 25 Janvier 2019 10:49:02 
Objet: Re: ceph osd commit latency increase over time, until restart 

Can you capture a perf top or perf record to see where teh CPU time is 
going on one of the OSDs wth a high latency? 

Thanks! 
sage 


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 

> 
> Hi, 
> 
> I have a strange behaviour of my osd, on multiple clusters, 
> 
> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd 
> export-diff/snapshotdelete each day for backup 
> 
> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
> 
> But overtime, this latency increase slowly (maybe around 1ms by day), until 
> reaching crazy 
> values like 20-200ms. 
> 
> Some example graphs: 
> 
> http://odisoweb1.odiso.net/osdlatency1.png 
> http://odisoweb1.odiso.net/osdlatency2.png 
> 
> All osds have this behaviour, in all clusters. 
> 
> The latency of physical disks is ok. (Clusters are far to be full loaded) 
> 
> And if I restart the osd, the latency come back to 0,5-1ms. 
> 
> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory 
> bug ? 
> 
> Any Hints for counters/logs to check ? 
> 
> 
> Regards, 
> 
> Alexandre 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-01-25 Thread Alexandre DERUMIER
>>Can you capture a perf top or perf record to see where teh CPU time is 
>>going on one of the OSDs wth a high latency?

Yes, sure. I'll do it next week and send the results to the mailing list.

Thanks Sage !
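
For the record, a minimal sketch of what I plan to run on one slow osd (osd.3 is 
just an example id; adjust the pgrep pattern to your setup):

pid=$(pgrep -f 'ceph-osd.*--id 3 ')           # pid of the osd.3 daemon
perf top -p "$pid"                            # live view of where the cpu time goes
perf record -g -p "$pid" -- sleep 60          # 60s capture with call graphs
perf report --stdio > perf-osd3-report.txt    # report to share on the list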
 
- Mail original -
De: "Sage Weil" 
À: "aderumier" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Vendredi 25 Janvier 2019 10:49:02
Objet: Re: ceph osd commit latency increase over time, until restart

Can you capture a perf top or perf record to see where teh CPU time is 
going on one of the OSDs wth a high latency? 

Thanks! 
sage 


On Fri, 25 Jan 2019, Alexandre DERUMIER wrote: 

> 
> Hi, 
> 
> I have a strange behaviour of my osd, on multiple clusters, 
> 
> All cluster are running mimic 13.2.1,bluestore, with ssd or nvme drivers, 
> workload is rbd only, with qemu-kvm vms running with librbd + snapshot/rbd 
> export-diff/snapshotdelete each day for backup 
> 
> When the osd are refreshly started, the commit latency is between 0,5-1ms. 
> 
> But overtime, this latency increase slowly (maybe around 1ms by day), until 
> reaching crazy 
> values like 20-200ms. 
> 
> Some example graphs: 
> 
> http://odisoweb1.odiso.net/osdlatency1.png 
> http://odisoweb1.odiso.net/osdlatency2.png 
> 
> All osds have this behaviour, in all clusters. 
> 
> The latency of physical disks is ok. (Clusters are far to be full loaded) 
> 
> And if I restart the osd, the latency come back to 0,5-1ms. 
> 
> That's remember me old tcmalloc bug, but maybe could it be a bluestore memory 
> bug ? 
> 
> Any Hints for counters/logs to check ? 
> 
> 
> Regards, 
> 
> Alexandre 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd commit latency increase over time, until restart

2019-01-25 Thread Alexandre DERUMIER


Hi, 

I'm seeing a strange behaviour of my osds, on multiple clusters. 

All clusters are running mimic 13.2.1 with bluestore, on ssd or nvme drives; 
the workload is rbd only, with qemu-kvm vms running librbd plus a daily 
snapshot / rbd export-diff / snapshot delete cycle for backup. 

When the osds are freshly started, the commit latency is between 0.5-1ms. 
But over time this latency increases slowly (maybe around 1ms per day), until 
it reaches crazy values like 20-200ms. 

Some example graphs:

http://odisoweb1.odiso.net/osdlatency1.png
http://odisoweb1.odiso.net/osdlatency2.png

All osds have this behaviour, in all clusters. 

The latency of the physical disks is ok. (The clusters are far from fully loaded.) 

And if I restart the osd, the latency comes back to 0.5-1ms. 

That reminds me of the old tcmalloc bug, but could it maybe be a bluestore memory 
bug? 

Any hints on counters/logs to check? 


Regards, 

Alexandre 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS performance issue

2019-01-21 Thread Alexandre DERUMIER
>>How can you see that the cache is filling up and you need to execute 
>>"echo 2 > /proc/sys/vm/drop_caches"? 

You can monitor the number of ceph dentry entries in slabinfo.


Here is a small script I'm running from cron:



#!/bin/bash
if pidof -o %PPID -x "dropcephinodecache.sh">/dev/null; then
    echo "Process already running"
    exit 1;
fi

value=`cat /proc/slabinfo |grep 'ceph_dentry_info\|fuse_inode'|awk '/1/ {print $2}'|head -1`

if [ "$value" -gt 50 ];then
    echo "Flush inode cache"
    echo 2 > /proc/sys/vm/drop_caches
fi



- Mail original -
De: "Marc Roos" 
À: "transuranium.yue" , "Zheng Yan" 

Cc: "ceph-users" 
Envoyé: Lundi 21 Janvier 2019 15:53:17
Objet: Re: [ceph-users] MDS performance issue

How can you see that the cache is filling up and you need to execute 
"echo 2 > /proc/sys/vm/drop_caches"? 



-Original Message- 
From: Yan, Zheng [mailto:uker...@gmail.com] 
Sent: 21 January 2019 15:50 
To: Albert Yue 
Cc: ceph-users 
Subject: Re: [ceph-users] MDS performance issue 

On Mon, Jan 21, 2019 at 11:16 AM Albert Yue  
wrote: 
> 
> Dear Ceph Users, 
> 
> We have set up a cephFS cluster with 6 osd machines, each with 16 8TB 
harddisk. Ceph version is luminous 12.2.5. We created one data pool with 
these hard disks and created another meta data pool with 3 ssd. We 
created a MDS with 65GB cache size. 
> 
> But our users are keep complaining that cephFS is too slow. What we 
observed is cephFS is fast when we switch to a new MDS instance, once 
the cache fills up (which will happen very fast), client became very 
slow when performing some basic filesystem operation such as `ls`. 
> 

It seems that clients hold lots of unused inodes their icache, which 
prevent mds from trimming corresponding objects from its cache. mimic 
has command "ceph daemon mds.x cache drop" to ask client to drop its 
cache. I'm also working on a patch that make kclient client release 
unused inodes. 

For luminous, there is not much we can do, except periodically run 
"echo 2 > /proc/sys/vm/drop_caches" on each client. 


> What we know is our user are putting lots of small files into the 
cephFS, now there are around 560 Million files. We didn't see high CPU 
wait on MDS instance and meta data pool just used around 200MB space. 
> 
> My question is, what is the relationship between the metadata pool and 
MDS? Is this performance issue caused by the hardware behind meta data 
pool? Why the meta data pool only used 200MB space, and we saw 3k iops 
on each of these three ssds, why can't MDS cache all these 200MB into 
memory? 
> 
> Thanks very much! 
> 
> 
> Best Regards, 
> 
> Albert 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS - Small file - single thread - read performance.

2019-01-18 Thread Alexandre DERUMIER
Hi,
I don't see such big latencies:

# time cat 50bytesfile > /dev/null

real0m0,002s
user0m0,001s
sys 0m0,000s


(This is on a ceph ssd cluster (mimic), with the kernel cephfs client (4.18) and a 
10Gb network with low latency too; client/server have 3GHz cpus.)
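
For anyone wanting to reproduce the comparison, a small hedged sketch (the mount 
path and file name are examples; dropping caches forces the client to re-acquire 
the caps from the mds on the next read):

sync; echo 3 > /proc/sys/vm/drop_caches         # drop page cache + dentries/inodes on the client
time cat /mnt/cephfs/50bytesfile > /dev/null    # cold read: includes the mds round trip
time cat /mnt/cephfs/50bytesfile > /dev/null    # warm read: served from the local page cache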



- Mail original -
De: "Burkhard Linke" 
À: "ceph-users" 
Envoyé: Vendredi 18 Janvier 2019 15:29:45
Objet: Re: [ceph-users] CephFS - Small file - single thread - read performance.

Hi, 

On 1/18/19 3:11 PM, jes...@krogh.cc wrote: 
> Hi. 
> 
> We have the intention of using CephFS for some of our shares, which we'd 
> like to spool to tape as a part normal backup schedule. CephFS works nice 
> for large files but for "small" .. < 0.1MB .. there seem to be a 
> "overhead" on 20-40ms per file. I tested like this: 
> 
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null 
> 
> real 0m0.034s 
> user 0m0.001s 
> sys 0m0.000s 
> 
> And from local page-cache right after. 
> root@abe:/nfs/home/jk# time cat /ceph/cluster/rsyncbackups/13kbfile > 
> /dev/null 
> 
> real 0m0.002s 
> user 0m0.002s 
> sys 0m0.000s 
> 
> Giving a ~20ms overhead in a single file. 
> 
> This is about x3 higher than on our local filesystems (xfs) based on 
> same spindles. 
> 
> CephFS metadata is on SSD - everything else on big-slow HDD's (in both 
> cases). 
> 
> Is this what everyone else see? 


Each file access on client side requires the acquisition of a 
corresponding locking entity ('file capability') from the MDS. This adds 
an extra network round trip to the MDS. In the worst case the MDS needs 
to request a capability release from another client which still holds 
the cap (e.g. file is still in page cache), adding another extra network 
round trip. 


CephFS is not NFS, and has a strong consistency model. This comes at a 
price. 


Regards, 

Burkhard 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Encryption questions

2019-01-10 Thread Alexandre DERUMIER
>>1) Are RBD connections encrypted or is there an option to use encryption 
>>between clients and Ceph? From reading the documentation, I have the 
>>impression that the only option to guarantee encryption in >>transit is to 
>>force clients to encrypt volumes via dmcrypt. Is there another option? I know 
>>I could encrypt the OSDs but that's not going to solve the problem of 
>>encryption in transit.

Not directly related to ceph, but if you use qemu, there is a LUKS driver in qemu, 
so you can encrypt from the qemu process down to the storage:
https://people.redhat.com/berrange/kvm-forum-2016/kvm-forum-2016-security.pdf
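
A rough sketch of what that looks like (pool/image names, the secret file and the 
device options are placeholders -- check the qemu documentation for your version):

# create a luks-formatted image directly on rbd
qemu-img create --object secret,id=sec0,file=/root/disk.key \
    -f luks -o key-secret=sec0 rbd:pool/vm-disk 50G

# attach it to a guest; qemu decrypts/encrypts before anything hits the cluster
qemu-system-x86_64 ... \
    --object secret,id=sec0,file=/root/disk.key \
    -drive file=rbd:pool/vm-disk,format=luks,key-secret=sec0,if=virtio,cache=none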




- Mail original -
De: "Sergio A. de Carvalho Jr." 
À: "ceph-users" 
Envoyé: Jeudi 10 Janvier 2019 19:59:06
Objet: [ceph-users] Encryption questions

Hi everyone, I have some questions about encryption in Ceph. 
1) Are RBD connections encrypted or is there an option to use encryption 
between clients and Ceph? From reading the documentation, I have the impression 
that the only option to guarantee encryption in transit is to force clients to 
encrypt volumes via dmcrypt. Is there another option? I know I could encrypt 
the OSDs but that's not going to solve the problem of encryption in transit. 

2) I'm also struggling to understand if communication between Ceph daemons 
(monitors and OSDs) are encrypted or not. I came across a few references about 
msgr2 but I couldn't tell if it is already implemented. Can anyone confirm 
this? 

I'm currently starting a new project using Ceph Mimic but if there's something 
new in this space expected for Nautilus, it would be good to know as well. 

Regards, 

Sergio 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v13.2.4 Mimic released

2019-01-07 Thread Alexandre DERUMIER
Hi,

>>* Ceph v13.2.2 includes a wrong backport, which may cause mds to go into 
>>'damaged' state when upgrading Ceph cluster from previous version. 
>>The bug is fixed in v13.2.3. If you are already running v13.2.2, 
>>upgrading to v13.2.3 does not require special action. 

Is any special action needed when upgrading from 13.2.1?
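
In case it helps, a hedged sketch of the usual rolling procedure for a minor 
update (this is just the generic pattern, not an official statement about 
13.2.1 -> 13.2.4):

ceph osd set noout                       # avoid rebalancing while daemons restart
# on each mon/mgr host: install the new packages, then
systemctl restart ceph-mon.target
systemctl restart ceph-mgr.target
# on each osd host, one host at a time: install the new packages, then
systemctl restart ceph-osd.target
# mds / rgw hosts last, then
ceph osd unset noout
ceph versions                            # confirm every daemon reports the new version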



- Mail original -
De: "Abhishek Lekshmanan" 
À: "ceph-announce" , "ceph-users" 
, ceph-maintain...@lists.ceph.com, "ceph-devel" 

Envoyé: Lundi 7 Janvier 2019 11:37:05
Objet: v13.2.4 Mimic released

This is the fourth bugfix release of the Mimic v13.2.x long term stable 
release series. This release includes two security fixes atop of v13.2.3 
We recommend all users upgrade to this version. If you've already 
upgraded to v13.2.3, the same restrictions from v13.2.2->v13.2.3 apply 
here as well. 

Notable Changes 
--- 

* CVE-2018-16846: rgw: enforce bounds on max-keys/max-uploads/max-parts 
(issue#35994) 
* CVE-2018-14662: mon: limit caps allowed to access the config store 

Notable Changes in v13.2.3 
--- 

* The default memory utilization for the mons has been increased 
somewhat. Rocksdb now uses 512 MB of RAM by default, which should 
be sufficient for small to medium-sized clusters; large clusters 
should tune this up. Also, the `mon_osd_cache_size` has been 
increase from 10 OSDMaps to 500, which will translate to an 
additional 500 MB to 1 GB of RAM for large clusters, and much less 
for small clusters. 

* Ceph v13.2.2 includes a wrong backport, which may cause mds to go into 
'damaged' state when upgrading Ceph cluster from previous version. 
The bug is fixed in v13.2.3. If you are already running v13.2.2, 
upgrading to v13.2.3 does not require special action. 

* The bluestore_cache_* options are no longer needed. They are replaced 
by osd_memory_target, defaulting to 4GB. BlueStore will expand 
and contract its cache to attempt to stay within this 
limit. Users upgrading should note this is a higher default 
than the previous bluestore_cache_size of 1GB, so OSDs using 
BlueStore will use more memory by default. 
For more details, see the BlueStore docs. 
 

* This version contains an upgrade bug, http://tracker.ceph.com/issues/36686, 
due to which upgrading during recovery/backfill can cause OSDs to fail. This 
bug can be worked around, either by restarting all the OSDs after the upgrade, 
or by upgrading when all PGs are in "active+clean" state. If you have already 
successfully upgraded to 13.2.2, this issue should not impact you. Going 
forward, we are working on a clean upgrade path for this feature. 


For more details please refer to the release blog at 
https://ceph.com/releases/13-2-4-mimic-released/ 

Getting ceph 
* Git at git://github.com/ceph/ceph.git 
* Tarball at http://download.ceph.com/tarballs/ceph-13.2.4.tar.gz 
* For packages, see http://docs.ceph.com/docs/master/install/get-packages/ 
* Release git sha1: b10be4d44915a4d78a8e06aa31919e74927b142e 

-- 
Abhishek Lekshmanan 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, 
HRB 21284 (AG Nürnberg) 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs : rsync backup create cache pressure on clients, filling caps

2019-01-06 Thread Alexandre DERUMIER
>>Answers for all questions are no.

ok, thanks.

>> feature that limits the number of caps for individual client is on our todo 
>> list.

That's great !

As a workaround, I'm now running a small bash script from cron on the clients, to 
drop caches when the ceph_dentry_info slab gets too big. Since then, my mds seems 
to be pretty happy.

The only thing is that drop_caches during a big rsync is not fast enough (purging 
entries seems slower than adding new ones), so it can take hours, and memory still 
increases a little bit.
I'm not sure whether it's the same behaviour when the mds tries to revoke caps 
from the client.




dropcephinodecache.sh

#!/bin/bash
if pidof -o %PPID -x "dropcephinodecache.sh">/dev/null; then
    echo "Process already running"
    exit 1;
fi

value=`cat /proc/slabinfo |grep 'ceph_dentry_info\|fuse_inode'|awk '/1/ {print $2}'|head -1`

if [ "$value" -gt 50 ];then
    echo "Flush inode cache"
    echo 2 > /proc/sys/vm/drop_caches
fi
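
To see which client is actually holding the caps before flushing, I also check the 
mds side -- a small sketch (the mds name is from my setup, replace it with yours; 
field names may vary slightly per release):

ceph daemon mds.ceph4-2.odiso.net session ls | grep -E '"inst"|"num_caps"'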


- Mail original -
De: "Zheng Yan" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Lundi 7 Janvier 2019 06:51:14
Objet: Re: [ceph-users] cephfs : rsync backup create cache pressure on clients, 
filling caps

On Fri, Jan 4, 2019 at 11:40 AM Alexandre DERUMIER  wrote: 
> 
> Hi, 
> 
> I'm currently doing cephfs backup, through a dedicated clients mounting the 
> whole filesystem at root. 
> others clients are mounting part of the filesystem. (kernel cephfs clients) 
> 
> 
> I have around 22millions inodes, 
> 
> before backup, I have around 5M caps loaded by clients 
> 
> #ceph daemonperf mds.x.x 
> 
> ---mds --mds_cache--- ---mds_log -mds_mem- 
> --mds_server-- mds_ -objecter-- purg 
> req rlat fwd inos caps exi imi |stry recy recd|subm evts segs|ino dn |hcr hcs 
> hsr |sess|actv rd wr rdwr|purg| 
> 118 0 0 22M 5.3M 0 0 | 6 0 0 | 2 120k 130 | 22M 22M|118 0 0 |167 | 0 2 0 0 | 
> 0 
> 
> 
> 
> when backup is running, reading all the files, the caps are increasing to max 
> (and even a little bit more) 
> 
> # ceph daemonperf mds.x.x 
> ---mds --mds_cache--- ---mds_log -mds_mem- 
> --mds_server-- mds_ -objecter-- purg 
> req rlat fwd inos caps exi imi |stry recy recd|subm evts segs|ino dn |hcr hcs 
> hsr |sess|actv rd wr rdwr|purg| 
> 155 0 0 20M 22M 0 0 | 6 0 0 | 2 120k 129 | 20M 20M|155 0 0 |167 | 0 0 0 0 | 0 
> 
> then mds try recall caps to others clients, and I'm gettin some 
> 2019-01-04 01:13:11.173768 cluster [WRN] Health check failed: 1 clients 
> failing to respond to cache pressure (MDS_CLIENT_RECALL) 
> 2019-01-04 02:00:00.73 cluster [WRN] overall HEALTH_WARN 1 clients 
> failing to respond to cache pressure 
> 2019-01-04 03:00:00.69 cluster [WRN] overall HEALTH_WARN 1 clients 
> failing to respond to cache pressure 
> 
> 
> 
> Doing a simple 
> echo 2 | tee /proc/sys/vm/drop_caches 
> on backup server, is freeing caps again 
> 
> # ceph daemonperf x 
> ---mds --mds_cache--- ---mds_log -mds_mem- 
> --mds_server-- mds_ -objecter-- purg 
> req rlat fwd inos caps exi imi |stry recy recd|subm evts segs|ino dn |hcr hcs 
> hsr |sess|actv rd wr rdwr|purg| 
> 116 0 0 22M 4.8M 0 0 | 4 0 0 | 1 117k 131 | 22M 22M|116 1 0 |167 | 0 2 0 0 | 
> 0 
> 
> 
> 
> 
> Some questions here : 
> 
> ceph side 
> - 
> Is it possible to setup some kind of priority between clients, to force 
> retreive caps on a specific client ? 
> Is is possible to limit the number of caps for a client ? 
> 
> 
> client side 
> --- 
> I have tried to use vm.vfs_cache_pressure=4, to reclam inodes entries 
> more fast, but server have 128GB ram. 
> Is it possible to limit the number of inodes in cache on linux. 
> Is is possible to tune something on the ceph mount point ? 
> 

Answers for all questions are no. feature that limits the number of 
caps for individual client is on our todo list. 

> 
> Regards, 
> 
> Alexandre 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs : rsync backup create cache pressure on clients, filling caps

2019-01-03 Thread Alexandre DERUMIER
Hi,

I'm currently doing cephfs backups through a dedicated client that mounts the 
whole filesystem at root.
Other clients mount parts of the filesystem. (All are kernel cephfs clients.)


I have around 22 million inodes.

Before the backup, I have around 5M caps loaded by clients:

#ceph daemonperf mds.x.x

---mds --mds_cache--- ---mds_log -mds_mem- --mds_server-- mds_ -objecter-- purg
req  rlat fwd  inos caps exi  imi |stry recy recd|subm evts segs|ino  dn  |hcr  hcs  hsr |sess|actv rd   wr   rdwr|purg|
118    0    0   22M 5.3M   0    0 |   6    0    0 |   2 120k  130 | 22M  22M |118    0    0 |167 |   0    2    0    0 |   0



When the backup is running, reading all the files, the caps increase to the max 
(and even a little bit more):

# ceph daemonperf mds.x.x
---mds --mds_cache--- ---mds_log -mds_mem- --mds_server-- mds_ -objecter-- purg
req  rlat fwd  inos caps exi  imi |stry recy recd|subm evts segs|ino  dn  |hcr  hcs  hsr |sess|actv rd   wr   rdwr|purg|
155    0    0   20M  22M   0    0 |   6    0    0 |   2 120k  129 | 20M  20M |155    0    0 |167 |   0    0    0    0 |   0

Then the mds tries to recall caps from other clients, and I'm getting messages like:
2019-01-04 01:13:11.173768 cluster [WRN] Health check failed: 1 clients failing 
to respond to cache pressure (MDS_CLIENT_RECALL)
2019-01-04 02:00:00.73 cluster [WRN] overall HEALTH_WARN 1 clients failing 
to respond to cache pressure
2019-01-04 03:00:00.69 cluster [WRN] overall HEALTH_WARN 1 clients failing 
to respond to cache pressure



Doing a simple
echo 2 | tee /proc/sys/vm/drop_caches
on the backup server frees the caps again:

# ceph daemonperf x
---mds --mds_cache--- ---mds_log -mds_mem- --mds_server-- mds_ -objecter-- purg
req  rlat fwd  inos caps exi  imi |stry recy recd|subm evts segs|ino  dn  |hcr  hcs  hsr |sess|actv rd   wr   rdwr|purg|
116    0    0   22M 4.8M   0    0 |   4    0    0 |   1 117k  131 | 22M  22M |116    1    0 |167 |   0    2    0    0 |   0




Some questions here:

ceph side
-
Is it possible to set up some kind of priority between clients, to force 
retrieving caps from a specific client?
Is it possible to limit the number of caps for a client?


client side 
---
I have tried to use vm.vfs_cache_pressure=4 to reclaim inode entries faster, 
but the server has 128GB of ram.
Is it possible to limit the number of inodes in cache on linux?
Is it possible to tune something on the ceph mount point?


Regards,

Alexandre
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

2018-11-08 Thread Alexandre DERUMIER
>>If you're using kernel client for cephfs, I strongly advise to have the 
>>client on the same subnet as the ceph public one i.e all traffic should be on 
>>the same subnet/VLAN. Even if your firewall situation is good, if you >>have 
>>to cross subnets or VLANs, you will run into weird problems later. 

Thanks. 

Currently the client is in a different vlan for security (multiple different 
customers; I don't want a customer to have direct access to another customer or 
to ceph).
But, as they are vms, I can manage to put them in the same vlan and do the 
firewalling on the hypervisor (I'll need firewalling in all cases anyway).


>>Fuse has much better tolerance for that scenario. 

What's the difference ? 



- Mail original -
De: "Linh Vu" 
À: "aderumier" , "ceph-users" 
Envoyé: Vendredi 9 Novembre 2018 02:16:07
Objet: Re: cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con 
state OPEN)



If you're using kernel client for cephfs, I strongly advise to have the client 
on the same subnet as the ceph public one i.e all traffic should be on the same 
subnet/VLAN. Even if your firewall situation is good, if you have to cross 
subnets or VLANs, you will run into weird problems later. Fuse has much better 
tolerance for that scenario. 

From: ceph-users  on behalf of Alexandre 
DERUMIER  
Sent: Friday, 9 November 2018 12:06:43 PM 
To: ceph-users 
Subject: Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN) 
Ok, 
It seem to come from firewall, 
I'm seeing dropped session exactly 15min before the log. 

The sessions are the session to osd, session to mon && mds are ok. 


Seem that keeplive2 is used to monitor the mon session 
[ https://patchwork.kernel.org/patch/7105641/ | 
https://patchwork.kernel.org/patch/7105641/ ] 

but I'm not sure about osd sessions ? 

- Mail original - 
De: "aderumier"  
À: "ceph-users"  
Cc: "Alexandre Bruyelles"  
Envoyé: Vendredi 9 Novembre 2018 01:12:25 
Objet: Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN) 

To be more precise, 

the logs occurs when the hang is finished. 

I have looked at stats on 10 differents hang, and the duration is always around 
15 minutes. 

Maybe related to: 

ms tcp read timeout 
Description: If a client or daemon makes a request to another Ceph daemon and 
does not drop an unused connection, the ms tcp read timeout defines the 
connection as idle after the specified number of seconds. 
Type: Unsigned 64-bit Integer 
Required: No 
Default: 900 15 minutes. 

? 

Find a similar bug report with firewall too: 

[ http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html 
| http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html 
] 


- Mail original - 
De: "aderumier"  
À: "ceph-users"  
Envoyé: Jeudi 8 Novembre 2018 18:16:20 
Objet: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN) 

Hi, 

we are currently test cephfs with kernel module (4.17 and 4.18) instead fuse 
(worked fine), 

and we have hang, iowait jump like crazy for around 20min. 

client is a qemu 2.12 vm with virtio-net interface. 


Is the client logs, we are seeing this kind of logs: 

[jeu. nov. 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con 
state OPEN) 
[jeu. nov. 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state 
OPEN) 


and in osd logs: 

osd14: 
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> 
x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1) 

osd9: 
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> 
x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1) 


cluster is ceph 13.2.1 

Note that we have a physical firewall between client and server, I'm not sure 
yet if the session could be dropped. (I don't have find any logs in the 
firewall). 

Any idea ? I would like to known if it's a network bug, or ceph bug (not sure 
how to understand the osd logs) 

Regards, 

Alexandre 



client ceph.conf 
 
[client] 
fuse_disable_pagecache = true 
client_reconnect_stale = true 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
[ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
[ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 


Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

2018-11-08 Thread Alexandre DERUMIER
Ok,
it seems to come from the firewall:
I'm seeing dropped sessions exactly 15min before the log entry.

The dropped sessions are the sessions to the osds; the sessions to the mon && mds are ok.


It seems that keepalive2 is used to monitor the mon session:
https://patchwork.kernel.org/patch/7105641/

but I'm not sure about the osd sessions?
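
A hedged sketch of what I will check/try next (osd.9 is just an example id; the 
idea is to have ceph close idle sessions before the firewall silently drops them):

ceph daemon osd.9 config get ms_tcp_read_timeout   # default is 900, i.e. 15 minutes

# test change in ceph.conf (then restart the daemons), keeping the value below the
# firewall's idle-session timeout:
# [global]
# ms tcp read timeout = 300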

- Mail original -
De: "aderumier" 
À: "ceph-users" 
Cc: "Alexandre Bruyelles" 
Envoyé: Vendredi 9 Novembre 2018 01:12:25
Objet: Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN)

To be more precise, 

the logs occurs when the hang is finished. 

I have looked at stats on 10 differents hang, and the duration is always around 
15 minutes. 

Maybe related to: 

ms tcp read timeout 
Description: If a client or daemon makes a request to another Ceph daemon and 
does not drop an unused connection, the ms tcp read timeout defines the 
connection as idle after the specified number of seconds. 
Type: Unsigned 64-bit Integer 
Required: No 
Default: 900 15 minutes. 

? 

Find a similar bug report with firewall too: 

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html 


- Mail original - 
De: "aderumier"  
À: "ceph-users"  
Envoyé: Jeudi 8 Novembre 2018 18:16:20 
Objet: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN) 

Hi, 

we are currently test cephfs with kernel module (4.17 and 4.18) instead fuse 
(worked fine), 

and we have hang, iowait jump like crazy for around 20min. 

client is a qemu 2.12 vm with virtio-net interface. 


Is the client logs, we are seeing this kind of logs: 

[jeu. nov. 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con 
state OPEN) 
[jeu. nov. 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state 
OPEN) 


and in osd logs: 

osd14: 
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> 
x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1) 

osd9: 
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> 
x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1) 


cluster is ceph 13.2.1 

Note that we have a physical firewall between client and server, I'm not sure 
yet if the session could be dropped. (I don't have find any logs in the 
firewall). 

Any idea ? I would like to known if it's a network bug, or ceph bug (not sure 
how to understand the osd logs) 

Regards, 

Alexandre 



client ceph.conf 
 
[client] 
fuse_disable_pagecache = true 
client_reconnect_stale = true 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

2018-11-08 Thread Alexandre DERUMIER
To be more precise,

the log entries occur when the hang has finished.

I have looked at the stats for 10 different hangs, and the duration is always 
around 15 minutes.

Maybe related to:

ms tcp read timeout
Description: If a client or daemon makes a request to another Ceph daemon and 
does not drop an unused connection, the ms tcp read timeout defines the 
connection as idle after the specified number of seconds.
Type:        Unsigned 64-bit Integer
Required:    No
Default:     900 (15 minutes)

?

I found a similar bug report involving a firewall too:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html


- Mail original -
De: "aderumier" 
À: "ceph-users" 
Envoyé: Jeudi 8 Novembre 2018 18:16:20
Objet: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN)

Hi, 

we are currently test cephfs with kernel module (4.17 and 4.18) instead fuse 
(worked fine), 

and we have hang, iowait jump like crazy for around 20min. 

client is a qemu 2.12 vm with virtio-net interface. 


Is the client logs, we are seeing this kind of logs: 

[jeu. nov. 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con 
state OPEN) 
[jeu. nov. 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state 
OPEN) 


and in osd logs: 

osd14: 
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> 
x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1) 

osd9: 
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> 
x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1) 


cluster is ceph 13.2.1 

Note that we have a physical firewall between client and server, I'm not sure 
yet if the session could be dropped. (I don't have find any logs in the 
firewall). 

Any idea ? I would like to known if it's a network bug, or ceph bug (not sure 
how to understand the osd logs) 

Regards, 

Alexandre 



client ceph.conf 
 
[client] 
fuse_disable_pagecache = true 
client_reconnect_stale = true 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket closed (con state OPEN)

2018-11-08 Thread Alexandre DERUMIER
Hi,

we are currently testing cephfs with the kernel module (4.17 and 4.18) instead of 
fuse (which worked fine),

and we get hangs where iowait jumps like crazy for around 20min.

The client is a qemu 2.12 vm with a virtio-net interface.


In the client logs, we are seeing this kind of message:

[jeu. nov.  8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con 
state OPEN)
[jeu. nov.  8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con 
state OPEN)


and in osd logs:

osd14:
2018-11-08 12:20:25.247 7f31ffac8700  0 -- x.x.x.x:6801/1745 >> 
x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1)

osd9:
2018-11-08 12:42:09.820 7f7ca970e700  0 -- x.x.x.x:6821/1739 >> 
x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1)


cluster is ceph 13.2.1

Note that we have a physical firewall between client and server; I'm not sure 
yet whether the session could be getting dropped. (I haven't found any logs on 
the firewall.)

Any idea? I would like to know if it's a network bug or a ceph bug (I'm not sure 
how to interpret the osd logs).

Regards,

Alexandre



client ceph.conf

[client]
fuse_disable_pagecache = true
client_reconnect_stale = true


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

2018-10-17 Thread Alexandre DERUMIER
Hi,

Is it possible to have more info or an official announcement about this problem?

I'm currently waiting to migrate from luminous to mimic (I need the new quota 
feature for cephfs).

Is it safe to upgrade to 13.2.2?

Or is it better to wait for 13.2.3, or to install 13.2.1 for now?

--
Alexandre

- Mail original -
De: "Patrick Donnelly" 
À: "Zheng Yan" 
Cc: "ceph-devel" , "ceph-users" 
, ceph-annou...@lists.ceph.com
Envoyé: Lundi 8 Octobre 2018 18:50:59
Objet: Re: [ceph-users] Don't upgrade to 13.2.2 if you use cephfs

+ceph-announce 

On Sun, Oct 7, 2018 at 7:30 PM Yan, Zheng  wrote: 
> There is a bug in v13.2.2 mds, which causes decoding purge queue to 
> fail. If mds is already in damaged state, please downgrade mds to 
> 13.2.1, then run 'ceph mds repaired fs_name:damaged_rank' . 
> 
> Sorry for all the trouble I caused. 
> Yan, Zheng 

This issue is being tracked here: http://tracker.ceph.com/issues/36346 

The problem was caused by a backport of the wrong commit which 
unfortunately was not caught. The backport was not done to Luminous; 
only Mimic 13.2.2 is affected. New deployments on 13.2.2 are also 
affected but do not require immediate action. A procedure for handling 
upgrades of fresh deployments from 13.2.2 to 13.2.3 will be included 
in the release notes for 13.2.3. 
-- 
Patrick Donnelly 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic and Debian 9

2018-10-17 Thread Alexandre DERUMIER
It's also possible to install the ubuntu xenial packages on stretch,

but this needs the old libssl1.0.0 package (you can install the deb from jessie 
manually).
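
A rough sketch of the procedure (the libssl1.0.0 .deb has to be fetched from a 
jessie mirror by hand; the repo and key urls are the standard ceph.com ones, 
double-check them for your release):

dpkg -i libssl1.0.0_*_amd64.deb      # the .deb previously downloaded from a jessie mirror

cat > /etc/apt/sources.list.d/ceph.list <<'EOF'
deb https://download.ceph.com/debian-mimic/ xenial main
EOF
wget -q -O- 'https://download.ceph.com/keys/release.asc' | apt-key add -
apt-get update && apt-get install ceph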


- Mail original -
De: "Hervé Ballans" 
À: "ceph-users" 
Envoyé: Mercredi 17 Octobre 2018 11:21:14
Objet: [ceph-users] Mimic and Debian 9

Hi, 

I just wanted to know if we had a chance soon to install Mimic on Debian 
9 ?! ;) 

I know there is a problem with the required version gcc (compatible with 
c++17) that is not yet backported on current stable version of Debian, 
but is there any news on this side ? 

Regards, 
Hervé 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Anyone tested Samsung 860 DCT SSDs?

2018-10-12 Thread Alexandre DERUMIER
I haven't tested them, but be careful of the DWPD rating:

0.2 DWPD

:/
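
Back-of-the-envelope math, assuming the 960GB model and a 5-year window (adjust to 
the exact spec sheet):

# 0.2 drive writes/day * 960 GB * 365 days * 5 years ~= 350 TB written in total
echo '0.2 * 960 * 365 * 5 / 1000' | bc -l    # ~350 TB -- easy to burn through with bluestore WAL/DB traffic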

- Mail original -
De: "Kenneth Van Alstyne" 
À: "ceph-users" 
Envoyé: Vendredi 12 Octobre 2018 15:53:43
Objet: [ceph-users] Anyone tested Samsung 860 DCT SSDs?

Cephers: 
As the subject suggests, has anyone tested Samsung 860 DCT SSDs? They are 
really inexpensive and we are considering buying some to test. 

Thanks, 

-- 
Kenneth Van Alstyne 
Systems Architect 
Knight Point Systems, LLC 
Service-Disabled Veteran-Owned Business 
1775 Wiehle Avenue Suite 101 | Reston, VA 20190 
c: 228-547-8045 f: 571-266-3106 
www.knightpoint.com 
DHS EAGLE II Prime Contractor: FC1 SDVOSB Track 
GSA Schedule 70 SDVOSB: GS-35F-0646S 
GSA MOBIS Schedule: GS-10F-0404Y 
ISO 2 / ISO 27001 / CMMI Level 3 

Notice: This e-mail message, including any attachments, is for the sole use of 
the intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, copy, use, disclosure, or distribution is 
STRICTLY prohibited. If you are not the intended recipient, please contact the 
sender by reply e-mail and destroy all copies of the original message. 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-08-08 Thread Alexandre DERUMIER
Hi,

I upgraded to 12.2.7 two weeks ago,
and I don't see the memory increase anymore! (I can't confirm that it was related 
to your patch.)


Thanks again for helping !

Regards,

Alexandre Derumier


- Mail original -
De: "Zheng Yan" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mardi 29 Mai 2018 04:40:27
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

Could you try path https://github.com/ceph/ceph/pull/22240/files. 

The leakage of MMDSBeacon messages can explain your issue. 

Regards 
Yan, Zheng 





On Mon, May 28, 2018 at 12:06 PM, Alexandre DERUMIER 
 wrote: 
>>>could you send me full output of dump_mempools 
> 
> # ceph daemon mds.ceph4-2.odiso.net dump_mempools 
> { 
> "bloom_filter": { 
> "items": 41262668, 
> "bytes": 41262668 
> }, 
> "bluestore_alloc": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_cache_data": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_cache_onode": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_cache_other": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_fsck": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_txc": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_writing_deferred": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_writing": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluefs": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "buffer_anon": { 
> "items": 712726, 
> "bytes": 106964870 
> }, 
> "buffer_meta": { 
> "items": 15, 
> "bytes": 1320 
> }, 
> "osd": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "osd_mapbl": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "osd_pglog": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "osdmap": { 
> "items": 216, 
> "bytes": 12168 
> }, 
> "osdmap_mapping": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "pgmap": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "mds_co": { 
> "items": 50741038, 
> "bytes": 5114319203 
> }, 
> "unittest_1": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "unittest_2": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "total": { 
> "items": 92716663, 
> "bytes": 5262560229 
> } 
> } 
> 
> 
> 
> 
> 
> ceph daemon mds.ceph4-2.odiso.net perf dump 
> { 
> "AsyncMessenger::Worker-0": { 
> "msgr_recv_messages": 1276789161, 
> "msgr_send_messages": 1317625246, 
> "msgr_recv_bytes": 10630409633633, 
> "msgr_send_bytes": 1093972769957, 
> "msgr_created_connections": 207, 
> "msgr_active_connections": 204, 
> "msgr_running_total_time": 63745.463077594, 
> "msgr_running_send_time": 22210.867549070, 
> "msgr_running_recv_time": 51944.624353942, 
> "msgr_running_fast_dispatch_time": 9185.274084187 
> }, 
> "AsyncMessenger::Worker-1": { 
> "msgr_recv_messages": 641622644, 
> "msgr_send_messages": 616664293, 
> "msgr_recv_bytes": 7287546832466, 
> "msgr_send_bytes": 588278035895, 
> "msgr_created_connections": 494, 
> "msgr_active_connections": 494, 
> "msgr_running_total_time": 35390.081250881, 
> "msgr_running_send_time": 11559.689889195, 
> "msgr_running_recv_time": 29844.885712902, 
> "msgr_running_fast_dispatch_time": 6361.466445253 
> }, 
> "AsyncMessenger::Worker-2": { 
> "msgr_recv_messages": 1972469623, 
> "msgr_send_messages": 1886060294, 
> "msgr_recv_bytes": 7924136565846, 
> "msgr_send_bytes": 5072502101797, 
> "msgr_created_connections": 181, 
> "msgr_active_connections": 176, 
> "msgr_running_total_time": 93257.811989806, 
> "msgr_running_send_time": 35556.662488302, 
> "msgr_running_recv_time": 81686.262228047, 
> "msgr_running_fast_dispatch_time": 6476.875317930 
> }, 
> "finisher-PurgeQueue": { 
> "queue_len": 0, 
> "complete_latency": { 
> "a

Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Alexandre DERUMIER
Hi,

qemu uses only 1 thread per disk; generally the performance limitation comes from 
the cpu.

(You can have a dedicated thread for each disk using iothreads.)

I'm not sure how it works with krbd, but with librbd and the qemu rbd driver, 
only 1 core is used per disk.

So you need a fast cpu frequency, and you should disable the rbd cache, disable 
client-side debug logging, or use other options which can lower cpu usage on the 
client side.
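
A hedged sketch of the knobs I mean (option names can differ between qemu/libvirt 
versions; pool/image names are placeholders):

# ceph.conf on the hypervisor, [client] section:
#   rbd cache = false
#   debug ms = 0/0
#   debug rbd = 0/0
#   debug objecter = 0/0

# qemu: a dedicated iothread per virtio disk
qemu-system-x86_64 ... \
    -object iothread,id=iothread1 \
    -drive id=drive0,file=rbd:pool/vm-disk,format=raw,if=none,cache=none \
    -device virtio-blk-pci,drive=drive0,iothread=iothread1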



- Mail original -
De: "Nikola Ciprich" 
À: "ceph-users" 
Cc: "nik" 
Envoyé: Mercredi 18 Juillet 2018 16:54:58
Objet: [ceph-users] krbd vs librbd performance with qemu

Hi, 

historically I've found many discussions about this topic in 
last few years, but it seems to me to be still a bit unresolved 
so I'd like to open the question again.. 

In all flash deployments, under 12.2.5 luminous and qemu 12.2.0 
using lbirbd, I'm getting much worse results regarding IOPS then 
with KRBD and direct block device access.. 

I'm testing on the same 100GB RBD volume, notable ceph settings: 

client rbd cache disabled 
osd_enable_op_tracker = False 
osd_op_num_shards = 64 
osd_op_num_threads_per_shard = 1 

osds are running bluestore, 2 replicas (it's just for testing) 

when I run FIO using librbd directly, I'm getting ~160k reads/s 
and ~60k writes/s which is not that bad. 

however when I run fio on block device under VM (qemu using librbd), 
I'm getting only 60/40K op/s which is a huge loss.. 

when I use VM with block access to krbd mapped device, numbers 
are much better, I'm getting something like 115/40K op/s which 
is not ideal, but still much better.. tried many optimisations 
and configuration variants (multiple queues, threads vs native aio 
etc), but krbd still performs much much better.. 

My question is whether this is expected, or should both access methods 
give more similar results? If possible, I'd like to stick to librbd 
(especially because krbd still lacks layering support, but there are 
more reasons) 

interesting is, that when I compare fio direct ceph access, librbd performs 
better then KRBD, but this doesn't concern me that much.. 

another question, during the tests, I noticed that enabling exclusive lock 
feature degrades write iops a lot as well, is this expected? (the performance 
falls to someting like 50%) 

I'm doing the tests on small 2 node cluster, VMS are running directly on ceph 
nodes, 
all is centos 7 with 4.14 kernel. (I know it's not recommended to run VMs 
directly 
on ceph nodes, but for small deployments it's necessary for us) 

if I could provide more details, I'll be happy to do so 

BR 

nik 


-- 
- 
Ing. Nikola CIPRICH 
LinuxBox.cz, s.r.o. 
28.rijna 168, 709 00 Ostrava 

tel.: +420 591 166 214 
fax: +420 596 621 273 
mobil: +420 777 093 799 
www.linuxbox.cz 

mobil servis: +420 737 238 656 
email servis: ser...@linuxbox.cz 
- 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: v13.2.0 Mimic is out

2018-06-01 Thread Alexandre DERUMIER
CephFS snapshot is now stable and enabled by default on new filesystems 


:) 






Alexandre Derumier 
Systems and storage engineer 

Manager Infrastructure 


Fixe : +33 3 59 82 20 10 



125 Avenue de la république 
59110 La Madeleine 
[ https://twitter.com/OdisoHosting ] [ https://twitter.com/mindbaz ] [ 
https://www.linkedin.com/company/odiso ] [ 
https://www.viadeo.com/fr/company/odiso ] [ 
https://www.facebook.com/monsiteestlent ] 

[ https://www.monsiteestlent.com/ | MonSiteEstLent.com ] - Blog dedicated to web 
performance and handling traffic spikes 






De: "ceph"  
À: "ceph-users"  
Envoyé: Vendredi 1 Juin 2018 14:48:13 
Objet: [ceph-users] Fwd: v13.2.0 Mimic is out 

FYI 


De: "Abhishek"  À: "ceph-devel" 
, "ceph-users" , 
ceph-maintain...@ceph.com, ceph-annou...@ceph.com Envoyé: Vendredi 1 
Juin 2018 14:11:00 Objet: v13.2.0 Mimic is out 
We're glad to announce the first stable release of Mimic, the next long 
term release series. There have been major changes since Luminous, so 
please read the upgrade notes carefully. 
We'd also like to highlight that we've had contributions from over 282 
contributors for Mimic, and would like to thank everyone for the 
continued support. The next major release of Ceph will be called Nautilus. 
For the detailed changelog, please refer to the release blog at 
https://ceph.com/releases/v13-2-0-mimic-released/ 
Major Changes from Luminous 
- *Dashboard*: 
* The (read-only) Ceph manager dashboard introduced in Ceph Luminous has 
been replaced with a new implementation inspired by and derived from the 
openATTIC[1] Ceph management tool, providing a drop-in replacement 
offering a number of additional management features 
- *RADOS*: 
* Config options can now be centrally stored and managed by the monitor 
(see the example after this list). 
* The monitor daemon uses significantly less disk space when undergoing 
recovery or rebalancing operations. 
* An *async recovery* feature reduces the tail latency of requests when the 
OSDs are recovering from a recent failure. 
* OSD preemption of scrub by conflicting requests reduces tail latency. 
- *RGW*: 
* RGW can now replicate a zone (or a subset of buckets) to an external 
cloud storage service like S3. 
* RGW now supports the S3 multi-factor authentication API on versioned buckets. 
* The Beast frontend is no longer experimental and is considered stable and 
ready for use. 
- *CephFS*: 
* Snapshots are now stable when combined with multiple MDS daemons. 
- *RBD*: 
* Image clones no longer require explicit *protect* and *unprotect* steps. 
* Images can be deep-copied (including any clone linkage to a parent image 
and associated snapshots) to new pools or with altered data layouts. 
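
To illustrate the centralized configuration store mentioned in the RADOS changes 
above (command names as documented for Mimic; the option and daemon names are 
only examples): 

# ceph config set osd osd_max_backfills 2 
# ceph config dump 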
Upgrading from Luminous 
Notes 
* We recommend you avoid creating any RADOS pools while the upgrade is 
in process. 
* You can monitor the progress of your upgrade at each stage with the 
`ceph versions` command, which will tell you what ceph version(s) are 
running for each type of daemon. 
Instructions  
#. Make sure your cluster is stable and healthy (no down or recovering 
OSDs). (Optional, but recommended.) 
#. Set the `noout` flag for the duration of the upgrade. (Optional, but 
recommended.):: 
# ceph osd set noout 
#. Upgrade monitors by installing the new packages and restarting the 
monitor daemons.:: 
# systemctl restart ceph-mon.target 
Verify the monitor upgrade is complete once all monitors are up by 
looking for the `mimic` feature string in the mon map. For example:: 
# ceph mon feature ls 
should include `mimic` under persistent features:: 
on current monmap (epoch NNN) persistent: [kraken,luminous,mimic] 
required: [kraken,luminous,mimic] 
#. Upgrade `ceph-mgr` daemons by installing the new packages and 
restarting with:: 
# systemctl restart ceph-mgr.target 
Verify the ceph-mgr daemons are running by checking `ceph -s`:: 
# ceph -s 
... services: mon: 3 daemons, quorum foo,bar,baz mgr: foo(active), 
standbys: bar, baz ... 
#. Upgrade all OSDs by installing the new packages and restarting the 
ceph-osd daemons on all hosts:: 
# systemctl restart ceph-osd.target 
You can monitor the progress of the OSD upgrades with the new `ceph 
versions` or `ceph osd versions` command:: 
# ceph osd versions { "ceph version 12.2.5 (...) luminous (stable)": 12, 
"ceph version 13.2.0 (...) mimic (stable)": 22, } 
#. Upgrade all CephFS MDS daemons. For each CephFS file system, 
#. Reduce the number of ranks to 1. (Make note of the original number of 
MDS daemons first if you plan to restore it later.):: 
# ceph status 
# ceph fs set <fs_name> max_mds 1 
#. Wait for the cluster to deactivate any non-zero ranks by periodically 
checking the status:: 
# ceph status 
#. Take all standby MDS daemons offline on the appropriate hosts with:: 
# systemctl stop ceph-mds@<daemon_name> 
#. Confirm that only one MDS is online and is rank 0 for your FS:: 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-29 Thread Alexandre DERUMIER
>>Could you try path https://github.com/ceph/ceph/pull/22240/files.
>>
>>The leakage of  MMDSBeacon messages can explain your issue.

Thanks. I can't test it in production for now, and I can't reproduce it in my 
test environment.

I'll wait for the next luminous release to test it.

Thank you very much again !

I'll keep you posted in this thread.

Regards,

Alexandre

- Mail original -
De: "Zheng Yan" 
À: "aderumier" 
Cc: "ceph-users" 
Envoyé: Mardi 29 Mai 2018 04:40:27
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

Could you try path https://github.com/ceph/ceph/pull/22240/files. 

The leakage of MMDSBeacon messages can explain your issue. 

Regards 
Yan, Zheng 





On Mon, May 28, 2018 at 12:06 PM, Alexandre DERUMIER 
 wrote: 
>>>could you send me full output of dump_mempools 
> 
> # ceph daemon mds.ceph4-2.odiso.net dump_mempools 
> { 
> "bloom_filter": { 
> "items": 41262668, 
> "bytes": 41262668 
> }, 
> "bluestore_alloc": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_cache_data": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_cache_onode": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_cache_other": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_fsck": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_txc": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_writing_deferred": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluestore_writing": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "bluefs": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "buffer_anon": { 
> "items": 712726, 
> "bytes": 106964870 
> }, 
> "buffer_meta": { 
> "items": 15, 
> "bytes": 1320 
> }, 
> "osd": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "osd_mapbl": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "osd_pglog": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "osdmap": { 
> "items": 216, 
> "bytes": 12168 
> }, 
> "osdmap_mapping": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "pgmap": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "mds_co": { 
> "items": 50741038, 
> "bytes": 5114319203 
> }, 
> "unittest_1": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "unittest_2": { 
> "items": 0, 
> "bytes": 0 
> }, 
> "total": { 
> "items": 92716663, 
> "bytes": 5262560229 
> } 
> } 
> 
> 
> 
> 
> 
> ceph daemon mds.ceph4-2.odiso.net perf dump 
> { 
> "AsyncMessenger::Worker-0": { 
> "msgr_recv_messages": 1276789161, 
> "msgr_send_messages": 1317625246, 
> "msgr_recv_bytes": 10630409633633, 
> "msgr_send_bytes": 1093972769957, 
> "msgr_created_connections": 207, 
> "msgr_active_connections": 204, 
> "msgr_running_total_time": 63745.463077594, 
> "msgr_running_send_time": 22210.867549070, 
> "msgr_running_recv_time": 51944.624353942, 
> "msgr_running_fast_dispatch_time": 9185.274084187 
> }, 
> "AsyncMessenger::Worker-1": { 
> "msgr_recv_messages": 641622644, 
> "msgr_send_messages": 616664293, 
> "msgr_recv_bytes": 7287546832466, 
> "msgr_send_bytes": 588278035895, 
> "msgr_created_connections": 494, 
> "msgr_active_connections": 494, 
> "msgr_running_total_time": 35390.081250881, 
> "msgr_running_send_time": 11559.689889195, 
> "msgr_running_recv_time": 29844.885712902, 
> "msgr_running_fast_dispatch_time": 6361.466445253 
> }, 
> "AsyncMessenger::Worker-2": { 
> "msgr_recv_messages": 1972469623, 
> "msgr_send_messages": 1886060294, 
> "msgr_recv_bytes": 7924136565846, 
> "msgr_send_bytes": 5072502101797, 
> "msgr_created_connections": 181, 
> "msgr_active_connections": 176, 
> "msgr_running_total_time": 93257.811989806, 
> "msgr_running_send_time": 35556.662488302, 
> "msgr_running_recv_time": 81686.262228047, 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-27 Thread Alexandre DERUMIER
23544370,
"op_rmw": 0,
"op_pg": 0,
"osdop_stat": 2843429,
"osdop_create": 5729675,
"osdop_read": 126350,
"osdop_write": 89171030,
"osdop_writefull": 365835,
"osdop_writesame": 0,
"osdop_append": 0,
"osdop_zero": 2,
"osdop_truncate": 15,
"osdop_delete": 4128067,
"osdop_mapext": 0,
"osdop_sparse_read": 0,
"osdop_clonerange": 0,
"osdop_getxattr": 46958217,
"osdop_setxattr": 11459350,
"osdop_cmpxattr": 0,
"osdop_rmxattr": 0,
"osdop_resetxattrs": 0,
"osdop_tmap_up": 0,
"osdop_tmap_put": 0,
"osdop_tmap_get": 0,
"osdop_call": 0,
"osdop_watch": 0,
"osdop_notify": 0,
"osdop_src_cmpxattr": 0,
"osdop_pgls": 0,
"osdop_pgls_filter": 0,
"osdop_other": 20547060,
"linger_active": 0,
"linger_send": 0,
"linger_resend": 0,
"linger_ping": 0,
"poolop_active": 0,
"poolop_send": 0,
"poolop_resend": 0,
"poolstat_active": 0,
"poolstat_send": 0,
"poolstat_resend": 0,
"statfs_active": 0,
"statfs_send": 0,
"statfs_resend": 0,
"command_active": 0,
"command_send": 0,
"command_resend": 0,
"map_epoch": 4048,
"map_full": 0,
"map_inc": 742,
"osd_sessions": 18,
"osd_session_open": 26,
"osd_session_close": 8,
"osd_laggy": 0,
"omap_wr": 6209755,
"omap_rd": 346748196,
"omap_del": 605991
},
"purge_queue": {
"pq_executing_ops": 0,
"pq_executing": 0,
"pq_executed": 3118819
},
"throttle-msgr_dispatch_throttler-mds": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 3890881428,
"get_sum": 25554167806273,
"get_or_fail_fail": 0,
"get_or_fail_success": 3890881428,
"take": 0,
"take_sum": 0,
"put": 3890881428,
"put_sum": 25554167806273,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_bytes": {
"val": 0,
"max": 104857600,
"get_started": 0,
    "get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 297200336,
"take_sum": 944272996789,
"put": 272525107,
"put_sum": 944272996789,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_ops": {
"val": 0,
    "max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 297200336,
"take_sum": 297200336,
"put": 297200336,
"put_sum": 297200336,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle": {
"val": 0,
"max": 3758096384,
"get_started": 0,
"get": 3118819,
"get_sum": 290050463,
"get_or_fail_fail": 0,
"get_or_fail_success": 3118819,
"take": 0,
"take_sum": 0,
"put": 126240,
"put_sum": 290050463,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle-0x55decea8e140": {
"val": 117619,
"max": 3758096384,
"get_started": 0,

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-24 Thread Alexandre DERUMIER
Here is the result:


root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net flush journal
{
"message": "",
"return_code": 0
}
root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 
1
{
"success": "mds_cache_size = '1' (not observed, change may require 
restart) "
}

wait ...


root@ceph4-2:~# ceph tell mds.ceph4-2.odiso.net heap stats
2018-05-25 07:44:02.185911 7f4cad7fa700  0 client.50748489 ms_handle_reset on 
10.5.0.88:6804/994206868
2018-05-25 07:44:02.196160 7f4cae7fc700  0 client.50792764 ms_handle_reset on 
10.5.0.88:6804/994206868
mds.ceph4-2.odiso.net tcmalloc heap 
stats:
MALLOC:13175782328 (12565.4 MiB) Bytes in use by application
MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
MALLOC: +   1774628488 ( 1692.4 MiB) Bytes in central cache freelist
MALLOC: + 34274608 (   32.7 MiB) Bytes in transfer cache freelist
MALLOC: + 57260176 (   54.6 MiB) Bytes in thread cache freelists
MALLOC: +120582336 (  115.0 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =  15162527936 (14460.1 MiB) Actual memory used (physical + swap)
MALLOC: +   4974067712 ( 4743.6 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =  20136595648 (19203.8 MiB) Virtual address space used
MALLOC:
MALLOC:1852388  Spans in use
MALLOC: 18  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.


root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net config set mds_cache_size 0
{
"success": "mds_cache_size = '0' (not observed, change may require restart) 
"
}
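
As a side note: if you want tcmalloc's freelist memory handed back to the OS (as 
the heap stats output above suggests), it can be triggered through the same tell 
interface, reusing the daemon name from this example: 

root@ceph4-2:~# ceph tell mds.ceph4-2.odiso.net heap release 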

- Mail original -
De: "Zheng Yan" <uker...@gmail.com>
À: "aderumier" <aderum...@odiso.com>
Envoyé: Vendredi 25 Mai 2018 05:56:31
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Thu, May 24, 2018 at 11:34 PM, Alexandre DERUMIER 
<aderum...@odiso.com> wrote: 
>>>Still don't find any clue. Does the cephfs have idle period. If it 
>>>has, could you decrease mds's cache size and check what happens. For 
>>>example, run following commands during the old period. 
> 
>>>ceph daemon mds.xx flush journal 
>>>ceph daemon mds.xx config set mds_cache_size 1; 
>>>"wait a minute" 
>>>ceph tell mds.xx heap stats 
>>>ceph daemon mds.xx config set mds_cache_size 0 
> 
> ok thanks. I'll try this night. 
> 
> I have already mds_cache_memory_limit = 5368709120, 
> 
> does it need to remove it first before setting mds_cache_size 1 ? 

no 
> 
> 
> 
> 
> - Mail original - 
> De: "Zheng Yan" <uker...@gmail.com> 
> À: "aderumier" <aderum...@odiso.com> 
> Cc: "ceph-users" <ceph-users@lists.ceph.com> 
> Envoyé: Jeudi 24 Mai 2018 16:27:21 
> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 
> 
> On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER <aderum...@odiso.com> 
> wrote: 
>> Thanks! 
>> 
>> 
>> here the profile.pdf 
>> 
>> 10-15min profiling, I can't do it longer because my clients where lagging. 
>> 
>> but I think it should be enough to observe the rss memory increase. 
>> 
>> 
> 
> Still don't find any clue. Does the cephfs have idle period. If it 
> has, could you decrease mds's cache size and check what happens. For 
> example, run following commands during the old period. 
> 
> ceph daemon mds.xx flush journal 
> ceph daemon mds.xx config set mds_cache_size 1; 
> "wait a minute" 
> ceph tell mds.xx heap stats 
> ceph daemon mds.xx config set mds_cache_size 0 
> 
> 
>> 
>> 
>> - Mail original - 
>> De: "Zheng Yan" <uker...@gmail.com> 
>> À: "aderumier" <aderum...@odiso.com> 
>> Cc: "ceph-users" <ceph-users@lists.ceph.com> 
>> Envoyé: Jeudi 24 Mai 2018 11:34:20 
>> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 
>> 
>> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER <aderum...@odiso.com> 
>> wrote: 
>>> Hi,some new stats, mds memory is not 16G, 
>>> 
>>> I have almost same number of items and bytes in cache vs some weeks ago 
>>> when mds was using 8G. (ceph 12.2.5) 
>>> 
>>> 
>>> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf 
>>> dump | jq '.mds_mem.rss'; ceph

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-24 Thread Alexandre DERUMIER
>>Still don't find any clue. Does the cephfs have idle period. If it 
>>has, could you decrease mds's cache size and check what happens. For 
>>example, run following commands during the old period. 

>>ceph daemon mds.xx flush journal 
>>ceph daemon mds.xx config set mds_cache_size 1; 
>>"wait a minute" 
>>ceph tell mds.xx heap stats 
>>ceph daemon mds.xx config set mds_cache_size 0 

ok thanks. I'll try this tonight.

I already have mds_cache_memory_limit = 5368709120;

do I need to remove it first before setting mds_cache_size to 1 ?




- Mail original -
De: "Zheng Yan" <uker...@gmail.com>
À: "aderumier" <aderum...@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Jeudi 24 Mai 2018 16:27:21
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Thu, May 24, 2018 at 7:22 PM, Alexandre DERUMIER <aderum...@odiso.com> 
wrote: 
> Thanks! 
> 
> 
> here the profile.pdf 
> 
> 10-15min profiling, I can't do it longer because my clients where lagging. 
> 
> but I think it should be enough to observe the rss memory increase. 
> 
> 

Still don't find any clue. Does the cephfs have idle period. If it 
has, could you decrease mds's cache size and check what happens. For 
example, run following commands during the old period. 

ceph daemon mds.xx flush journal 
ceph daemon mds.xx config set mds_cache_size 1; 
"wait a minute" 
ceph tell mds.xx heap stats 
ceph daemon mds.xx config set mds_cache_size 0 


> 
> 
> - Mail original - 
> De: "Zheng Yan" <uker...@gmail.com> 
> À: "aderumier" <aderum...@odiso.com> 
> Cc: "ceph-users" <ceph-users@lists.ceph.com> 
> Envoyé: Jeudi 24 Mai 2018 11:34:20 
> Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 
> 
> On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER <aderum...@odiso.com> 
> wrote: 
>> Hi,some new stats, mds memory is not 16G, 
>> 
>> I have almost same number of items and bytes in cache vs some weeks ago when 
>> mds was using 8G. (ceph 12.2.5) 
>> 
>> 
>> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf 
>> dump | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | 
>> jq -c '.mds_co'; done 
>> 16905052 
>> {"items":43350988,"bytes":5257428143} 
>> 16905052 
>> {"items":43428329,"bytes":5283850173} 
>> 16905052 
>> {"items":43209167,"bytes":5208578149} 
>> 16905052 
>> {"items":43177631,"bytes":5198833577} 
>> 16905052 
>> {"items":43312734,"bytes":5252649462} 
>> 16905052 
>> {"items":43355753,"bytes":5277197972} 
>> 16905052 
>> {"items":43700693,"bytes":5303376141} 
>> 16905052 
>> {"items":43115809,"bytes":5156628138} 
>> ^C 
>> 
>> 
>> 
>> 
>> root@ceph4-2:~# ceph status 
>> cluster: 
>> id: e22b8e83-3036-4fe5-8fd5-5ce9d539beca 
>> health: HEALTH_OK 
>> 
>> services: 
>> mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3 
>> mgr: ceph4-1.odiso.net(active), standbys: ceph4-2.odiso.net, 
>> ceph4-3.odiso.net 
>> mds: cephfs4-1/1/1 up {0=ceph4-2.odiso.net=up:active}, 2 up:standby 
>> osd: 18 osds: 18 up, 18 in 
>> rgw: 3 daemons active 
>> 
>> data: 
>> pools: 11 pools, 1992 pgs 
>> objects: 75677k objects, 6045 GB 
>> usage: 20579 GB used, 6246 GB / 26825 GB avail 
>> pgs: 1992 active+clean 
>> 
>> io: 
>> client: 14441 kB/s rd, 2550 kB/s wr, 371 op/s rd, 95 op/s wr 
>> 
>> 
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net cache status 
>> { 
>> "pool": { 
>> "items": 44523608, 
>> "bytes": 5326049009 
>> } 
>> } 
>> 
>> 
>> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net perf dump 
>> { 
>> "AsyncMessenger::Worker-0": { 
>> "msgr_recv_messages": 798876013, 
>> "msgr_send_messages": 825999506, 
>> "msgr_recv_bytes": 7003223097381, 
>> "msgr_send_bytes": 691501283744, 
>> "msgr_created_connections": 148, 
>> "msgr_active_connections": 146, 
>> "msgr_running_total_time": 39914.832387470, 
>> "msgr_running_send_time": 13744.704199430, 
>> "msgr_running_recv_time": 32342.160588451, 
>> "msgr_running_fast_dispatch_time": 5996.336446782 
>> }, 
>> "AsyncMe

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-24 Thread Alexandre DERUMIER
Thanks!


here is the profile.pdf 

10-15 min of profiling; I can't do it longer because my clients were lagging.

but I think it should be enough to observe the rss memory increase.


 

- Mail original -
De: "Zheng Yan" <uker...@gmail.com>
À: "aderumier" <aderum...@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Jeudi 24 Mai 2018 11:34:20
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Tue, May 22, 2018 at 3:11 PM, Alexandre DERUMIER <aderum...@odiso.com> 
wrote: 
> Hi,some new stats, mds memory is not 16G, 
> 
> I have almost same number of items and bytes in cache vs some weeks ago when 
> mds was using 8G. (ceph 12.2.5) 
> 
> 
> root@ceph4-2:~# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf dump 
> | jq '.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | jq -c 
> '.mds_co'; done 
> 16905052 
> {"items":43350988,"bytes":5257428143} 
> 16905052 
> {"items":43428329,"bytes":5283850173} 
> 16905052 
> {"items":43209167,"bytes":5208578149} 
> 16905052 
> {"items":43177631,"bytes":5198833577} 
> 16905052 
> {"items":43312734,"bytes":5252649462} 
> 16905052 
> {"items":43355753,"bytes":5277197972} 
> 16905052 
> {"items":43700693,"bytes":5303376141} 
> 16905052 
> {"items":43115809,"bytes":5156628138} 
> ^C 
> 
> 
> 
> 
> root@ceph4-2:~# ceph status 
> cluster: 
> id: e22b8e83-3036-4fe5-8fd5-5ce9d539beca 
> health: HEALTH_OK 
> 
> services: 
> mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3 
> mgr: ceph4-1.odiso.net(active), standbys: ceph4-2.odiso.net, 
> ceph4-3.odiso.net 
> mds: cephfs4-1/1/1 up {0=ceph4-2.odiso.net=up:active}, 2 up:standby 
> osd: 18 osds: 18 up, 18 in 
> rgw: 3 daemons active 
> 
> data: 
> pools: 11 pools, 1992 pgs 
> objects: 75677k objects, 6045 GB 
> usage: 20579 GB used, 6246 GB / 26825 GB avail 
> pgs: 1992 active+clean 
> 
> io: 
> client: 14441 kB/s rd, 2550 kB/s wr, 371 op/s rd, 95 op/s wr 
> 
> 
> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net cache status 
> { 
> "pool": { 
> "items": 44523608, 
> "bytes": 5326049009 
> } 
> } 
> 
> 
> root@ceph4-2:~# ceph daemon mds.ceph4-2.odiso.net perf dump 
> { 
> "AsyncMessenger::Worker-0": { 
> "msgr_recv_messages": 798876013, 
> "msgr_send_messages": 825999506, 
> "msgr_recv_bytes": 7003223097381, 
> "msgr_send_bytes": 691501283744, 
> "msgr_created_connections": 148, 
> "msgr_active_connections": 146, 
> "msgr_running_total_time": 39914.832387470, 
> "msgr_running_send_time": 13744.704199430, 
> "msgr_running_recv_time": 32342.160588451, 
> "msgr_running_fast_dispatch_time": 5996.336446782 
> }, 
> "AsyncMessenger::Worker-1": { 
> "msgr_recv_messages": 429668771, 
> "msgr_send_messages": 414760220, 
> "msgr_recv_bytes": 5003149410825, 
> "msgr_send_bytes": 396281427789, 
> "msgr_created_connections": 132, 
> "msgr_active_connections": 132, 
> "msgr_running_total_time": 23644.410515392, 
> "msgr_running_send_time": 7669.068710688, 
> "msgr_running_recv_time": 19751.610043696, 
> "msgr_running_fast_dispatch_time": 4331.023453385 
> }, 
> "AsyncMessenger::Worker-2": { 
> "msgr_recv_messages": 1312910919, 
> "msgr_send_messages": 1260040403, 
> "msgr_recv_bytes": 5330386980976, 
> "msgr_send_bytes": 3341965016878, 
> "msgr_created_connections": 143, 
> "msgr_active_connections": 138, 
> "msgr_running_total_time": 61696.635450100, 
> "msgr_running_send_time": 23491.027014598, 
> "msgr_running_recv_time": 53858.409319734, 
> "msgr_running_fast_dispatch_time": 4312.451966809 
> }, 
> "finisher-PurgeQueue": { 
> "queue_len": 0, 
> "complete_latency": { 
> "avgcount": 1889416, 
> "sum": 29224.227703697, 
> "avgtime": 0.015467333 
> } 
> }, 
> "mds": { 
> "request": 1822420924, 
> "reply": 1822420886, 
> "reply_latency": { 
> "avgcount": 1822420886, 
> "sum": 5258467.616943274, 
> "avgtime": 0.002885429 
> }, 
> "forward": 0, 
> "dir_fetch": 116035485, 
> "dir_commit": 1865012, 
> "dir_split":

Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Alexandre DERUMIER
Hi,

>>My thoughts on the subject are that even though checksums do allow to find 
>>which replica is corrupt without having to figure which 2 out of 3 copies are 
>>the same, this is not the only reason min_size=2 was required.

AFAIK, 

comparing copies (e.g. checking whether 2 out of 3 copies are the same) has never 
been implemented.
pg_repair, for example, still copies the primary pg to the replicas (even if the 
primary is corrupt).


And an old topic about this:
http://ceph-users.ceph.narkive.com/zS2yZ2FL/how-safe-is-ceph-pg-repair-these-days
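
For reference, the replication settings being debated here are per-pool and are 
changed like this (a minimal sketch; 'rbd' is only an example pool name): 

# ceph osd pool set rbd size 3 
# ceph osd pool set rbd min_size 2 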

- Mail original -
De: "Janne Johansson" 
À: c...@jack.fr.eu.org
Cc: "ceph-users" 
Envoyé: Jeudi 24 Mai 2018 08:33:32
Objet: Re: [ceph-users] Ceph replication factor of 2

Den tors 24 maj 2018 kl 00:20 skrev Jack < [ mailto:c...@jack.fr.eu.org | 
c...@jack.fr.eu.org ] >: 


Hi, 

I have to say, this is a common yet worthless argument. 
If I have 3000 OSDs, using 2 or 3 replicas will not change much: the 
probability of losing 2 devices is still "high". 
On the other hand, if I have a small cluster, less than a hundred OSDs, 
that same probability becomes "low". 



What matters is losing the 2 or 3 OSDs that any particular PG is on, 
not whether there are 1000 other OSDs in the next rack. 
Losing data is rather binary, it's not on a 0.0 -> 1.0 scale. Either a piece 
of data is lost because its storage units are not there, 
or it's not. Murphy's law will make it so that this lost piece of data is rather 
important to you. And Murphy will of course pick the 
2-3 OSDs that are the worst case for you. 

> I do not buy the "if someone is making a maintenance and a device fails" 
> argument either: this is a no-limit goal: what if X servers burn at the same 
> time? What if an admin makes a mistake and drops 5 OSDs? What if some 
> network ToR switches or routers blow away? 
> Should we do one replica per OSD? 


From my viewpoint, maintenance must happen. Unplanned maintenance will happen 
even if I wish it not to. 
So the 2-vs-3 question is about what situation you end up in when one replica is 
under (planned or not) maintenance. 
Is this an "any surprise makes me lose data now" mode, or is it a "many surprises 
need to occur" mode? 

> I would like people, especially the Ceph devs and other people who 
> know how it works deeply (read the code!), to give us their advice. 

How about listening to people who have lost data during 20+ year careers 
in storage? 
They will know a lot more about how the very improbable or "impossible" still 
happened to them 
at the most unfortunate moment, regardless of what the code readers say. 

This is all about weighing risks. If the risk for you is "ok, then I have to 
redownload that lost ubuntu-ISO again" its fine 
to stick to data in only one place. 

If the company goes out of business, or at least faces a 2-day total stop while 
some sleep-deprived admin tries a 
bare-metal restore for the first time in her life, then the price of SATA disks 
to cover 3 replicas will be literally 
nothing compared to that. 

To me it sounds like you are chasing some kind of validation of an answer you 
already have while asking the questions, 
so if you want to go with 2 replicas, then just do it. But you don't get to complain 
to ceph or ceph-users when you also figure out 
that the Mean-Time-Between-Failure ratings on the stickers of the disks are 
bogus and what you really needed was 
"mean time between surprises", and that's always less than MTBF. 

-- 
May the most significant bit of your life be positive. 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-22 Thread Alexandre DERUMIER
;osdop_setxattr": 8628696,
"osdop_cmpxattr": 0,
"osdop_rmxattr": 0,
"osdop_resetxattrs": 0,
"osdop_tmap_up": 0,
"osdop_tmap_put": 0,
"osdop_tmap_get": 0,
"osdop_call": 0,
"osdop_watch": 0,
"osdop_notify": 0,
"osdop_src_cmpxattr": 0,
"osdop_pgls": 0,
"osdop_pgls_filter": 0,
"osdop_other": 13551599,
"linger_active": 0,
"linger_send": 0,
"linger_resend": 0,
"linger_ping": 0,
"poolop_active": 0,
"poolop_send": 0,
"poolop_resend": 0,
"poolstat_active": 0,
"poolstat_send": 0,
"poolstat_resend": 0,
"statfs_active": 0,
"statfs_send": 0,
"statfs_resend": 0,
"command_active": 0,
"command_send": 0,
"command_resend": 0,
"map_epoch": 3907,
"map_full": 0,
"map_inc": 601,
"osd_sessions": 18,
"osd_session_open": 20,
"osd_session_close": 2,
"osd_laggy": 0,
"omap_wr": 3595801,
"omap_rd": 232070972,
"omap_del": 272598
},
"purge_queue": {
"pq_executing_ops": 0,
"pq_executing": 0,
"pq_executed": 1659514
},
"throttle-msgr_dispatch_throttler-mds": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 2541455703,
"get_sum": 17148691767160,
"get_or_fail_fail": 0,
"get_or_fail_success": 2541455703,
"take": 0,
"take_sum": 0,
"put": 2541455703,
"put_sum": 17148691767160,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_bytes": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 197932421,
"take_sum": 606323353310,
"put": 182060027,
"put_sum": 606323353310,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_ops": {
"val": 0,
"max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 197932421,
"take_sum": 197932421,
"put": 197932421,
"put_sum": 197932421,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle": {
"val": 0,
"max": 3758096384,
"get_started": 0,
"get": 1659514,
"get_sum": 154334946,
"get_or_fail_fail": 0,
"get_or_fail_success": 1659514,
"take": 0,
"take_sum": 0,
"put": 79728,
"put_sum": 154334946,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle-0x55decea8e140": {
"val": 255839,
"max": 3758096384,
"get_started": 0,
"get": 357717092,
"get_sum": 596677113363,
"get_or_fail_fail": 0,
"get_or_fail_success": 357717092,
"take": 0,
"take_sum": 0,
"put": 59071693,
"put_sum": 596676857524,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
}
}



- Mail original -
De: "Webert de Souza Lima" <webert.b...@gmail.com>
À: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Lundi 14 Mai 2018 15:14:35
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Sat, May 12, 2018 at 3:11 AM Alexandre DERUMIER < [ 
mailto:aderum...@odiso.com | aderum...@odiso.com ] > wrote: 


The documentation (luminous) say: 




>mds cache size 
> 
>Description: The number of inodes to cache. A value of 0 indicates an 
>unlimited number. It is recommended to use mds_cache_memory_limit to limit the 
>amount of memory the MDS cache uses. 
>Type: 32-bit Integer 
>Default: 0 
> 
> and, my mds_cache_memory_limit is currently at 5GB. 

yeah I have only suggested that because the high memory usage seemed to trouble 
you and it might be a bug, so it's more of a workaround. 

Regards, 
Webert Lima 
DevOps Engineer at MAV Tecnologia 
Belo Horizonte - Brasil 
IRC NICK - WebertRLZ 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] a big cluster or several small

2018-05-16 Thread Alexandre DERUMIER
Hi,


>>Our main reason for using multiple clusters is that Ceph has a bad 
>>reliability history when scaling up and even now there are many issues 
>>unresolved (https://tracker.ceph.com/issues/21761 for example) so by 
>>dividing single, large cluster into few smaller ones, we reduce the impact 
>>for customers when things go fatally wrong - when one cluster goes down or 
>>it's performance is on single ESDI drive level due to recovery, other 
>>clusters - and their users - are unaffected. For us this already proved 
>>useful in the past.


we are also running multiple small clusters here (3 nodes, 18 OSDs, SSD or NVMe),
mainly VMs and RBD, so it's not a problem.

Mainly to avoid lag for all clients when an OSD goes down, for example, or to make 
upgrades easier.


We only have a bigger cluster for radosgw and object storage.


Alexandre


- Mail original -
De: "Piotr Dałek" 
À: "ceph-users" 
Envoyé: Mardi 15 Mai 2018 09:14:53
Objet: Re: [ceph-users] a big cluster or several small

On 18-05-14 06:49 PM, Marc Boisis wrote: 
> 
> Hi, 
> 
> Hello, 
> Currently we have a 294 OSD (21 hosts/3 racks) cluster with RBD clients 
> only, 1 single pool (size=3). 
> 
> We want to divide this cluster into several to minimize the risk in case of 
> failure/crash. 
> For example, a cluster for the mail, another for the file servers, a test 
> cluster ... 
> Do you think it's a good idea ? 

If reliability and data availability is your main concern, and you don't 
share data between clusters - yes. 

> Do you have experience feedback on multiple clusters in production on the 
> same hardware: 
> - containers (LXD or Docker) 
> - multiple cluster on the same host without virtualization (with ceph-deploy 
> ... --cluster ...) 
> - multilple pools 
> ... 
> 
> Do you have any advice? 

We're using containers to host OSDs, but we don't host multiple clusters on the 
same machine (in other words, a single physical machine hosts containers for 
one and the same cluster). We're using Ceph for RBD images, so having 
multiple clusters isn't a problem for us. 

Our main reason for using multiple clusters is that Ceph has a bad 
reliability history when scaling up, and even now there are many unresolved 
issues (https://tracker.ceph.com/issues/21761 for example), so by 
dividing a single large cluster into a few smaller ones we reduce the impact 
on customers when things go fatally wrong - when one cluster goes down, or 
its performance drops to single ESDI drive level due to recovery, the other 
clusters - and their users - are unaffected. For us this has already proved 
useful in the past. 

-- 
Piotr Dałek 
piotr.da...@corp.ovh.com 
https://www.ovhcloud.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on NVMe/SSD Ceph OSDs

2018-05-16 Thread Alexandre DERUMIER
Hi,

I'm able to get a fixed frequency with:

intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=1 

Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz

# cat /proc/cpuinfo |grep MHz
cpu MHz : 3400.002
cpu MHz : 3399.994
cpu MHz : 3399.995
cpu MHz : 3399.994
cpu MHz : 3399.997
cpu MHz : 3399.998
cpu MHz : 3399.992
cpu MHz : 3399.989
cpu MHz : 3399.998
cpu MHz : 3399.994
cpu MHz : 3399.988
cpu MHz : 3399.987
cpu MHz : 3399.990
cpu MHz : 3399.990
cpu MHz : 3399.994
cpu MHz : 3399.996
cpu MHz : 3399.996
cpu MHz : 3399.985
cpu MHz : 3399.991
cpu MHz : 3399.981
cpu MHz : 3399.979
cpu MHz : 3399.993
cpu MHz : 3399.985
cpu MHz : 3399.985


- Mail original -
De: "Wido den Hollander" 
À: "Blair Bethwaite" 
Cc: "ceph-users" , "Nick Fisk" 
Envoyé: Mercredi 16 Mai 2018 15:34:35
Objet: Re: [ceph-users] Intel Xeon Scalable and CPU frequency scaling on 
NVMe/SSD Ceph OSDs

On 05/16/2018 01:22 PM, Blair Bethwaite wrote: 
> On 15 May 2018 at 08:45, Wido den Hollander  > wrote: 
> 
> > We've got some Skylake Ubuntu based hypervisors that we can look at to 
> > compare tomorrow... 
> > 
> 
> Awesome! 
> 
> 
> Ok, so results still inconclusive I'm afraid... 
> 
> The Ubuntu machines we're looking at (Dell R740s and C6420s running with 
> Performance BIOS power profile, which amongst other things disables 
> cstates and enables turbo) are currently running either a 4.13 or a 4.15 
> HWE kernel - we needed 4.13 to support PERC10 and even get them booting 
> from local storage, then 4.15 to get around a prlimit bug that was 
> breaking Nova snapshots, so here we are. Where are you getting 4.16, 
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16/ ? 
> 

Yes, that's where I got 4.16: 4.16.7-041607-generic 

I also tried 4.16.8, but that didn't change anything either. 

Server I am testing with are these: 
https://www.supermicro.nl/products/system/1U/1029/SYS-1029U-TN10RT.cfm 

> So interestingly in our case we seem to have no cpufreq driver loaded. 
> After installing linux-generic-tools (cause cpupower is supposed to 
> supersede cpufrequtils I think?): 
> 
> rr42-03:~$ uname -a 
> Linux rcgpudc1rr42-03 4.15.0-13-generic #14~16.04.1-Ubuntu SMP Sat Mar 
> 17 03:04:59 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux 
> 
> rr42-03:~$ cat /proc/cmdline 
> BOOT_IMAGE=/vmlinuz-4.15.0-13-generic root=/dev/mapper/vg00-root ro 
> intel_iommu=on iommu=pt intel_idle.max_cstate=0 processor.max_cstate=1 
> 

I have those settings as well, intel_idle and processor.max_cstate. 

[ 1.776036] intel_idle: disabled 

That works, the CPUs stay in C0 or C1 according to i7z, but they are 
clocking down in Mhz, for example: 

processor : 23 
vendor_id : GenuineIntel 
cpu family : 6 
model : 85 
model name : Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz 
stepping : 4 
microcode : 0x243 
cpu MHz : 799.953 

/sys/devices/system/cpu/intel_pstate/min_perf_pct is set to 100, but 
that setting doesn't seem to do anything. 

I'm running out of ideas :-) 

Wido 

> rr42-03:~$ lscpu 
> Architecture: x86_64 
> CPU op-mode(s): 32-bit, 64-bit 
> Byte Order: Little Endian 
> CPU(s): 36 
> On-line CPU(s) list: 0-35 
> Thread(s) per core: 1 
> Core(s) per socket: 18 
> Socket(s): 2 
> NUMA node(s): 2 
> Vendor ID: GenuineIntel 
> CPU family: 6 
> Model: 85 
> Model name: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz 
> Stepping: 4 
> CPU MHz: 3400.956 
> BogoMIPS: 5401.45 
> Virtualization: VT-x 
> L1d cache: 32K 
> L1i cache: 32K 
> L2 cache: 1024K 
> L3 cache: 25344K 
> NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34 
> NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35 
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr 
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe 
> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts 
> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq 
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid 
> dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx 
> f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 
> invpcid_single pti intel_ppin mba tpr_shadow vnmi flexpriority ept vpid 
> fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx 
> rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd 
> avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc 
> cqm_mbm_total cqm_mbm_local ibpb ibrs stibp dtherm ida arat pln pts pku 
> ospke 
> 
> rr42-03:~$ sudo cpupower frequency-info 
> analyzing CPU 0: 
> no or unknown cpufreq driver is active on this CPU 
> CPUs which run at the same hardware frequency: Not Available 
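
For anyone hitting the same "no or unknown cpufreq driver" output: the active 
driver and governor can also be checked directly via sysfs (standard kernel paths; 
they only exist when a cpufreq driver is actually loaded): 

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver 
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 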
> 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-12 Thread Alexandre DERUMIER
my cache is correctly capped at 5G currently


here are some stats (the mds has been restarted yesterday, it is using around 8.8 GB,
and the cache is capped at 5G).

I'll try to send some stats in 1 or 2 weeks, when the memory should be at 20 GB


# while sleep 1; do ceph daemon mds.ceph4-2.odiso.net perf dump | jq 
'.mds_mem.rss'; ceph daemon mds.ceph4-2.odiso.net dump_mempools | jq -c 
'.mds_co'; done
8821728
{"items":44512173,"bytes":5346723108}
8821728
{"items":44647862,"bytes":5356139145}
8821728
{"items":43644205,"bytes":5129276043}
8821728
{"items":44134481,"bytes":5260485627}
8821728
{"items":44418491,"bytes":5338308734}
8821728
{"items":45091444,"bytes":5404019118}
8821728
{"items":44714180,"bytes":5322182878}
8821728
{"items":43853828,"bytes":5221597919}
8821728
{"items":44518074,"bytes":5323670444}
8821728
{"items":44679829,"bytes":5367219523}
8821728
{"items":44809929,"bytes":5382383166}
8821728
{"items":43441538,"bytes":5180408997}
8821728
{"items":44239001,"bytes":5349655543}
8821728
{"items":44558135,"bytes":5414566237}
8821728
{"items":44664773,"bytes":5433279976}
8821728
{"items":43433859,"bytes":5148008705}
8821728
{"items":43683053,"bytes":5236668693}
8821728
{"items":44248833,"bytes":5310420155}
8821728
{"items":45013698,"bytes":5381693077}
8821728
{"items":44928825,"bytes":5313048602}
8821728
{"items":43828630,"bytes":5146482155}
8821728
{"items":44005515,"bytes":5167930294}
8821728
{"items":44412223,"bytes":5182643376}
8821728
{"items":44842966,"bytes":5198073066}

- Mail original -
De: "aderumier" <aderum...@odiso.com>
À: "Webert de Souza Lima" <webert.b...@gmail.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Samedi 12 Mai 2018 08:11:04
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

Hi 
>>You could use "mds_cache_size" to limit number of CAPS untill you have this 
>>fixed, but I'd say for your number of caps and inodes, 20GB is normal. 

The documentation (luminous) say: 

" 
mds cache size 

Description: The number of inodes to cache. A value of 0 indicates an unlimited 
number. It is recommended to use mds_cache_memory_limit to limit the amount of 
memory the MDS cache uses. 
Type: 32-bit Integer 
Default: 0 
" 

and, my mds_cache_memory_limit is currently at 5GB. 





- Mail original - 
De: "Webert de Souza Lima" <webert.b...@gmail.com> 
À: "ceph-users" <ceph-users@lists.ceph.com> 
Envoyé: Vendredi 11 Mai 2018 20:18:27 
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 

You could use "mds_cache_size" to limit number of CAPS untill you have this 
fixed, but I'd say for your number of caps and inodes, 20GB is normal. 
this mds (jewel) here is consuming 24GB RAM: 

{ 
"mds": { 
"request": 7194867047, 
"reply": 7194866688, 
"reply_latency": { 
"avgcount": 7194866688, 
"sum": 27779142.611775008 
}, 
"forward": 0, 
"dir_fetch": 179223482, 
"dir_commit": 1529387896, 
"dir_split": 0, 
"inode_max": 300, 
"inodes": 3001264, 
"inodes_top": 160517, 
"inodes_bottom": 226577, 
"inodes_pin_tail": 2614170, 
"inodes_pinned": 2770689, 
"inodes_expired": 2920014835, 
"inodes_with_caps": 2743194, 
"caps": 2803568, 
"subtrees": 2, 
"traverse": 8255083028, 
"traverse_hit": 7452972311, 
"traverse_forward": 0, 
"traverse_discover": 0, 
"traverse_dir_fetch": 180547123, 
"traverse_remote_ino": 122257, 
"traverse_lock": 5957156, 
"load_cent": 18446743934203149911, 
"q": 54, 
"exported": 0, 
"exported_inodes": 0, 
"imported": 0, 
"imported_inodes": 0 
} 
} 


Regards, 
Webert Lima 
DevOps Engineer at MAV Tecnologia 
Belo Horizonte - Brasil 
IRC NICK - WebertRLZ 


On Fri, May 11, 2018 at 3:13 PM Alexandre DERUMIER < [ 
mailto:aderum...@odiso.com | aderum...@odiso.com ] > wrote: 


Hi, 

I'm still seeing memory leak with 12.2.5. 

seem to leak some MB each 5 minutes. 

I'll try to resent some stats next weekend. 


- Mail original - 
De: "Patrick Donnelly" < [ mailto:pdonn...@redhat.com | pdonn...@redhat.com ] > 
À: "Brady Deetz" < [ mailto:bde...@gmail.com | bde...@gmail.com ] > 
Cc: &q

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-12 Thread Alexandre DERUMIER
Hi
>>You could use "mds_cache_size" to limit number of CAPS untill you have this 
>>fixed, but I'd say for your number of caps and inodes, 20GB is normal. 

The documentation (luminous) says:

"
mds cache size

Description: The number of inodes to cache. A value of 0 indicates an 
unlimited number. It is recommended to use mds_cache_memory_limit to limit the 
amount of memory the MDS cache uses.
Type:    32-bit Integer
Default: 0
"

and, my mds_cache_memory_limit is currently at 5GB.
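
For reference, that limit is simply set in the [mds] section of ceph.conf, in bytes 
(a minimal sketch, matching the value used here): 

[mds] 
mds_cache_memory_limit = 5368709120 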





- Mail original -
De: "Webert de Souza Lima" <webert.b...@gmail.com>
À: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Vendredi 11 Mai 2018 20:18:27
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

You could use "mds_cache_size" to limit number of CAPS untill you have this 
fixed, but I'd say for your number of caps and inodes, 20GB is normal. 
this mds (jewel) here is consuming 24GB RAM: 

{ 
"mds": { 
"request": 7194867047, 
"reply": 7194866688, 
"reply_latency": { 
"avgcount": 7194866688, 
"sum": 27779142.611775008 
}, 
"forward": 0, 
"dir_fetch": 179223482, 
"dir_commit": 1529387896, 
"dir_split": 0, 
"inode_max": 300, 
"inodes": 3001264, 
"inodes_top": 160517, 
"inodes_bottom": 226577, 
"inodes_pin_tail": 2614170, 
"inodes_pinned": 2770689, 
"inodes_expired": 2920014835, 
"inodes_with_caps": 2743194, 
"caps": 2803568, 
"subtrees": 2, 
"traverse": 8255083028, 
"traverse_hit": 7452972311, 
"traverse_forward": 0, 
"traverse_discover": 0, 
"traverse_dir_fetch": 180547123, 
"traverse_remote_ino": 122257, 
"traverse_lock": 5957156, 
"load_cent": 18446743934203149911, 
"q": 54, 
"exported": 0, 
"exported_inodes": 0, 
"imported": 0, 
"imported_inodes": 0 
} 
} 


Regards, 
Webert Lima 
DevOps Engineer at MAV Tecnologia 
Belo Horizonte - Brasil 
IRC NICK - WebertRLZ 


On Fri, May 11, 2018 at 3:13 PM Alexandre DERUMIER < [ 
mailto:aderum...@odiso.com | aderum...@odiso.com ] > wrote: 


Hi, 

I'm still seeing memory leak with 12.2.5. 

seem to leak some MB each 5 minutes. 

I'll try to resent some stats next weekend. 


- Mail original - 
De: "Patrick Donnelly" < [ mailto:pdonn...@redhat.com | pdonn...@redhat.com ] > 
À: "Brady Deetz" < [ mailto:bde...@gmail.com | bde...@gmail.com ] > 
Cc: "Alexandre Derumier" < [ mailto:aderum...@odiso.com | aderum...@odiso.com ] 
>, "ceph-users" < [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] > 
Envoyé: Jeudi 10 Mai 2018 21:11:19 
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ? 

On Thu, May 10, 2018 at 12:00 PM, Brady Deetz < [ mailto:bde...@gmail.com | 
bde...@gmail.com ] > wrote: 
> [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds 
> ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32 
> /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph 
> 
> 
> [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status 
> { 
> "pool": { 
> "items": 173261056, 
> "bytes": 76504108600 
> } 
> } 
> 
> So, 80GB is my configured limit for the cache and it appears the mds is 
> following that limit. But, the mds process is using over 100GB RAM in my 
> 128GB host. I thought I was playing it safe by configuring at 80. What other 
> things consume a lot of RAM for this process? 
> 
> Let me know if I need to create a new thread. 

The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade 
ASAP. 

[1] [ https://tracker.ceph.com/issues/22972 | 
https://tracker.ceph.com/issues/22972 ] 

-- 
Patrick Donnelly 

___ 
ceph-users mailing list 
[ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] 
[ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 




___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-11 Thread Alexandre DERUMIER
Hi,

I'm still seeing the memory leak with 12.2.5.

It seems to leak a few MB every 5 minutes.

I'll try to resend some stats next weekend.


- Mail original -
De: "Patrick Donnelly" <pdonn...@redhat.com>
À: "Brady Deetz" <bde...@gmail.com>
Cc: "Alexandre Derumier" <aderum...@odiso.com>, "ceph-users" 
<ceph-users@lists.ceph.com>
Envoyé: Jeudi 10 Mai 2018 21:11:19
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Thu, May 10, 2018 at 12:00 PM, Brady Deetz <bde...@gmail.com> wrote: 
> [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds 
> ceph 1841 3.5 94.3 133703308 124425384 ? Ssl Apr04 1808:32 
> /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph 
> 
> 
> [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status 
> { 
> "pool": { 
> "items": 173261056, 
> "bytes": 76504108600 
> } 
> } 
> 
> So, 80GB is my configured limit for the cache and it appears the mds is 
> following that limit. But, the mds process is using over 100GB RAM in my 
> 128GB host. I thought I was playing it safe by configuring at 80. What other 
> things consume a lot of RAM for this process? 
> 
> Let me know if I need to create a new thread. 

The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade 
ASAP. 

[1] https://tracker.ceph.com/issues/22972 

-- 
Patrick Donnelly 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-04-18 Thread Alexandre DERUMIER
oolop_active": 0,
"poolop_send": 0,
"poolop_resend": 0,
"poolstat_active": 0,
"poolstat_send": 0,
"poolstat_resend": 0,
"statfs_active": 0,
"statfs_send": 0,
"statfs_resend": 0,
"command_active": 0,
"command_send": 0,
"command_resend": 0,
"map_epoch": 3121,
"map_full": 0,
"map_inc": 76,
"osd_sessions": 18,
"osd_session_open": 20,
"osd_session_close": 2,
"osd_laggy": 0,
"omap_wr": 2227270,
"omap_rd": 65197068,
"omap_del": 48058
},
"purge_queue": {
"pq_executing_ops": 0,
"pq_executing": 0,
"pq_executed": 619458
},
"throttle-msgr_dispatch_throttler-mds": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 831356927,
"get_sum": 4299208168815,
"get_or_fail_fail": 0,
"get_or_fail_success": 831356927,
"take": 0,
"take_sum": 0,
"put": 831356927,
"put_sum": 4299208168815,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_bytes": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 60152757,
"take_sum": 189890861007,
"put": 54571445,
"put_sum": 189890861007,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_ops": {
"val": 0,
"max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 60152757,
"take_sum": 60152757,
"put": 60152757,
"put_sum": 60152757,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle": {
"val": 0,
"max": 3758096384,
"get_started": 0,
"get": 619458,
"get_sum": 57609986,
"get_or_fail_fail": 0,
"get_or_fail_success": 619458,
"take": 0,
"take_sum": 0,
"put": 27833,
"put_sum": 57609986,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle-0x559471d00140": {
"val": 105525,
"max": 3758096384,
"get_started": 0,
"get": 108025412,
"get_sum": 185715179864,
"get_or_fail_fail": 0,
"get_or_fail_success": 108025412,
"take": 0,
"take_sum": 0,
"put": 19597987,
"put_sum": 185715074339,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
}
}

- Mail original -
De: "Zheng Yan" <uker...@gmail.com>
À: "aderumier" <aderum...@odiso.com>
Cc: "Patrick Donnelly" <pdonn...@redhat.com>, "ceph-users" 
<ceph-users@lists.ceph.com>
Envoyé: Mardi 17 Avril 2018 05:20:18
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

On Sat, Apr 14, 2018 at 9:23 PM, Alexandre DERUMIER <aderum...@odiso.com> 
wrote: 
> Hi, 
> 
> Still leaking again after update to 12.2.4, around 17G after 9 days 
> 
> 
> 
> 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> ceph 629903 50.7 25.9 17473680 17082432 ? Ssl avril05 6498:21 
> /usr/bin/ceph-mds -f --cluster ceph --id ceph4-1.odiso.net --setuser ceph 
> --setgroup ceph 
> 
> 
> 
> 
> 
> ~# ceph daemon mds.ceph4-1.odiso.net cache status 
> { 
> "pool

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-04-14 Thread Alexandre DERUMIER
"map_inc": 160,
"osd_sessions": 18,
"osd_session_open": 20,
"osd_session_close": 2,
"osd_laggy": 0,
"omap_wr": 9743114,
"omap_rd": 191911089,
"omap_del": 684272
},
"purge_queue": {
"pq_executing_ops": 0,
"pq_executing": 0,
"pq_executed": 2316671
},
"throttle-msgr_dispatch_throttler-mds": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 1884071270,
"get_sum": 12697353890803,
"get_or_fail_fail": 0,
"get_or_fail_success": 1884071270,
"take": 0,
"take_sum": 0,
"put": 1884071270,
"put_sum": 12697353890803,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_bytes": {
"val": 0,
"max": 104857600,
"get_started": 0,
"get": 0,
    "get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 197270390,
"take_sum": 796529593788,
"put": 183928495,
"put_sum": 796529593788,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-objecter_ops": {
"val": 0,
"max": 1024,
"get_started": 0,
"get": 0,
"get_sum": 0,
"get_or_fail_fail": 0,
"get_or_fail_success": 0,
"take": 197270390,
"take_sum": 197270390,
"put": 197270390,
"put_sum": 197270390,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle": {
"val": 0,
"max": 3758096384,
"get_started": 0,
"get": 2316671,
"get_sum": 215451035,
"get_or_fail_fail": 0,
"get_or_fail_success": 2316671,
"take": 0,
"take_sum": 0,
"put": 31223,
"put_sum": 215451035,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
},
"throttle-write_buf_throttle-0x563c33bea220": {
"val": 29763,
"max": 3758096384,
"get_started": 0,
"get": 293928039,
"get_sum": 765120443785,
"get_or_fail_fail": 0,
"get_or_fail_success": 293928039,
"take": 0,
"take_sum": 0,
"put": 62629276,
"put_sum": 765120414022,
"wait": {
"avgcount": 0,
"sum": 0.0,
"avgtime": 0.0
}
}
}



# ceph status
  cluster:
id: e22b8e83-3036-4fe5-8fd5-5ce9d539beca
health: HEALTH_OK
 
  services:
mon: 3 daemons, quorum ceph4-1,ceph4-2,ceph4-3
mgr: ceph4-2.odiso.net(active), standbys: ceph4-3.odiso.net, 
ceph4-1.odiso.net
mds: cephfs4-1/1/1 up  {0=ceph4-1.odiso.net=up:active}, 2 up:standby
osd: 18 osds: 18 up, 18 in
 
  data:
pools:   11 pools, 1992 pgs
objects: 72258k objects, 5918 GB
usage:   20088 GB used, 6737 GB / 26825 GB avail
pgs: 1992 active+clean
 
  io:
client:   3099 kB/s rd, 6412 kB/s wr, 108 op/s rd, 481 op/s wr


- Mail original -
De: "Patrick Donnelly" <pdonn...@redhat.com>
À: "aderumier" <aderum...@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Mardi 27 Mars 2018 20:35:08
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

Hello Alexandre, 

On Thu, Mar 22, 2018 at 2:29 AM, Alexandre DERUMIER <aderum...@odiso.com> 
wrote: 
> Hi, 
> 
> I'm running cephfs since 2 months now, 
> 
> and my active msd memory usage is around 20G now (still growing). 
> 
> ceph 1521539 10.8 31.2 20929836 20534868 ? Ssl janv.26 8573:34 
> /usr/bin/ceph-mds -f --cluster ceph --id 2 --setuser ceph --setgroup ceph 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> 
> this is on luminous 12.2.2 
> 
> only tuning done is: 
> 
> mds_cache_memory_limit = 5368709120 
> 
> 
> (5GB). I known it's a soft limit, but 20G seem quite huge vs 5GB  
> 
> 
> Is it normal ? 

No, that's definitely not normal! 


> # ceph daemon mds.2 perf dump mds 
> { 
> "mds": { 
> "request": 1444009197, 
> "reply": 1443999870, 
> "reply_latency": { 
> "avgcount": 1443999870, 
> "sum": 1657849.656122933, 
> "avgtime": 0.001148095 
> }, 
> "forward": 0, 
> "dir_fetch": 51740910, 
> "dir_commit": 9069568, 
> "dir_split": 64367, 
> "dir_merge": 58016, 
> "inode_max": 2147483647, 
> "inodes": 2042975, 
> "inodes_top": 152783, 
> "inodes_bottom": 138781, 
> "inodes_pin_tail": 1751411, 
> "inodes_pinned": 1824714, 
> "inodes_expired": 7258145573, 
> "inodes_with_caps": 1812018, 
> "caps": 2538233, 
> "subtrees": 2, 
> "traverse": 1591668547, 
> "traverse_hit": 1259482170, 
> "traverse_forward": 0, 
> "traverse_discover": 0, 
> "traverse_dir_fetch": 30827836, 
> "traverse_remote_ino": 7510, 
> "traverse_lock": 86236, 
> "load_cent": 144401980319, 
> "q": 49, 
> "exported": 0, 
> "exported_inodes": 0, 
> "imported": 0, 
> "imported_inodes": 0 
> } 
> } 

Can you also share `ceph daemon mds.2 cache status`, the full `ceph 
daemon mds.2 perf dump`, and `ceph status`? 

Note [1] will be in 12.2.5 and may help with your issue. 

[1] https://github.com/ceph/ceph/pull/20527 

-- 
Patrick Donnelly 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-03-28 Thread Alexandre DERUMIER
>>Can you also share `ceph daemon mds.2 cache status`, the full `ceph 
>>daemon mds.2 perf dump`, and `ceph status`? 

Sorry, too late, I needed to restart the mds daemon because I was out of memory 
:(

It seems stable for now (around 500 MB).

Not sure if it was related, but I had a ganesha-nfs -> cephfs daemon running on 
this cluster (but no client connected to it).


>>Note [1] will be in 12.2.5 and may help with your issue. 
>>[1] https://github.com/ceph/ceph/pull/20527 

ok thanks !



- Mail original -
De: "Patrick Donnelly" <pdonn...@redhat.com>
À: "Alexandre Derumier" <aderum...@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Mardi 27 Mars 2018 20:35:08
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

Hello Alexandre, 

On Thu, Mar 22, 2018 at 2:29 AM, Alexandre DERUMIER <aderum...@odiso.com> 
wrote: 
> Hi, 
> 
> I'm running cephfs since 2 months now, 
> 
> and my active msd memory usage is around 20G now (still growing). 
> 
> ceph 1521539 10.8 31.2 20929836 20534868 ? Ssl janv.26 8573:34 
> /usr/bin/ceph-mds -f --cluster ceph --id 2 --setuser ceph --setgroup ceph 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> 
> this is on luminous 12.2.2 
> 
> only tuning done is: 
> 
> mds_cache_memory_limit = 5368709120 
> 
> 
> (5GB). I known it's a soft limit, but 20G seem quite huge vs 5GB  
> 
> 
> Is it normal ? 

No, that's definitely not normal! 


> # ceph daemon mds.2 perf dump mds 
> { 
> "mds": { 
> "request": 1444009197, 
> "reply": 1443999870, 
> "reply_latency": { 
> "avgcount": 1443999870, 
> "sum": 1657849.656122933, 
> "avgtime": 0.001148095 
> }, 
> "forward": 0, 
> "dir_fetch": 51740910, 
> "dir_commit": 9069568, 
> "dir_split": 64367, 
> "dir_merge": 58016, 
> "inode_max": 2147483647, 
> "inodes": 2042975, 
> "inodes_top": 152783, 
> "inodes_bottom": 138781, 
> "inodes_pin_tail": 1751411, 
> "inodes_pinned": 1824714, 
> "inodes_expired": 7258145573, 
> "inodes_with_caps": 1812018, 
> "caps": 2538233, 
> "subtrees": 2, 
> "traverse": 1591668547, 
> "traverse_hit": 1259482170, 
> "traverse_forward": 0, 
> "traverse_discover": 0, 
> "traverse_dir_fetch": 30827836, 
> "traverse_remote_ino": 7510, 
> "traverse_lock": 86236, 
> "load_cent": 144401980319, 
> "q": 49, 
> "exported": 0, 
> "exported_inodes": 0, 
> "imported": 0, 
> "imported_inodes": 0 
> } 
> } 

Can you also share `ceph daemon mds.2 cache status`, the full `ceph 
daemon mds.2 perf dump`, and `ceph status`? 

Note [1] will be in 12.2.5 and may help with your issue. 

[1] https://github.com/ceph/ceph/pull/20527 

-- 
Patrick Donnelly 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous and jemalloc

2018-03-23 Thread Alexandre DERUMIER
Hi,

I think it's no longer a problem since the async messenger became the default.
The difference between jemalloc and tcmalloc is minimal now.
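
For reference, the allocator choice and the tcmalloc thread cache are typically set through the ceph environment file. A rough sketch, assuming a Debian-style layout (/etc/default/ceph; on RPM systems it is /etc/sysconfig/ceph) and an illustrative cache value; daemons pick it up on restart:

# /etc/default/ceph
# larger tcmalloc thread cache (bytes, here 128 MB)
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
# or preload jemalloc instead of tcmalloc (library path is distribution-specific)
#LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1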

Regards,

Alexandre

- Mail original -
De: "Xavier Trilla" 
À: "ceph-users" 
Cc: "Arnau Marcé" 
Envoyé: Vendredi 23 Mars 2018 13:34:03
Objet: [ceph-users] Luminous and jemalloc



Hi, 



Does anybody have information about using jemalloc with Luminous? From what I've
seen on the mailing list and online, BlueStore crashes when using jemalloc.

We've been running Ceph with jemalloc since Hammer, as performance with
tcmalloc was terrible (we run a quite big all-SSD cluster) and jemalloc was a
game changer (CPU usage and latency were extremely reduced when using
jemalloc).

But it looks like Ceph with a recent tcmalloc library and a high thread cache
works pretty well; do you have experience with that? Is jemalloc still
justified, or does it no longer make sense?



Thanks for your comments! 

Xavier. 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-03-23 Thread Alexandre DERUMIER
Hi,

>>Did the fs have lots of mount/umount? 

Not too much; I have around 300 ceph-fuse clients (12.2.2 and 12.2.4) and the ceph
cluster is 12.2.2.
Maybe when a client reboots, but that doesn't happen very often.


>> We recently found a memory leak
>>bug in that area https://github.com/ceph/ceph/pull/20148

OK, thanks. Do sessions only occur at mount/umount?



I have another cluster, with 64 fuse clients, where mds memory is around 500 MB
(with the default mds_cache_memory_limit, no tuning, and a 12.2.4 ceph cluster
instead of 12.2.2).

Clients are also ceph-fuse 12.2.2 && 12.2.4



I'll try to upgrade this buggy mds to 12.2.4 to see if it helps.

- Mail original -
De: "Zheng Yan" <uker...@gmail.com>
À: "aderumier" <aderum...@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Vendredi 23 Mars 2018 01:08:46
Objet: Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

Did the fs have lots of mount/umount? We recently found a memory leak 
bug in that area https://github.com/ceph/ceph/pull/20148 

Regards 
Yan, Zheng 

On Thu, Mar 22, 2018 at 5:29 PM, Alexandre DERUMIER <aderum...@odiso.com> 
wrote: 
> Hi, 
> 
> I'm running cephfs since 2 months now, 
> 
> and my active msd memory usage is around 20G now (still growing). 
> 
> ceph 1521539 10.8 31.2 20929836 20534868 ? Ssl janv.26 8573:34 
> /usr/bin/ceph-mds -f --cluster ceph --id 2 --setuser ceph --setgroup ceph 
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND 
> 
> 
> this is on luminous 12.2.2 
> 
> only tuning done is: 
> 
> mds_cache_memory_limit = 5368709120 
> 
> 
> (5GB). I known it's a soft limit, but 20G seem quite huge vs 5GB  
> 
> 
> Is it normal ? 
> 
> 
> 
> 
> # ceph daemon mds.2 perf dump mds 
> { 
> "mds": { 
> "request": 1444009197, 
> "reply": 1443999870, 
> "reply_latency": { 
> "avgcount": 1443999870, 
> "sum": 1657849.656122933, 
> "avgtime": 0.001148095 
> }, 
> "forward": 0, 
> "dir_fetch": 51740910, 
> "dir_commit": 9069568, 
> "dir_split": 64367, 
> "dir_merge": 58016, 
> "inode_max": 2147483647, 
> "inodes": 2042975, 
> "inodes_top": 152783, 
> "inodes_bottom": 138781, 
> "inodes_pin_tail": 1751411, 
> "inodes_pinned": 1824714, 
> "inodes_expired": 7258145573, 
> "inodes_with_caps": 1812018, 
> "caps": 2538233, 
> "subtrees": 2, 
> "traverse": 1591668547, 
> "traverse_hit": 1259482170, 
> "traverse_forward": 0, 
> "traverse_discover": 0, 
> "traverse_dir_fetch": 30827836, 
> "traverse_remote_ino": 7510, 
> "traverse_lock": 86236, 
> "load_cent": 144401980319, 
> "q": 49, 
> "exported": 0, 
> "exported_inodes": 0, 
> "imported": 0, 
> "imported_inodes": 0 
> } 
> } 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-03-22 Thread Alexandre DERUMIER
Hi,

I've been running cephfs for 2 months now,

and my active mds memory usage is around 20G now (still growing).

ceph 1521539 10.8 31.2 20929836 20534868 ?   Ssl  janv.26 8573:34 
/usr/bin/ceph-mds -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND


this is on luminous 12.2.2

only tuning done is:

mds_cache_memory_limit = 5368709120


(5GB). I know it's a soft limit, but 20G seems quite huge vs 5GB.


Is it normal ?
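
For reference, the tuning above corresponds to a ceph.conf entry like the one below, and the cache usage the MDS thinks it has can be read from the admin socket. A sketch only; the daemon id (2) and the 5 GB value are the ones from this message:

[mds]
mds_cache_memory_limit = 5368709120   # 5 GB, soft limit on MDS cache memory

# on the node running the active MDS:
ceph daemon mds.2 cache status
ceph daemon mds.2 config get mds_cache_memory_limit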




# ceph daemon mds.2 perf dump mds
{
"mds": {
"request": 1444009197,
"reply": 1443999870,
"reply_latency": {
"avgcount": 1443999870,
"sum": 1657849.656122933,
"avgtime": 0.001148095
},
"forward": 0,
"dir_fetch": 51740910,
"dir_commit": 9069568,
"dir_split": 64367,
"dir_merge": 58016,
"inode_max": 2147483647,
"inodes": 2042975,
"inodes_top": 152783,
"inodes_bottom": 138781,
"inodes_pin_tail": 1751411,
"inodes_pinned": 1824714,
"inodes_expired": 7258145573,
"inodes_with_caps": 1812018,
"caps": 2538233,
"subtrees": 2,
"traverse": 1591668547,
"traverse_hit": 1259482170,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 30827836,
"traverse_remote_ino": 7510,
"traverse_lock": 86236,
"load_cent": 144401980319,
"q": 49,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
}
}
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in Ceph OSD?

2018-03-06 Thread Alexandre DERUMIER
Hi,

I'm also seeing a slow memory increase over time with my bluestore NVMe OSDs
(3.2 TB each), with default ceph.conf settings (ceph 12.2.2).

Each OSD starts at around 5 GB of memory and goes up to 8 GB.

Currently I'm restarting them about once a month to free memory.


Here is a dump of osd.0 after 1 week running:

ceph 2894538  3.9  9.9 7358564 6553080 ? Ssl  mars01 303:03 
/usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph


root@ceph4-1:~#  ceph daemon osd.0 dump_mempools 
{
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 84070208,
"bytes": 84070208
},
"bluestore_cache_data": {
"items": 168,
"bytes": 2908160
},
"bluestore_cache_onode": {
"items": 947820,
"bytes": 636935040
},
"bluestore_cache_other": {
"items": 101250372,
"bytes": 2043476720
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 8,
"bytes": 5760
},
"bluestore_writing_deferred": {
"items": 85,
"bytes": 1203200
},
"bluestore_writing": {
"items": 7,
"bytes": 569584
},
"bluefs": {
"items": 1774,
"bytes": 106360
},
"buffer_anon": {
"items": 68307,
"bytes": 17188636
},
"buffer_meta": {
"items": 284,
"bytes": 24992
},
"osd": {
"items": 333,
"bytes": 4017312
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 1195884,
"bytes": 298139520
},
"osdmap": {
"items": 4542,
"bytes": 384464
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
},
"total": {
"items": 187539792,
"bytes": 3089029956
}
}



another osd after 1 month:


USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
ceph 1718009  2.5 11.7 8542012 7725992 ? Ssl   2017 2463:28 
/usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph

root@ceph4-1:~# ceph daemon osd.5 dump_mempools 
{
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 98449088,
"bytes": 98449088
},
"bluestore_cache_data": {
"items": 759,
"bytes": 17276928
},
"bluestore_cache_onode": {
"items": 884140,
"bytes": 594142080
},
"bluestore_cache_other": {
"items": 116375567,
"bytes": 2072801299
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 6,
"bytes": 4320
},
"bluestore_writing_deferred": {
"items": 99,
"bytes": 1190045
},
"bluestore_writing": {
"items": 11,
"bytes": 4510159
},
"bluefs": {
"items": 1202,
"bytes": 64136
},
"buffer_anon": {
"items": 76863,
"bytes": 21327234
},
"buffer_meta": {
"items": 910,
"bytes": 80080
},
"osd": {
"items": 328,
"bytes": 3956992
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 1118050,
"bytes": 286277600
},
"osdmap": {
"items": 6073,
"bytes": 551872
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
},
"total": {
"items": 216913096,
"bytes": 3100631833
}
}
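
A small sketch of the comparison being made here, i.e. each OSD's resident memory versus what its mempools account for. The pgrep pattern matches the ps output shown above; python is only used to pull the total out of the JSON:

osd=5
pid=$(pgrep -f -- "ceph-osd -f --cluster ceph --id $osd ")
rss_kb=$(awk '/VmRSS/ {print $2}' /proc/$pid/status)
pool_bytes=$(ceph daemon osd.$osd dump_mempools \
    | python -c 'import json,sys; print(json.load(sys.stdin)["total"]["bytes"])')
echo "osd.$osd rss=${rss_kb} kB mempools=${pool_bytes} bytes"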

- Mail original -
De: "Kjetil Joergensen" 
À: "ceph-users" 
Envoyé: Mercredi 7 Mars 2018 01:07:06
Objet: Re: [ceph-users] Memory leak in Ceph OSD?

Hi, 
addendum: We're running 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b). 

The workload is a mix of 3xreplicated & ec-coded (rbd, cephfs, rgw). 

-KJ 

On Tue, Mar 6, 2018 at 3:53 PM, Kjetil Joergensen < [ 
mailto:kje...@medallia.com | kje...@medallia.com ] > wrote: 



Hi, 
so.. +1 

We don't run compression as far as I know, so that wouldn't be it. We do 
actually run a mix of bluestore & filestore - due to the rest of the cluster 
predating a stable bluestore by some amount. 

The interesting part is - the behavior seems to be specific to our bluestore 
nodes. 

[Graph not reproduced in the archive. Caption: yellow line = node with 10 x ~4TB SSDs, green line = node with 8 x 800GB SSDs, blue line = dump_mempools total bytes for all running OSDs.]
Re: [ceph-users] Migrating to new pools

2018-02-21 Thread Alexandre DERUMIER
Hi,

If you use qemu, it's also possible to use the drive-mirror feature from qemu
(it can mirror and migrate from one storage to another without downtime).

I don't know if OpenStack has implemented it, but it's working fine on Proxmox.


- Mail original -
De: "Anthony D'Atri" 
À: "ceph-users" 
Envoyé: Jeudi 22 Février 2018 01:27:23
Objet: Re: [ceph-users] Migrating to new pools

>> I was thinking we might be able to configure/hack rbd mirroring to mirror to 
>> a pool on the same cluster but I gather from the OP and your post that this 
>> is not really possible? 
> 
> No, it's not really possible currently and we have no plans to add 
> such support since it would not be of any long-term value. 

The long-term value would be the ability to migrate volumes from, say, a 
replicated pool to an an EC pool without extended downtime. 





___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does anyone else still experiancing memory issues with 12.2.2 and Bluestore?

2018-02-10 Thread Alexandre DERUMIER
Hi,

I still have my OSD memory growing slowly.

Default config, with SSD OSDs:

they start around 5 GB, and after 1-2 months are near 8 GB.

(Maybe related to fragmentation?)
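
(As a hedged aside, one thing worth checking against the numbers below is what cache size the OSD is actually running with; the option names are the Luminous ones:

ceph daemon osd.5 config get bluestore_cache_size
ceph daemon osd.5 config get bluestore_cache_size_ssd
)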



USER PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
ceph 1718009  2.5 11.7 8542012 7725992 ? Ssl   2017 2463:28 
/usr/bin/ceph-osd -f --cluster ceph --id 5 --setuser ceph --setgroup ceph




root@ceph4-1:~# ceph daemon osd.5 dump_mempools 
{
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 98449088,
"bytes": 98449088
},
"bluestore_cache_data": {
"items": 759,
"bytes": 17276928
},
"bluestore_cache_onode": {
"items": 884140,
"bytes": 594142080
},
"bluestore_cache_other": {
"items": 116375567,
"bytes": 2072801299
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 6,
"bytes": 4320
},
"bluestore_writing_deferred": {
"items": 99,
"bytes": 1190045
},
"bluestore_writing": {
"items": 11,
"bytes": 4510159
},
"bluefs": {
"items": 1202,
"bytes": 64136
},
"buffer_anon": {
"items": 76863,
"bytes": 21327234
},
"buffer_meta": {
"items": 910,
"bytes": 80080
},
"osd": {
"items": 328,
"bytes": 3956992
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 1118050,
"bytes": 286277600
},
"osdmap": {
"items": 6073,
"bytes": 551872
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
},
"total": {
"items": 216913096,
"bytes": 3100631833
}
}


- Mail original -
De: "Tzachi Strul" 
À: "ceph-users" 
Envoyé: Samedi 10 Février 2018 11:44:40
Objet: [ceph-users] Does anyone else still experiancing memory issues with 
12.2.2 and Bluestore?

Hi, 
I know that 12.2.2 should have fixed all memory leak issues with bluestore, but
we are still experiencing some odd behavior.

Our OSDs flap once in a while... sometimes it doesn't stop until we restart all
osds on all nodes/on the same server...
In our syslog we see messages like "failed: Cannot allocate memory" on all
kinds of processes...

In addition, sometimes we get this error while trying to work with ceph 
commands: 
Traceback (most recent call last): 
File "/usr/bin/ceph", line 125, in  
import rados 
ImportError: libceph-common.so.0: cannot map zero-fill pages 

It seems like a memory leak issue... when we restart all OSDs this behavior stops
for a few hours/days.
We have 8 OSD servers with 16 SSD disks each and 64 GB of RAM; the bluestore
cache is set to the default (3 GB for SSD).

The result is that our cluster is almost constantly rebuilding, and that impacts
performance.

root@ecprdbcph10-opens:~# ceph daemon osd.1 dump_mempools 
{ 
"bloom_filter": { 
"items": 0, 
"bytes": 0 
}, 
"bluestore_alloc": { 
"items": 5105472, 
"bytes": 5105472 
}, 
"bluestore_cache_data": { 
"items": 68868, 
"bytes": 1934663680 
}, 
"bluestore_cache_onode": { 
"items": 152640, 
"bytes": 102574080 
}, 
"bluestore_cache_other": { 
"items": 16920009, 
"bytes": 371200513 
}, 
"bluestore_fsck": { 
"items": 0, 
"bytes": 0 
}, 
"bluestore_txc": { 
"items": 3, 
"bytes": 2160 
}, 
"bluestore_writing_deferred": { 
"items": 33, 
"bytes": 265015 
}, 
"bluestore_writing": { 
"items": 19, 
"bytes": 6403820 
}, 
"bluefs": { 
"items": 303, 
"bytes": 12760 
}, 
"buffer_anon": { 
"items": 32958, 
"bytes": 14087657 
}, 
"buffer_meta": { 
"items": 68996, 
"bytes": 6071648 
}, 
"osd": { 
"items": 187, 
"bytes": 2255968 
}, 
"osd_mapbl": { 
"items": 0, 
"bytes": 0 
}, 
"osd_pglog": { 
"items": 514238, 
"bytes": 152438172 
}, 
"osdmap": { 
"items": 35699, 
"bytes": 823040 
}, 
"osdmap_mapping": { 
"items": 0, 
"bytes": 0 
}, 
"pgmap": { 
"items": 0, 
"bytes": 0 
}, 
"mds_co": { 
"items": 0, 
"bytes": 0 
}, 
"unittest_1": { 
"items": 0, 
"bytes": 0 
}, 
"unittest_2": { 
"items": 0, 
"bytes": 0 
}, 
"total": { 
"items": 22899425, 
"bytes": 2595903985 
} 
} 


Any help would be appreciated. 
Thank you 


-- 


Tzachi Strul 

Storage DevOps // Kenshoo 

Office +972 73 2862-368 // Mobile +972 54 755 1308 

[ http://kenshoo.com/ ] 


Re: [ceph-users] Question about librbd with qemu-kvm

2018-01-02 Thread Alexandre DERUMIER
It's not possible to use multiple threads per disk in qemu currently (it's on
the qemu roadmap).

But you can create multiple disks/rbd images and use multiple qemu iothreads
(one per disk).


(BTW, I'm able to reach around 70k IOPS max with 4k reads, with a 3.1 GHz CPU,
rbd_cache=none, and debug and cephx disabled in ceph.conf.)
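
A minimal sketch of that layout on the qemu command line; image and pool names are placeholders, and the same thing can be expressed through libvirt with <iothreads> plus a per-disk iothread attribute:

qemu-system-x86_64 ... \
  -object iothread,id=io1 \
  -object iothread,id=io2 \
  -drive file=rbd:rbd/vm-disk-0,format=raw,if=none,id=drive0,cache=none \
  -device virtio-blk-pci,drive=drive0,iothread=io1 \
  -drive file=rbd:rbd/vm-disk-1,format=raw,if=none,id=drive1,cache=none \
  -device virtio-blk-pci,drive=drive1,iothread=io2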


- Mail original -
De: "冷镇宇" 
À: "ceph-users" 
Envoyé: Mardi 2 Janvier 2018 04:01:39
Objet: [ceph-users] Question about librbd with qemu-kvm



Hi all, 

I am using librbd from Ceph 10.2.0 with qemu-kvm. When the virtual machine booted,
I found that there is only one tp_librbd thread per rbd image, and the 4KB read
IOPS for one rbd image is only 20,000. I'm wondering if there are any librbd
configuration options in qemu that can add librbd threads for a single rbd
image. Can someone help me? Thank you very much.
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

2017-12-08 Thread Alexandre DERUMIER
Have you disabled the fuse page cache in your clients' ceph.conf?


[client]
fuse_disable_pagecache = true

- Mail original -
De: "Florent Bautista" 
À: "ceph-users" 
Envoyé: Vendredi 8 Décembre 2017 10:54:59
Objet: Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade

On 08/12/2017 10:44, Wido den Hollander wrote: 
> 
> 
> On 12/08/2017 10:27 AM, Florent B wrote: 
>> Hi everyone, 
>> 
>> A few days ago I upgraded a cluster from Kraken to Luminous. 
>> 
>> I have a few mail servers on it, running Ceph-Fuse & Dovecot. 
>> 
>> And since the day of upgrade, Dovecot is reporting some corrupted files 
>> on a large account : 
>> 
>> doveadm(myu...@mydomain.com): Error: Corrupted dbox file 
>> /mnt/maildata1/mydomain.com/myuser//mdbox/storage/m.5808 (around 
>> offset=79178): purging found mismatched offsets (79148 vs 72948, 
>> 13/1313) 
>> doveadm(myu...@mydomain.com): Warning: fscking index file 
>> /mnt/maildata1/mydomain.com/myuser//mdbox/storage/dovecot.map.index 
>> doveadm(myu...@mydomain.com): Warning: mdbox 
>> /mnt/maildata1/mydomain.com/myuser//mdbox/storage: rebuilding indexes 
>> doveadm(myu...@mydomain.com): Warning: Transaction log file 
>> /mnt/maildata1/mydomain.com/myuser//mdbox/storage/dovecot.map.index.log 
>> was locked for 1249 seconds (mdbox storage rebuild) 
>> doveadm(myu...@mydomain.com): Error: Purging namespace '' failed: 
>> Corrupted dbox file 
>> /mnt/maildata1/mydomain.com/myuser//mdbox/storage/m.5808 (around 
>> offset=79178): purging found mismatched offsets (79148 vs 72948, 
>> 13/1313) 
>> 
>> Even if Dovecot fixes this problem, every day new files are corrupted. 
>> 
>> I never had this problem before ! And Ceph status is reporting some "MDS 
>> slow requests" ! 
>> 
>> Do you have an idea ? 
>> 
> 
> Not really, but could you share a bit more information: 
> 
> - Which version if Luminous? 
> - Running with BlueStore or FileStore? 
> - Replication? 
> - Cache tiering? 
> - Which kernel version do the clients use? 
> 
> Wido 
> 

Luminous 12.2.1, upgraded to 12.2.2 yesterday, and still the same
problem today.

FileStore only (xfs). 

Replication is 3 copies for these mail files. 

No Cache Tiering. 

Kernel on clients is default Debian Jessie (3.16.43-2+deb8u5) but I'm 
using ceph-fuse, not kernel client. 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous vs jewel rbd performance

2017-09-21 Thread Alexandre DERUMIER
OK, thanks.

I'll try to do the same benchmarks in the coming week; I'll keep you posted with the results.


- Mail original -
De: "Rafael Lopez" <rafael.lo...@monash.edu>
À: "aderumier" <aderum...@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Mercredi 20 Septembre 2017 08:51:22
Objet: Re: [ceph-users] luminous vs jewel rbd performance

Hi Alexandre, 
Yeah we are using filestore for the moment with luminous. With regards to 
client, I tried both jewel and luminous librbd versions against the luminous 
cluster - similar results. 

I am running fio on a physical machine with fio rbd engine. This is a snippet 
of the fio config for the runs (the complete jobfile adds variations of 
read/write/block size/iodepth). 

[global] 
ioengine=rbd 
clientname=cinder-volume 
pool=rbd-bronze 
invalidate=1 
ramp_time=5 
runtime=30 
time_based 
direct=1 

[write-rbd1-4k-depth1] 
rbdname=rbd-tester-fio 
bs=4k 
iodepth=1 
rw=write 
stonewall 

[write-rbd2-4k-depth16] 
rbdname=rbd-tester-fio-2 
bs=4k 
iodepth=16 
rw=write 
stonewall 

Raf 

On 20 September 2017 at 16:43, Alexandre DERUMIER < [ 
mailto:aderum...@odiso.com | aderum...@odiso.com ] > wrote: 


Hi 

so, you use also filestore on luminous ? 

do you have also upgraded librbd on client ? (are you benching inside a qemu 
machine ? or directly with fio-rbd ?) 



(I'm going to do a lot of benchmarks in coming week, I'll post results on 
mailing soon.) 



- Mail original - 
De: "Rafael Lopez" < [ mailto:rafael.lo...@monash.edu | rafael.lo...@monash.edu 
] > 
À: "ceph-users" < [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] > 
Envoyé: Mercredi 20 Septembre 2017 08:17:23 
Objet: [ceph-users] luminous vs jewel rbd performance 

hey guys. 
wondering if anyone else has done some solid benchmarking of jewel vs luminous, 
in particular on the same cluster that has been upgraded (same cluster, client 
and config). 

we have recently upgraded a cluster from 10.2.9 to 12.2.0, and unfortunately i 
only captured results from a single fio (librbd) run with a few jobs in it 
before upgrading. i have run the same fio jobfile many times at different times 
of the day since upgrading, and been unable to produce a close match to the 
pre-upgrade (jewel) run from the same client. one particular job is 
significantly slower (4M block size, iodepth=1, seq read), up to 10x in one 
run. 

i realise i havent supplied much detail and it could be dozens of things, but i 
just wanted to see if anyone else had done more quantitative benchmarking or 
had similar experiences. keep in mind all we changed was daemons were restarted 
to use luminous code, everything else exactly the same. granted it is possible 
that some/all osds had some runtime config injected that differs from now, but 
i'm fairly confident this is not the case as they were recently restarted (on 
jewel code) after OS upgrades. 

cheers, 
Raf 

___ 
ceph-users mailing list 
[ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] 
[ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 







-- 
Rafael Lopez 
Research Devops Engineer 
Monash University eResearch Centre 

T: [ tel:%2B61%203%209905%209118 | +61 3 9905 9118 ] 
M: [ tel:%2B61%204%2027682%20670 | +61 (0)427682670 ] 
E: [ mailto:rafael.lo...@monash.edu | rafael.lo...@monash.edu ] 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous vs jewel rbd performance

2017-09-20 Thread Alexandre DERUMIER
Hi

So, do you also use filestore on luminous?

Have you also upgraded librbd on the client? (Are you benchmarking inside a qemu
machine, or directly with fio-rbd?)



(I'm going to do a lot of benchmarks in the coming week; I'll post the results to
the mailing list soon.)



- Mail original -
De: "Rafael Lopez" 
À: "ceph-users" 
Envoyé: Mercredi 20 Septembre 2017 08:17:23
Objet: [ceph-users] luminous vs jewel rbd performance

hey guys. 
wondering if anyone else has done some solid benchmarking of jewel vs luminous, 
in particular on the same cluster that has been upgraded (same cluster, client 
and config). 

we have recently upgraded a cluster from 10.2.9 to 12.2.0, and unfortunately i 
only captured results from a single fio (librbd) run with a few jobs in it 
before upgrading. i have run the same fio jobfile many times at different times 
of the day since upgrading, and been unable to produce a close match to the 
pre-upgrade (jewel) run from the same client. one particular job is 
significantly slower (4M block size, iodepth=1, seq read), up to 10x in one 
run. 

i realise i havent supplied much detail and it could be dozens of things, but i 
just wanted to see if anyone else had done more quantitative benchmarking or 
had similar experiences. keep in mind all we changed was daemons were restarted 
to use luminous code, everything else exactly the same. granted it is possible 
that some/all osds had some runtime config injected that differs from now, but 
i'm fairly confident this is not the case as they were recently restarted (on 
jewel code) after OS upgrades. 

cheers, 
Raf 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Alexandre DERUMIER
Sorry, I didn't see that you use Proxmox 5.

As I'm a Proxmox contributor, I can tell you that I have errors with kernel 4.10
(which is the Ubuntu kernel).

If you don't use ZFS, try kernel 4.12 from stretch-backports, or kernel 4.4
from Proxmox 4 (with ZFS support).


Tell me if it works better for you.

(I'm currently trying to backport the latest mlx5 patches from kernel 4.12 to
kernel 4.10, to see if that helps.)

I have opened a thread on the pve-devel mailing list today.



- Mail original -
De: "Alexandre Derumier" <aderum...@odiso.com>
À: "Burkhard Linke" <burkhard.li...@computational.bio.uni-giessen.de>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Vendredi 8 Septembre 2017 17:27:49
Objet: Re: [ceph-users] output discards (queue drops) on switchport

Hi, 

>> public network Mellanox ConnectX-4 Lx dual-port 25 GBit/s 

which kernel/distro do you use ? 

I have same card, and I had problem with centos7 kernel 3.10 recently, with 
packet drop 

i have also problems with ubuntu kernel 4.10 and lacp 


kernel 4.4 or 4.12 are working fine for me. 





- Mail original - 
De: "Burkhard Linke" <burkhard.li...@computational.bio.uni-giessen.de> 
À: "ceph-users" <ceph-users@lists.ceph.com> 
Envoyé: Vendredi 8 Septembre 2017 16:25:31 
Objet: Re: [ceph-users] output discards (queue drops) on switchport 

Hi, 


On 09/08/2017 04:13 PM, Andreas Herrmann wrote: 
> Hi, 
> 
> On 08.09.2017 15:59, Burkhard Linke wrote: 
>> On 09/08/2017 02:12 PM, Marc Roos wrote: 
>>> 
>>> Afaik ceph is is not supporting/working with bonding. 
>>> 
>>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html 
>>> (thread: Maybe some tuning for bonded network adapters) 
>> CEPH works well with LACP bonds. The problem described in that thread is the 
>> fact that LACP is not using links in a round robin fashion, but distributes 
>> network stream depending on a hash of certain parameters like source and 
>> destination IP address. This is already set to layer3+4 policy by the OP. 
>> 
>> Regarding the drops (and without any experience with neither 25GBit ethernet 
>> nor the Arista switches): 
>> Do you have corresponding input drops on the server's network ports? 
> No input drops, just output drop 
Output drops on the switch are related to input drops on the server 
side. If the link uses flow control and the server signals the switch 
that its internal buffer are full, the switch has to drop further 
packages if the port buffer is also filled. If there's no flow control, 
and the network card is not able to store the packet (full buffers...), 
it should be noted as overrun in the interface statistics (and if this 
is not correct, please correct me, I'm not a network guy). 

> 
>> Did you tune the network settings on server side for high throughput, e.g. 
>> net.ipv4.tcp_rmem, wmem, ...? 
> sysctl tuning is disabled at the moment. I tried sysctl examples from 
> https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is 
> still 
> the same amount of output drops. 
> 
>> And are the CPUs fast enough to handle the network traffic? 
> Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's 
> my first Ceph cluster. 
The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid 
controller and 8 ssd based osds with it. You can use tools like atop or 
ntop to watch certain aspects of the system during the tests (network, 
cpu, disk). 

Regards, 
Burkhard 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] output discards (queue drops) on switchport

2017-09-08 Thread Alexandre DERUMIER
Hi,

>> public network Mellanox ConnectX-4 Lx dual-port 25 GBit/s

which kernel/distro do you use ?

I have the same card, and I had problems with the CentOS 7 kernel 3.10 recently,
with packet drops.

I also have problems with the Ubuntu kernel 4.10 and LACP.


Kernels 4.4 and 4.12 are working fine for me.
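
(For the drop counters discussed below, the server side can be checked with something like this; the interface name is a placeholder:

ip -s link show dev enp1s0f0          # RX/TX errors, dropped, overrun
ethtool -S enp1s0f0 | grep -Ei 'drop|discard|overrun|pause'
)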





- Mail original -
De: "Burkhard Linke" 
À: "ceph-users" 
Envoyé: Vendredi 8 Septembre 2017 16:25:31
Objet: Re: [ceph-users] output discards (queue drops) on switchport

Hi, 


On 09/08/2017 04:13 PM, Andreas Herrmann wrote: 
> Hi, 
> 
> On 08.09.2017 15:59, Burkhard Linke wrote: 
>> On 09/08/2017 02:12 PM, Marc Roos wrote: 
>>> 
>>> Afaik ceph is is not supporting/working with bonding. 
>>> 
>>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg35474.html 
>>> (thread: Maybe some tuning for bonded network adapters) 
>> CEPH works well with LACP bonds. The problem described in that thread is the 
>> fact that LACP is not using links in a round robin fashion, but distributes 
>> network stream depending on a hash of certain parameters like source and 
>> destination IP address. This is already set to layer3+4 policy by the OP. 
>> 
>> Regarding the drops (and without any experience with neither 25GBit ethernet 
>> nor the Arista switches): 
>> Do you have corresponding input drops on the server's network ports? 
> No input drops, just output drop 
Output drops on the switch are related to input drops on the server 
side. If the link uses flow control and the server signals the switch 
that its internal buffer are full, the switch has to drop further 
packages if the port buffer is also filled. If there's no flow control, 
and the network card is not able to store the packet (full buffers...), 
it should be noted as overrun in the interface statistics (and if this 
is not correct, please correct me, I'm not a network guy). 

> 
>> Did you tune the network settings on server side for high throughput, e.g. 
>> net.ipv4.tcp_rmem, wmem, ...? 
> sysctl tuning is disabled at the moment. I tried sysctl examples from 
> https://fatmin.com/2015/08/19/ceph-tcp-performance-tuning/. But there is 
> still 
> the same amount of output drops. 
> 
>> And are the CPUs fast enough to handle the network traffic? 
> Xeon(R) CPU E5-1660 v4 @ 3.20GHz should be fast enough. But I'm unsure. It's 
> my first Ceph cluster. 
The CPU has 6 cores, and you are driving 2x 10GBit, 2x 25 GBit, the raid 
controller and 8 ssd based osds with it. You can use tools like atop or 
ntop to watch certain aspects of the system during the tests (network, 
cpu, disk). 

Regards, 
Burkhard 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCIe journal benefit for SSD OSDs

2017-09-07 Thread Alexandre DERUMIER
Hi Stefan

>>Have you already done tests how he performance changes with bluestore 
>>while putting all 3 block devices on the same ssd?


I'm going to test bluestore with 3 nodes, 18 x Intel S3610 1.6TB, in the coming
weeks.

I'll send the results to the mailing list.
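
For reference, when the DB and WAL do go on a separate (e.g. NVMe) device, the Luminous-era ceph-disk syntax is roughly the following; device paths are placeholders, and partition sizing comes from bluestore_block_db_size / bluestore_block_wal_size:

ceph-disk prepare --bluestore /dev/sdb \
    --block.db /dev/nvme0n1 \
    --block.wal /dev/nvme0n1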



- Mail original -
De: "Stefan Priebe, Profihost AG" 
À: "Christian Balzer" , "ceph-users" 
Envoyé: Jeudi 7 Septembre 2017 08:03:31
Objet: Re: [ceph-users] PCIe journal benefit for SSD OSDs

Hello, 
Am 07.09.2017 um 03:53 schrieb Christian Balzer: 
> 
> Hello, 
> 
> On Wed, 6 Sep 2017 09:09:54 -0400 Alex Gorbachev wrote: 
> 
>> We are planning a Jewel filestore based cluster for a performance 
>> sensitive healthcare client, and the conservative OSD choice is 
>> Samsung SM863A. 
>> 
> 
> While I totally see where you're coming from and me having stated that 
> I'll give Luminous and Bluestore some time to mature, I'd also be looking 
> into that if I were being in the planning phase now, with like 3 months 
> before deployment. 
> The inherent performance increase with Bluestore (and having something 
> that hopefully won't need touching/upgrading for a while) shouldn't be 
> ignored. 

Yes and that's the point where i'm currently as well. Thinking about how 
to design a new cluster based on bluestore. 

> The SSDs are fine, I've been starting to use those recently (though not 
> with Ceph yet) as Intel DC S36xx or 37xx are impossible to get. 
> They're a bit slower in the write IOPS department, but good enough for me. 

I've never used the Intel DC ones, only the Samsungs. Are the Intels really
faster? Have you disabled the FLUSH command for the Samsung ones? They don't
skip the command automatically like the Intels do. Sadly the Samsung SM863 got
more expensive over the last months; they were a lot cheaper in the first
months of 2016. Maybe the 2.5" Optane Intel SSDs will change the game.

>> but was wondering if anyone has seen a positive 
>> impact from also using PCIe journals (e.g. Intel P3700 or even the 
>> older 910 series) in front of such SSDs? 
>> 
> NVMe journals (or WAL and DB space for Bluestore) are nice and can 
> certainly help, especially if Ceph is tuned accordingly. 
> Avoid non DC NVMes, I doubt you can still get 910s, they are officially 
> EOL. 
> You want to match capabilities and endurances, a DC P3700 800GB would be 
> an OK match for 3-4 SM863a 960GB for example. 

That's a good point but makes the cluster more expensive. Currently 
while using filestore i use one SSD for journal and data which works fine. 

With bluestore we have block, DB and WAL, so we need 3 block devices per
OSD. If we need one PCIe or NVMe device per 3-4 OSDs it gets much
more expensive per host - currently running 10 OSDs / SSDs per node.

Have you already done tests on how the performance changes with bluestore
when putting all 3 block devices on the same SSD?

Greets, 
Stefan 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & Ceph Luminous

2017-07-26 Thread Alexandre DERUMIER
Hi Phil,


It's possible that rocksdb currently has a bug with some old CPUs (old Xeons
and some Opterons).
I have the same behaviour with a new cluster when creating mons:
http://tracker.ceph.com/issues/20529

What is your CPU model?
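
(It can be grabbed straight from /proc/cpuinfo; since the crash below is an illegal instruction, the flags line is interesting as well:

grep -m1 'model name' /proc/cpuinfo
grep -m1 '^flags' /proc/cpuinfo
)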

in your log: 

sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) 
luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]
sh[1869]:  2: (()+0x110c0) [0x7f6d835cb0c0]
sh[1869]:  3: 
(rocksdb::VersionBuilder::SaveTo(rocksdb::VersionStorageInfo*)+0x871) 
[0x5585615788b1]
sh[1869]:  4: 
(rocksdb::VersionSet::Recover(std::vector const&, bool)+0x26bc) 
[0x55856145ca4c]
sh[1869]:  5: 
(rocksdb::DBImpl::Recover(std::vector const&, bool, bool, 
bool)+0x11f) [0x558561423e6f]
sh[1869]:  6: (rocksdb::DB::Open(rocksdb::DBOptions const&, 
std::__cxx11::basic_string 
const&, std:
sh[1869]:  7: (rocksdb::DB::Open(rocksdb::Options const&, 
std::__cxx11::basic_string 
const&, rocksdb:
sh[1869]:  8: (RocksDBStore::do_open(std::ostream&, bool)+0x68e) 
[0x5585610af76e]
sh[1869]:  9: (RocksDBStore::create_and_open(std::ostream&)+0xd7) 
[0x5585610b0d27]
sh[1869]:  10: (BlueStore::_open_db(bool)+0x326) [0x55856103c6d6]
sh[1869]:  11: (BlueStore::mkfs()+0x856) [0x55856106d406]
sh[1869]:  12: (OSD::mkfs(CephContext*, ObjectStore*, 
std::__cxx11::basic_string 
const&, uuid_d, int)+0x348) [0x558560bc98f8]
sh[1869]:  13: (main()+0xe58) [0x558560b1da78]
sh[1869]:  14: (__libc_start_main()+0xf1) [0x7f6d825802b1]
sh[1869]:  15: (_start()+0x2a) [0x558560ba4dfa]
sh[1869]: 2017-07-16 14:46:00.763521 7f6d85db3c80 -1 *** Caught signal (Illegal 
instruction) **
sh[1869]:  in thread 7f6d85db3c80 thread_name:ceph-osd
sh[1869]:  ceph version 12.1.0 (330b5d17d66c6c05b08ebc129d3e6e8f92f73c60) 
luminous (dev)
sh[1869]:  1: (()+0x9bc562) [0x558561169562]

- Mail original -
De: "Phil Schwarz" 
À: "Udo Lembke" , "ceph-users" 
Envoyé: Dimanche 16 Juillet 2017 15:04:16
Objet: Re: [ceph-users] Broken Ceph Cluster when adding new one - Proxmox 5.0 & 
Ceph Luminous

Le 15/07/2017 à 23:09, Udo Lembke a écrit : 
> Hi, 
> 
> On 15.07.2017 16:01, Phil Schwarz wrote: 
>> Hi, 
>> ... 
>> 
>> While investigating, i wondered about my config : 
>> Question relative to /etc/hosts file : 
>> Should i use private_replication_LAN Ip or public ones ? 
> private_replication_LAN!! And the pve-cluster should use another network 
> (nics) if possible. 
> 
> Udo 
> 
OK, thanks Udo. 

After investigation, I did:
- set noout on the OSDs
- stopped the CPU-pegging LXC containers
- checked the cabling
- restarted the whole cluster

Everything went fine!

But when I tried to add a new OSD:

fdisk /dev/sdc --> Deleted the partition table 
parted /dev/sdc --> mklabel msdos (Disk came from a ZFS FreeBSD system) 
dd if=/dev/null of=/dev/sdc 
ceph-disk zap /dev/sdc 
dd if=/dev/zero of=/dev/sdc bs=10M count=1000 

And recreated the OSD via the web GUI.
Same result: the OSD is known by the node, but not by the cluster.

The logs seem to show an issue with this bluestore OSD; have a look at the file.

I'm going to give recreating the OSD with Filestore a try.

Thanks 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-img convert vs rbd import performance

2017-07-21 Thread Alexandre DERUMIER
>>Is there anything changed from Hammer to Jewel that might be affecting the 
>>qemu-img convert performance?

Maybe the object map for exclusive lock? (I think it could be a little bit slower
when objects are first created.)

You could test it: create the target rbd volume, disable exclusive-lock and
object-map, and try qemu-img convert.
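
A sketch of that test; image and pool names are placeholders, and -n tells qemu-img not to create the target since it already exists:

rbd create volumes/conv-test --size 20480
rbd feature disable volumes/conv-test object-map fast-diff exclusive-lock
time qemu-img convert -n -p -O raw -f raw /path/to/source.raw rbd:volumes/conv-test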



- Mail original -
De: "Mahesh Jambhulkar" <mahesh.jambhul...@trilio.io>
À: "aderumier" <aderum...@odiso.com>
Cc: "dillaman" <dilla...@redhat.com>, "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Vendredi 21 Juillet 2017 14:38:20
Objet: Re: [ceph-users] qemu-img convert vs rbd import performance

Thanks Alexandre! 
We were using Ceph Hammer before and we never had these performance issues
with qemu-img convert.

Has anything changed from Hammer to Jewel that might be affecting
qemu-img convert performance?

On Fri, Jul 21, 2017 at 2:24 PM, Alexandre DERUMIER < [ 
mailto:aderum...@odiso.com | aderum...@odiso.com ] > wrote: 


It's already in qemu 2.9 

[ 
http://git.qemu.org/?p=qemu.git;a=commit;h=2d9187bc65727d9dd63e2c410b5500add3db0b0d
 | 
http://git.qemu.org/?p=qemu.git;a=commit;h=2d9187bc65727d9dd63e2c410b5500add3db0b0d
 ] 


" 
This patches introduces 2 new cmdline parameters. The -m parameter to specify 
the number of coroutines running in parallel (defaults to 8). And the -W 
parameter to 
allow qemu-img to write to the target out of order rather than sequential. This 
improves 
performance as the writes do not have to wait for each other to complete. 
" 

- Mail original - 
De: "aderumier" < [ mailto:aderum...@odiso.com | aderum...@odiso.com ] > 
À: "dillaman" < [ mailto:dilla...@redhat.com | dilla...@redhat.com ] > 
Cc: "Mahesh Jambhulkar" < [ mailto:mahesh.jambhul...@trilio.io | 
mahesh.jambhul...@trilio.io ] >, "ceph-users" < [ 
mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] > 
Envoyé: Vendredi 21 Juillet 2017 10:51:21 
Objet: Re: [ceph-users] qemu-img convert vs rbd import performance 

Hi, 

they are an RFC here: 

"[RFC] qemu-img: make convert async" 
[ https://patchwork.kernel.org/patch/9552415/ | 
https://patchwork.kernel.org/patch/9552415/ ] 


maybe it could help 


- Mail original - 
De: "Jason Dillaman" < [ mailto:jdill...@redhat.com | jdill...@redhat.com ] > 
À: "Mahesh Jambhulkar" < [ mailto:mahesh.jambhul...@trilio.io | 
mahesh.jambhul...@trilio.io ] > 
Cc: "ceph-users" < [ mailto:ceph-users@lists.ceph.com | 
ceph-users@lists.ceph.com ] > 
Envoyé: Jeudi 20 Juillet 2017 15:20:32 
Objet: Re: [ceph-users] qemu-img convert vs rbd import performance 

Running a similar 20G import test within a single OSD VM-based cluster, I see 
the following: 
$ time qemu-img convert -p -O raw -f raw ~/image rbd:rbd/image 
(100.00/100%) 

real 3m20.722s 
user 0m18.859s 
sys 0m20.628s 

$ time rbd import ~/image 
Importing image: 100% complete...done. 

real 2m11.907s 
user 0m12.236s 
sys 0m20.971s 

Examining the IO patterns from qemu-img, I can see that it is effectively using 
synchronous IO (i.e. only a single write is in-flight at a time), whereas "rbd 
import" will send up to 10 (by default) IO requests concurrently. Therefore, 
the higher the latencies to your cluster, the worse qemu-img will perform as 
compared to "rbd import". 



On Thu, Jul 20, 2017 at 5:07 AM, Mahesh Jambhulkar < [ mailto: [ 
mailto:mahesh.jambhul...@trilio.io | mahesh.jambhul...@trilio.io ] | [ 
mailto:mahesh.jambhul...@trilio.io | mahesh.jambhul...@trilio.io ] ] > wrote: 



Adding rbd readahead disable after bytes = 0 did not help. 

[root@cephlarge mnt]# time qemu-img convert -p -O raw 
/mnt/data/workload_326e8a43-a90a-4fe9-8aab-6d33bcdf5a05/snapshot_9f0cee13-8200-4562-82ec-1fb9f234bcd8/vm_id_05e9534e-5c84-4487-9613-1e0e227e4c1a/vm_res_id_24291e4b-93d2-47ad-80a8-bf3c395319b9_vdb/66582225-6539-4e5e-9b7a-59aa16739df1
 rbd:volumes/24291e4b-93d2-47ad-80a8-bf3c395319b9 (100.00/100%) 

real 4858m13.822s 
user 73m39.656s 
sys 32m11.891s 
It took 80 hours to complete. 

Also, its not feasible to test this with huge 465GB file every time. So I 
tested qemu-img convert with a 20GB file. 

Parameters                                       Time taken
-t writeback                                     38 mins
-t none                                          38 mins
-S 4k                                            38 mins
With client options mentioned by Irek Fasikhov   40 mins
The time taken is almost the same.

On Thu, Jul 13, 2017 at 6:40 PM, Jason Dillaman < [ mailto: [ 
mailto:jdill...@redhat.com | jdill...@redhat.com ] | [ 
mailto:jdill...@redhat.com | jdill...@redhat.com ] ] > wrote: 


On Thu, Jul 13, 2017 at 8:57 AM, Irek Fasikhov < [ mailto: [ 
mailto:malm...@gmail.com | malm...@gmail.com ] | [ mailto:malm...@gmail.com | 
malm...@gmail.com ] ] > wrote: 
> rbd readahead disable after bytes = 0 


There isn't any reading from an RBD image in this example -- plus readahead
disables itself automatically after the first 50MBs of IO.

Re: [ceph-users] qemu-img convert vs rbd import performance

2017-07-21 Thread Alexandre DERUMIER
It's already in qemu 2.9

http://git.qemu.org/?p=qemu.git;a=commit;h=2d9187bc65727d9dd63e2c410b5500add3db0b0d


"
This patches introduces 2 new cmdline parameters. The -m parameter to specify
the number of coroutines running in parallel (defaults to 8). And the -W 
parameter to
allow qemu-img to write to the target out of order rather than sequential. This 
improves
performance as the writes do not have to wait for each other to complete.
"

- Mail original -
De: "aderumier" 
À: "dillaman" 
Cc: "Mahesh Jambhulkar" , "ceph-users" 

Envoyé: Vendredi 21 Juillet 2017 10:51:21
Objet: Re: [ceph-users] qemu-img convert vs rbd import performance

Hi, 

they are an RFC here: 

"[RFC] qemu-img: make convert async" 
https://patchwork.kernel.org/patch/9552415/ 


maybe it could help 


- Mail original - 
De: "Jason Dillaman"  
À: "Mahesh Jambhulkar"  
Cc: "ceph-users"  
Envoyé: Jeudi 20 Juillet 2017 15:20:32 
Objet: Re: [ceph-users] qemu-img convert vs rbd import performance 

Running a similar 20G import test within a single OSD VM-based cluster, I see 
the following: 
$ time qemu-img convert -p -O raw -f raw ~/image rbd:rbd/image 
(100.00/100%) 

real 3m20.722s 
user 0m18.859s 
sys 0m20.628s 

$ time rbd import ~/image 
Importing image: 100% complete...done. 

real 2m11.907s 
user 0m12.236s 
sys 0m20.971s 

Examining the IO patterns from qemu-img, I can see that it is effectively using 
synchronous IO (i.e. only a single write is in-flight at a time), whereas "rbd 
import" will send up to 10 (by default) IO requests concurrently. Therefore, 
the higher the latencies to your cluster, the worse qemu-img will perform as 
compared to "rbd import". 



On Thu, Jul 20, 2017 at 5:07 AM, Mahesh Jambhulkar < [ 
mailto:mahesh.jambhul...@trilio.io | mahesh.jambhul...@trilio.io ] > wrote: 



Adding rbd readahead disable after bytes = 0 did not help. 

[root@cephlarge mnt]# time qemu-img convert -p -O raw 
/mnt/data/workload_326e8a43-a90a-4fe9-8aab-6d33bcdf5a05/snapshot_9f0cee13-8200-4562-82ec-1fb9f234bcd8/vm_id_05e9534e-5c84-4487-9613-1e0e227e4c1a/vm_res_id_24291e4b-93d2-47ad-80a8-bf3c395319b9_vdb/66582225-6539-4e5e-9b7a-59aa16739df1
 rbd:volumes/24291e4b-93d2-47ad-80a8-bf3c395319b9 (100.00/100%) 

real 4858m13.822s 
user 73m39.656s 
sys 32m11.891s 
It took 80 hours to complete. 

Also, its not feasible to test this with huge 465GB file every time. So I 
tested qemu-img convert with a 20GB file. 

Parameters                                       Time taken
-t writeback                                     38 mins
-t none                                          38 mins
-S 4k                                            38 mins
With client options mentioned by Irek Fasikhov   40 mins
The time taken is almost the same.

On Thu, Jul 13, 2017 at 6:40 PM, Jason Dillaman < [ mailto:jdill...@redhat.com 
| jdill...@redhat.com ] > wrote: 

 
On Thu, Jul 13, 2017 at 8:57 AM, Irek Fasikhov < [ mailto:malm...@gmail.com | 
malm...@gmail.com ] > wrote: 
> rbd readahead disable after bytes = 0 


There isn't any reading from an RBD image in this example -- plus 
readahead disables itself automatically after the first 50MBs of IO 
(i.e. after the OS should have had enough time to start its own 
readahead logic). 

-- 
Jason 






-- 
Regards, 
mahesh j 

 




-- 
Jason 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] qemu-img convert vs rbd import performance

2017-07-21 Thread Alexandre DERUMIER
Hi,

they are an RFC here:

"[RFC] qemu-img: make convert async"
https://patchwork.kernel.org/patch/9552415/


maybe it could help


- Mail original -
De: "Jason Dillaman" 
À: "Mahesh Jambhulkar" 
Cc: "ceph-users" 
Envoyé: Jeudi 20 Juillet 2017 15:20:32
Objet: Re: [ceph-users] qemu-img convert vs rbd import performance

Running a similar 20G import test within a single OSD VM-based cluster, I see 
the following: 
$ time qemu-img convert -p -O raw -f raw ~/image rbd:rbd/image 
(100.00/100%) 

real 3m20.722s 
user 0m18.859s 
sys 0m20.628s 

$ time rbd import ~/image 
Importing image: 100% complete...done. 

real 2m11.907s 
user 0m12.236s 
sys 0m20.971s 

Examining the IO patterns from qemu-img, I can see that it is effectively using 
synchronous IO (i.e. only a single write is in-flight at a time), whereas "rbd 
import" will send up to 10 (by default) IO requests concurrently. Therefore, 
the higher the latencies to your cluster, the worse qemu-img will perform as 
compared to "rbd import". 



On Thu, Jul 20, 2017 at 5:07 AM, Mahesh Jambhulkar < [ 
mailto:mahesh.jambhul...@trilio.io | mahesh.jambhul...@trilio.io ] > wrote: 



Adding rbd readahead disable after bytes = 0 did not help. 

[root@cephlarge mnt]# time qemu-img convert -p -O raw 
/mnt/data/workload_326e8a43-a90a-4fe9-8aab-6d33bcdf5a05/snapshot_9f0cee13-8200-4562-82ec-1fb9f234bcd8/vm_id_05e9534e-5c84-4487-9613-1e0e227e4c1a/vm_res_id_24291e4b-93d2-47ad-80a8-bf3c395319b9_vdb/66582225-6539-4e5e-9b7a-59aa16739df1
 rbd:volumes/24291e4b-93d2-47ad-80a8-bf3c395319b9 (100.00/100%) 

real 4858m13.822s 
user 73m39.656s 
sys 32m11.891s 
It took 80 hours to complete. 

Also, its not feasible to test this with huge 465GB file every time. So I 
tested qemu-img convert with a 20GB file. 

Parameters                                       Time taken
-t writeback                                     38 mins
-t none                                          38 mins
-S 4k                                            38 mins
With client options mentioned by Irek Fasikhov   40 mins
The time taken is almost the same.

On Thu, Jul 13, 2017 at 6:40 PM, Jason Dillaman < [ mailto:jdill...@redhat.com 
| jdill...@redhat.com ] > wrote: 

On Thu, Jul 13, 2017 at 8:57 AM, Irek Fasikhov < [ mailto:malm...@gmail.com | 
malm...@gmail.com ] > wrote: 
> rbd readahead disable after bytes = 0 


There isn't any reading from an RBD image in this example -- plus 
readahead disables itself automatically after the first 50MBs of IO 
(i.e. after the OS should have had enough time to start its own 
readahead logic). 

-- 
Jason 






-- 
Regards, 
mahesh j 





-- 
Jason 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph mount rbd

2017-06-30 Thread Alexandre DERUMIER
>>Of course, I always have to ask the use-case behind mapping the same image on 
>>multiple hosts. Perhaps CephFS would be a better fit if you are trying to 
>>serve out a filesystem?

Hi jason,

Currently I'm sharing rbd images between multiple webserver VMs with ocfs2 on
top.

They have old kernels, so they can't use cephfs for now.

Some servers also have between 20-30 million files, so I need to test cephfs to
see if it can handle between 100-150 million files (which are currently handled
by 5 rbd images).

Can cephfs handle that many files currently? (I'm waiting for luminous to test it.)







- Mail original -
De: "Jason Dillaman" 
À: "Maged Mokhtar" 
Cc: "ceph-users" 
Envoyé: Jeudi 29 Juin 2017 02:02:44
Objet: Re: [ceph-users] Ceph mount rbd

... additionally, the forthcoming 4.12 kernel release will support 
non-cooperative exclusive locking. By default, since 4.9, when the 
exclusive-lock feature is enabled, only a single client can write to the block 
device at a time -- but they will cooperatively pass the lock back and forth 
upon write request. With the new "rbd map" option, you can map a image on 
exactly one host and prevent other hosts from mapping the image. If that host 
should die, the exclusive-lock will automatically become available to other 
hosts for mapping. 
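
(As a sketch of what that looks like with a recent rbd CLI and a 4.12+ kernel; the image name is the one from the original post, and the exact option spelling should be checked against the rbd man page for your version:

rbd map --exclusive veeamrepo   # a second host's map attempt is refused while this one holds the lock
)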
Of course, I always have to ask the use-case behind mapping the same image on 
multiple hosts. Perhaps CephFS would be a better fit if you are trying to serve 
out a filesystem? 

On Wed, Jun 28, 2017 at 6:25 PM, Maged Mokhtar < [ mailto:mmokh...@petasan.org 
| mmokh...@petasan.org ] > wrote: 





On 2017-06-28 22:55, [ mailto:li...@marcelofrota.info | li...@marcelofrota.info 
] wrote: 




Hi People, 

I am testing a new environment, with ceph + rbd on Ubuntu 16.04, and I have
one question.

I have my ceph cluster and I map and mount the image using the following
commands in my Linux environment:

rbd create veeamrepo --size 20480 
rbd --image veeamrepo info 
modprobe rbd 
rbd map veeamrepo 
rbd feature disable veeamrepo exclusive-lock object-map fast-diff deep-flatten 
mkdir /mnt/veeamrepo 
mount /dev/rbd0 /mnt/veeamrepo 

The commands work fine, but I have one problem: at the moment I can mount
/mnt/veeamrepo on 2 machines at the same time, and this is a bad option for me
because it could corrupt the filesystem.

I need only one machine to be allowed to mount and write at a time.

For example, if machine1 mounts /mnt/veeamrepo and machine2 tries to mount it,
an error should be displayed saying that the machine cannot mount it because
the filesystem is already mounted on machine1.

Could someone help me with this or give me some tips to solve my problem?

Thanks a lot 

___ 
ceph-users mailing list 
[ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] 
[ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 






You can use Pacemaker to map the rbd and mount the filesystem on 1 server and 
in case of failure switch to another server. 

___ 
ceph-users mailing list 
[ mailto:ceph-users@lists.ceph.com | ceph-users@lists.ceph.com ] 
[ http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com | 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ] 






-- 
Jason 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph packages for Debian Stretch?

2017-06-21 Thread Alexandre DERUMIER
Hi,

Proxmox is maintaining a ceph-luminous repo for stretch:

http://download.proxmox.com/debian/ceph-luminous/


The git is here, with the patches and modifications to get it working:
https://git.proxmox.com/?p=ceph.git;a=summary
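
A sketch of wiring that repository up on a stretch box; the suite and component names are assumed from the repository layout above, and the Proxmox release key has to be imported first:

echo "deb http://download.proxmox.com/debian/ceph-luminous stretch main" \
    > /etc/apt/sources.list.d/ceph-luminous.list
apt-get update && apt-get install ceph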



- Mail original -
De: "Alfredo Deza" 
À: "Christian Balzer" 
Cc: "ceph-users" 
Envoyé: Mardi 20 Juin 2017 18:54:05
Objet: Re: [ceph-users] Ceph packages for Debian Stretch?

On Mon, Jun 19, 2017 at 8:25 PM, Christian Balzer  wrote: 
> 
> Hello, 
> 
> can we have the status, projected release date of the Ceph packages for 
> Debian Stretch? 

We don't have anything yet as a projected release date. 

The current status is that this has not been prioritized. I anticipate 
that this will not be hard to accommodate in our repositories but 
it will require quite the effort to add in all of our tooling. 

In case anyone would like to help us out before the next stable 
release, these are places that would need to be updated for "stretch" 

https://github.com/ceph/ceph-build/tree/master/ceph-build 
https://github.com/ceph/chacra 

"grepping" for "jessie" should indicate every spot that might need to 
be updated. 

I am happy to review and answer questions to get these changes in! 


> 
> Christian 
> -- 
> Christian Balzer Network/Systems Engineer 
> ch...@gol.com Rakuten Communications 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-27 Thread Alexandre DERUMIER
Hi,

>>What I'm trying to get from the list is /why/ the "enterprise" drives 
>>are important. Performance? Reliability? Something else? 

Performance, for sure (for SYNC writes, see
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/).
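
The test described there boils down to single-job, queue-depth-1 O_DSYNC 4k writes, roughly this fio invocation (destructive on the target device, so point it at an unused disk):

fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting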

Reliability: yes, enterprise drives have supercapacitors in case of power failure,
and better endurance (1 DWPD for the S3520, 3 DWPD for the S3610).


>>Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>>the single drive leaves more bays free for OSD disks, but is there any
>>other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>>mean:

Where do you see this price difference?

For me, the S3520 is around 25-30% cheaper than the S3610.


- Mail original -
De: "Adam Carheden" 
À: "ceph-users" 
Envoyé: Mercredi 26 Avril 2017 16:53:48
Objet: Re: [ceph-users] Sharing SSD journals and SSD drive choice

What I'm trying to get from the list is /why/ the "enterprise" drives 
are important. Performance? Reliability? Something else? 

The Intel was the only one I was seriously considering. The others were 
just ones I had for other purposes, so I thought I'd see how they fared 
in benchmarks. 

The Intel was the clear winner, but my tests did show that throughput 
tanked with more threads. Hypothetically, if I was throwing 16 OSDs at 
it, all with osd op threads = 2, do the benchmarks below not show that 
the Hynix would be a better choice (at least for performance)? 

Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously 
the single drive leaves more bays free for OSD disks, but is there any 
other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s 
mean: 

a) fewer OSDs go down if the SSD fails 

b) better throughput (I'm speculating that the S3610 isn't 4 times 
faster than the S3520) 

c) load spread across 4 SATA channels (I suppose this doesn't really 
matter since the drives can't throttle the SATA bus). 


-- 
Adam Carheden 

On 04/26/2017 01:55 AM, Eneko Lacunza wrote: 
> Adam, 
> 
> What David said before about SSD drives is very important. I will tell 
> you another way: use enterprise grade SSD drives, not consumer grade. 
> Also, pay attention to endurance. 
> 
> The only suitable drive for Ceph I see in your tests is SSDSC2BB150G7, 
> and probably it isn't even the most suitable SATA SSD disk from Intel; 
> better to use the S3610 or S3710 series. 
> 
> Cheers 
> Eneko 
> 
> El 25/04/17 a las 21:02, Adam Carheden escribió: 
>> On 04/25/2017 11:57 AM, David wrote: 
>>> On 19 Apr 2017 18:01, "Adam Carheden" wrote: 
>>> 
>>> Does anyone know if XFS uses a single thread to write to its 
>>> journal? 
>>> 
>>> 
>>> You probably know this but just to avoid any confusion, the journal in 
>>> this context isn't the metadata journaling in XFS, it's a separate 
>>> journal written to by the OSD daemons 
>> Ha! I didn't know that. 
>> 
>>> I think the number of threads per OSD is controlled by the 'osd op 
>>> threads' setting which defaults to 2 
>> So the ideal (for performance) CEPH cluster would be one SSD per HDD 
>> with 'osd op threads' set to whatever value fio shows as the optimal 
>> number of threads for that drive then? 
>> 
>>> I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps 
>>> consider going up to a 37xx and putting more OSDs on it. Of course with 
>>> the caveat that you'll lose more OSDs if it goes down. 
>> Why would you avoid the SanDisk and Hynix? Reliability (I think those 
>> two are both TLC)? Brand trust? If it's my benchmarks in my previous 
>> email, why not the Hynix? It's slower than the Intel, but sort of 
>> decent, at least compared to the SanDisk. 
>> 
>> My final numbers are below, including an older Samsung Evo (MLC I think) 
>> which did horribly, though not as bad as the SanDisk. The Seagate is a 
>> 10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD comparison. 
>> 
>> SanDisk SDSSDA240G, fio 1 jobs: 7.0 MB/s (5 trials) 
>> 
>> 
>> SanDisk SDSSDA240G, fio 2 jobs: 7.6 MB/s (5 trials) 
>> 
>> 
>> SanDisk SDSSDA240G, fio 4 jobs: 7.5 MB/s (5 trials) 
>> 
>> 
>> SanDisk SDSSDA240G, fio 8 jobs: 7.6 MB/s (5 trials) 
>> 
>> 
>> SanDisk SDSSDA240G, fio 16 jobs: 7.6 MB/s (5 trials) 
>> 
>> 
>> SanDisk SDSSDA240G, fio 32 jobs: 7.6 MB/s (5 trials) 
>> 
>> 
>> SanDisk SDSSDA240G, fio 64 jobs: 7.6 MB/s (5 trials) 
>> 
>> 
>> HFS250G32TND-N1A2A 3P10, fio 1 jobs: 4.2 MB/s (5 trials) 
>> 
>> 
>> HFS250G32TND-N1A2A 3P10, fio 2 jobs: 0.6 MB/s (5 trials) 
>> 
>> 
>> HFS250G32TND-N1A2A 3P10, fio 4 jobs: 7.5 MB/s (5 trials) 
>> 
>> 
>> HFS250G32TND-N1A2A 3P10, fio 8 jobs: 17.6 MB/s (5 trials) 
>> 
>> 
>> HFS250G32TND-N1A2A 3P10, fio 16 jobs: 32.4 MB/s (5 trials) 
>> 
>> 
>> HFS250G32TND-N1A2A 3P10, fio 32 jobs: 64.4 MB/s (5 trials) 
>> 
>> 
>> HFS250G32TND-N1A2A 3P10, fio 64 jobs: 71.6 MB/s (5 trials) 
>> 
>> 

Re: [ceph-users] ceph packages on stretch from eu.ceph.com

2017-04-26 Thread Alexandre DERUMIER
you can try the proxmox stretch repository if you want

http://download.proxmox.com/debian/ceph-luminous/dists/stretch/



- Mail original -
De: "Wido den Hollander" 
À: "ceph-users" , "Ronny Aasen" 

Envoyé: Mercredi 26 Avril 2017 16:58:04
Objet: Re: [ceph-users] ceph packages on stretch from eu.ceph.com

> Op 25 april 2017 om 20:07 schreef Ronny Aasen : 
> 
> 
> Hello 
> 
> i am trying to install ceph on debian stretch from 
> 
> http://eu.ceph.com/debian-jewel/dists/ 
> 
> but there is no stretch repo there. 
> 
> now with stretch being frozen, it is a good time to be testing ceph on 
> stretch. is it possible to get packages for stretch on jewel, kraken, 
> and luminous? 

Afaik packages are only built for stable releases. As Stretch isn't out yet, there 
are no packages. 

You can check whether the Ubuntu 16.04 (Xenial) packages work. 

Wido 

> 
> 
> 
> kind regards 
> 
> Ronny Aasen 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] libjemalloc.so.1 not used?

2017-03-27 Thread Alexandre DERUMIER
You need to recompile Ceph with jemalloc, without having the tcmalloc dev libraries installed.

LD_PRELOAD has never worked for jemalloc and Ceph.
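
A rough sketch of that rebuild (the flags are from memory, so double-check them
against your Ceph version; autotools applies to jewel and earlier, cmake to
kraken and later):

# autotools-based source tree
./configure --with-jemalloc --without-tcmalloc
# cmake-based source tree
./do_cmake.sh -DALLOCATOR=jemalloc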


- Mail original -
De: "Engelmann Florian" 
À: "ceph-users" 
Envoyé: Lundi 27 Mars 2017 16:54:33
Objet: [ceph-users] libjemalloc.so.1 not used?

Hi, 

we are testing Ceph as block storage (XFS based OSDs) running in a hyper 
converged setup with KVM as hypervisor. We are using NVMe SSD only (Intel DC 
P5320) and I would like to use jemalloc on Ubuntu xenial (current kernel 
4.4.0-64-generic). I tried to use /etc/default/ceph and uncommented: 


# /etc/default/ceph 
# 
# Environment file for ceph daemon systemd unit files. 
# 

# Increase tcmalloc cache size 
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 

## use jemalloc instead of tcmalloc 
# 
# jemalloc is generally faster for small IO workloads and when 
# ceph-osd is backed by SSDs. However, memory usage is usually 
# higher by 200-300mb. 
# 
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 

and it looks like the OSDs are using jemalloc: 

lsof |grep -e "ceph-osd.*8074.*malloc" 
ceph-osd 8074 ceph mem REG 252,0 294776 659213 /usr/lib/libtcmalloc.so.4.2.6 
ceph-osd 8074 ceph mem REG 252,0 219816 658861 
/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 
ceph-osd 8074 8116 ceph mem REG 252,0 294776 659213 
/usr/lib/libtcmalloc.so.4.2.6 
ceph-osd 8074 8116 ceph mem REG 252,0 219816 658861 
/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 
ceph-osd 8074 8117 ceph mem REG 252,0 294776 659213 
/usr/lib/libtcmalloc.so.4.2.6 
ceph-osd 8074 8117 ceph mem REG 252,0 219816 658861 
/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 
ceph-osd 8074 8118 ceph mem REG 252,0 294776 659213 
/usr/lib/libtcmalloc.so.4.2.6 
ceph-osd 8074 8118 ceph mem REG 252,0 219816 658861 
/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 
[...] 

But perf top shows something different: 

Samples: 11M of event 'cycles:pp', Event count (approx.): 603904862529620 
Overhead Shared Object Symbol 
1.86% libtcmalloc.so.4.2.6 [.] operator new[] 
1.73% [kernel] [k] mem_cgroup_iter 
1.34% libstdc++.so.6.0.21 [.] std::__ostream_insert 
1.29% libpthread-2.23.so [.] pthread_mutex_lock 
1.10% [kernel] [k] __switch_to 
0.97% libpthread-2.23.so [.] pthread_mutex_unlock 
0.94% [kernel] [k] native_queued_spin_lock_slowpath 
0.92% [kernel] [k] update_cfs_shares 
0.90% libc-2.23.so [.] __memcpy_avx_unaligned 
0.87% libtcmalloc.so.4.2.6 [.] operator delete[] 
0.80% ceph-osd [.] ceph::buffer::ptr::release 
0.80% [kernel] [k] mem_cgroup_zone_lruvec 


Do my OSDs use jemalloc or don't they? 
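
For what it's worth, a quick way to see what the binary is actually linked
against, as opposed to merely mapped in via LD_PRELOAD, is something like:

ldd /usr/bin/ceph-osd | grep -E 'tcmalloc|jemalloc'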

All the best, 
Florian 




EveryWare AG 
Florian Engelmann 
Systems Engineer 
Zurlindenstrasse 52a 
CH-8003 Zürich 

T +41 44 466 60 00 
F +41 44 466 60 10 

florian.engelm...@everyware.ch 
www.everyware.ch 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] noout, nodown and blocked requests

2017-03-13 Thread Alexandre DERUMIER
Hi,

>>Currently I have the noout and nodown flags set while doing the maintenance 
>>work.

you only need noout to avoid rebalancing

see documentation:
http://docs.ceph.com/docs/kraken/rados/troubleshooting/troubleshooting-osd/
"STOPPING W/OUT REBALANCING".


Your clients are hanging because of the nodown flag.


See this blog for noout/nodown flag experiments:

https://www.sebastien-han.fr/blog/2013/04/17/some-ceph-experiments/
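
A minimal sketch of the flag handling for this kind of maintenance (standard
ceph CLI commands):

ceph osd set noout      # keep data in place while the node is down
ceph osd unset nodown   # let OSDs be marked down so clients stop waiting on them
# once the node is back (or permanently removed):
ceph osd unset noout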

- Mail original -
De: "Shain Miley" 
À: "ceph-users" 
Envoyé: Lundi 13 Mars 2017 04:58:08
Objet: [ceph-users] noout, nodown and blocked requests

Hello, 
One of the nodes in our 14 node cluster is offline and before I totally commit 
to fully removing the node from the cluster (there is a chance I can get the 
node back in working order in the next few days) I would like to run the 
cluster with that single node out for a few days. 

Currently I have the noout and nodown flags set while doing the maintenance 
work. 

Some users are complaining about disconnects and other oddities when trying to 
save and access files currently on the cluster. 

I am also seeing some blocked requests when viewing the cluster status (at this 
point I see 160 blocked requests spread over 15 to 20 OSDs). 

Currently I have a replication level of 3 on this pool and a min_size of 1. 

My question is this…is there a better method to use (other than using noout and 
nodown) in this scenario where I do not want data movement yet…but I do want 
the reads and writes to the cluster to respond as normally as possible for 
the end users? 

Thanks in advance, 

Shain 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-10 Thread Alexandre DERUMIER
>>Regarding rbd cache, is something I will try -today I was thinking about it- 
>>but I did not try it yet because I don't want to reduce write speed.

Note that rbd_cache only helps sequential writes, so it doesn't help for 
random writes.

Also, internally, qemu forces aio=threads when cache=writeback is enabled, 
but it can use aio=native with cache=none.
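
A minimal qemu sketch of that combination (cache=none plus a dedicated iothread;
the pool/image name and client id are placeholders):

qemu-system-x86_64 -m 1024 -enable-kvm \
  -object iothread,id=iothread0 \
  -drive file=rbd:rbd/vm-disk:id=admin,if=none,id=drive0,format=raw,cache=none \
  -device virtio-blk-pci,drive=drive0,iothread=iothread0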



- Mail original -
De: "Xavier Trilla" <xavier.tri...@silicontower.net>
À: "aderumier" <aderum...@odiso.com>
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Envoyé: Vendredi 10 Mars 2017 14:12:59
Objet: Re: [ceph-users] Posix AIO vs libaio read performance

Hi Alexandre, 

Debugging is disabled in client and osds. 

Regarding rbd cache, is something I will try -today I was thinking about it- 
but I did not try it yet because I don't want to reduce write speed. 

I also tried iothreads, but no benefit. 

I tried as well with virtio-blk and virtio-scsi, there is a small improvement 
with virtio-blk, but it's only around 10%. 

This is becoming quite a strange issue, as it only affects POSIX AIO read 
performance. Nothing else seems to be affected, although POSIX AIO write is 
nowhere near libaio performance. 

Thanks for your help; if you have any other ideas they will be really 
appreciated. 

Also, if somebody could run the following command from inside a VM in their 
cluster: 



fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32 



It would be really helpful to know if I'm the only one affected or if this is 
happening in all qemu + ceph setups. 

Thanks! 
Xavier 

El 10 mar 2017, a las 8:07, Alexandre DERUMIER <aderum...@odiso.com> escribió: 


>>But it still looks like there is some bottleneck in QEMU or librbd I cannot 
>>manage to find. 

You can improve latency on the client by disabling debug. 

on your client, create a /etc/ceph/ceph.conf with 

[global] 
debug asok = 0/0 
debug auth = 0/0 
debug buffer = 0/0 
debug client = 0/0 
debug context = 0/0 
debug crush = 0/0 
debug filer = 0/0 
debug filestore = 0/0 
debug finisher = 0/0 
debug heartbeatmap = 0/0 
debug journal = 0/0 
debug journaler = 0/0 
debug lockdep = 0/0 
debug mds = 0/0 
debug mds balancer = 0/0 
debug mds locker = 0/0 
debug mds log = 0/0 
debug mds log expire = 0/0 
debug mds migrator = 0/0 
debug mon = 0/0 
debug monc = 0/0 
debug ms = 0/0 
debug objclass = 0/0 
debug objectcacher = 0/0 
debug objecter = 0/0 
debug optracker = 0/0 
debug osd = 0/0 
debug paxos = 0/0 
debug perfcounter = 0/0 
debug rados = 0/0 
debug rbd = 0/0 
debug rgw = 0/0 
debug throttle = 0/0 
debug timer = 0/0 
debug tp = 0/0 


You can also disable the rbd cache (rbd_cache=false), or in qemu set cache=none. 

Using an iothread on the qemu drive should help a little bit too. 

- Mail original - 
De: "Xavier Trilla" <xavier.tri...@silicontower.net> 
À: "ceph-users" <ceph-users@lists.ceph.com> 
Envoyé: Vendredi 10 Mars 2017 05:37:01 
Objet: Re: [ceph-users] Posix AIO vs libaio read performance 



Hi, 



We compiled Hammer .10 to use jemalloc and now the cluster performance improved 
a lot, but POSIX AIO operations are still quite slower than libaio. 



Now with a single thread read operations are about 1000 per second and write 
operations about 5000 per second. 



Using same FIO configuration, but libaio read operations are about 15K per 
second and writes 12K per second. 



I’m compiling QEMU with jemalloc support as well, and I’m planning to replace 
librbd on the QEMU hosts with the new one using jemalloc. 



But it still looks like there is some bottleneck in QEMU or librbd I cannot 
manage to find. 



Any help will be much appreciated. 



Thanks. 






De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Xavier Trilla 
Enviado el: jueves, 9 de marzo de 2017 6:56 
Para: ceph-users@lists.ceph.com 
Asunto: [ceph-users] Posix AIO vs libaio read performance 




Hi, 



I’m trying to debug why there is a big difference using POSIX AIO and libaio 
when performing read tests from inside a VM using librbd. 



The results I’m getting using FIO are: 



POSIX AIO Read: 



Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 2.54 MB/s 

Average: 632 IOPS 



Libaio Read: 



Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 147.88 MB/s 

Average: 36967 IOPS 



When performing writes the differences aren’t so big, because the cluster 
(which is in production right now) is CPU bound: 



POSIX AIO Write: 



Type:

Re: [ceph-users] Posix AIO vs libaio read performance

2017-03-09 Thread Alexandre DERUMIER

>>But it still looks like there is some bottleneck in QEMU or librbd I cannot 
>>manage to find.

You can improve latency on the client by disabling debug.

on your client, create a /etc/ceph/ceph.conf with

[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0


You can also disable the rbd cache (rbd_cache=false), or in qemu set cache=none.

Using an iothread on the qemu drive should help a little bit too.

- Mail original -
De: "Xavier Trilla" 
À: "ceph-users" 
Envoyé: Vendredi 10 Mars 2017 05:37:01
Objet: Re: [ceph-users] Posix AIO vs libaio read performance



Hi, 



We compiled Hammer .10 to use jemalloc and now the cluster performance improved 
a lot, but POSIX AIO operations are still quite slower than libaio. 



Now with a single thread read operations are about 1000 per second and write 
operations about 5000 per second. 



Using same FIO configuration, but libaio read operations are about 15K per 
second and writes 12K per second. 



I’m compiling QEMU with jemalloc support as well, and I’m planning to replace 
librbd on the QEMU hosts with the new one using jemalloc. 



But it still looks like there is some bottleneck in QEMU or librbd I cannot 
manage to find. 



Any help will be much appreciated. 



Thanks. 






De: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] En nombre de Xavier 
Trilla 
Enviado el: jueves, 9 de marzo de 2017 6:56 
Para: ceph-users@lists.ceph.com 
Asunto: [ceph-users] Posix AIO vs libaio read performance 




Hi, 



I’m trying to debug why there is a big difference using POSIX AIO and libaio 
when performing read tests from inside a VM using librbd. 



The results I’m getting using FIO are: 



POSIX AIO Read: 



Type: Random Read - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 2.54 MB/s 

Average: 632 IOPS 



Libaio Read: 



Type: Random Read - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 147.88 MB/s 

Average: 36967 IOPS 



When performing writes the differences aren’t so big, because the cluster 
(which is in production right now) is CPU bound: 



POSIX AIO Write: 



Type: Random Write - IO Engine: POSIX AIO - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 14.87 MB/s 

Average: 3713 IOPS 



Libaio Write: 



Type: Random Write - IO Engine: Libaio - Buffered: No - Direct: Yes - Block 
Size: 4KB - Disk Target: /: 



Average: 14.51 MB/s 

Average: 3622 IOPS 





Even if the write results are CPU bound, as the machines containing the OSDs 
don’t have enough CPU to handle all the IOPS (CPU upgrades are on their way), I 
cannot really understand why I’m seeing so much difference in the read tests. 



Some configuration background: 



- Cluster and clients are using Hammer 0.94.90 

- It’s a full SSD cluster running over Samsung Enterprise SATA SSDs, with all 
the typical tweaks (Customized ceph.conf, optimized sysctl, etc…) 

- Tried QEMU 2.0 and 2.7 – Similar results 

- Tried virtio-blk and virtio-scsi – Similar results 



I’ve been reading about POSIX AIO and libaio, and I can see there are several 
differences in how they work (like one being user space and the other being 
kernel space), but I don’t really get why Ceph has such problems handling POSIX 
AIO read operations, but not write operations, and how to avoid them. 



Right now I’m trying to identify if it’s something wrong with our Ceph cluster 
setup, with Ceph in general or with QEMU (virtio-scsi or virtio-blk as both 
have the same behavior) 



If you would like to try to reproduce the issue here are the two command lines 
I’m using: 



fio --name=randread-posix --output ./test --runtime 60 --ioengine=posixaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32 

fio --name=randread-libaio --output ./test --runtime 60 --ioengine=libaio 
--buffered=0 --direct=1 --rw=randread --bs=4k --size=1024m --iodepth=32 





If you could shed any light on this it would be really helpful, as right now, 
although I still have some ideas left to try, I don’t have much idea about 
why this is happening… 



Thanks! 

Xavier 

___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 

Re: [ceph-users] KVM/QEMU rbd read latency

2017-02-17 Thread Alexandre DERUMIER
>>We also need to support >1 librbd/librados-internal IO
>>thread for outbound/inbound paths.

That would be wonderful!
Multiple iothreads per disk are coming for qemu too (I have seen Paolo Bonzini 
sending a lot of patches this month).



- Mail original -
De: "Jason Dillaman" <jdill...@redhat.com>
À: "aderumier" <aderum...@odiso.com>
Cc: "Phil Lacroute" <lacro...@skyportsystems.com>, "ceph-users" 
<ceph-users@lists.ceph.com>
Envoyé: Vendredi 17 Février 2017 15:16:39
Objet: Re: [ceph-users] KVM/QEMU rbd read latency

On Fri, Feb 17, 2017 at 2:14 AM, Alexandre DERUMIER <aderum...@odiso.com> 
wrote: 
> and I have good hope that this new feature 
> "RBD: Add support readv,writev for rbd" 
> http://marc.info/?l=ceph-devel=148726026914033=2 

Definitely will eliminate 1 unnecessary data copy -- but sadly it 
still will make a single copy within librbd immediately since librados 
*might* touch the IO memory after it has ACKed the op. Once that issue 
is addressed, librbd can eliminate that copy if the librbd cache is 
disabled. We also need to support >1 librbd/librados-internal IO 
thread for outbound/inbound paths. 

-- 
Jason 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] KVM/QEMU rbd read latency

2017-02-16 Thread Alexandre DERUMIER
Hi,

Currently I can reduce the latency by:

- compiling qemu to use jemalloc
- disabling rbd_cache (or setting qemu cache=none)
- disabling debug in /etc/ceph/ceph.conf on the client node:


[global]
 debug asok = 0/0
 debug auth = 0/0
 debug buffer = 0/0
 debug client = 0/0
 debug context = 0/0
 debug crush = 0/0
 debug filer = 0/0
 debug filestore = 0/0
 debug finisher = 0/0
 debug heartbeatmap = 0/0
 debug journal = 0/0
 debug journaler = 0/0
 debug lockdep = 0/0
 debug mds = 0/0
 debug mds balancer = 0/0
 debug mds locker = 0/0
 debug mds log = 0/0
 debug mds log expire = 0/0
 debug mds migrator = 0/0
 debug mon = 0/0
 debug monc = 0/0
 debug ms = 0/0
 debug objclass = 0/0
 debug objectcacher = 0/0
 debug objecter = 0/0
 debug optracker = 0/0
 debug osd = 0/0
 debug paxos = 0/0
 debug perfcounter = 0/0
 debug rados = 0/0
 debug rbd = 0/0
 debug rgw = 0/0
 debug throttle = 0/0
 debug timer = 0/0
 debug tp = 0/0


With this, I can reach around 50-60k 4k iops with 1 disk and iothread enabled.
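
For the jemalloc part, the qemu build of that era takes a configure switch,
roughly (check ./configure --help for your qemu version):

./configure --enable-jemalloc
make -j"$(nproc)"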


And I have good hope that this new feature 
"RBD: Add support readv,writev for rbd"
http://marc.info/?l=ceph-devel=148726026914033=2

will help too by reducing copies (that's why I'm using jemalloc too).




- Mail original -
De: "Phil Lacroute" 
À: "ceph-users" 
Envoyé: Jeudi 16 Février 2017 19:53:47
Objet: [ceph-users] KVM/QEMU rbd read latency

Hi, 

I am doing some performance characterization experiments for ceph with KVM 
guests, and I’m observing significantly higher read latency when using the QEMU 
rbd client compared to krbd. Is that expected or have I missed some tuning 
knobs to improve this? 

Cluster details: 
Note that this cluster was built for evaluation purposes, not production, hence 
the choice of small SSDs with low endurance specs. 
Client host OS: Debian, 4.7.0 kernel 
QEMU version 2.7.0 
Ceph version Jewel 10.2.3 
Client and OSD CPU: Xeon D-1541 2.1 GHz 
OSDs: 5 nodes, 3 SSDs each, one journal partition and one data partition per 
SSD, XFS data file system (15 OSDs total) 
Disks: DC S3510 240GB 
Network: 10 GbE, dedicated switch for storage traffic 
Guest OS: Debian, virtio drivers 

Performance testing was done with fio on raw disk devices using this config: 
ioengine=libaio 
iodepth=128 
direct=1 
size=100% 
rw=randread 
bs=4k 

Case 1: krbd, fio running on the raw rbd device on the client host (no guest) 
IOPS: 142k 
Average latency: 0.9 msec 

Case 2: krbd, fio running in a guest (libvirt disk XML stripped by the list archive) 
IOPS: 119k 
Average Latency: 1.1 msec 

Case 3: QEMU RBD client, fio running in a guest (libvirt disk XML stripped by the list archive) 
IOPS: 25k 
Average Latency: 5.2 msec 

The question is why the test with the QEMU RBD client (case 3) shows 4 msec of 
additional latency compared the guest using the krbd-mapped image (case 2). 

Note that the IOPS bottleneck for all of these cases is the rate at which the 
client issues requests, which is limited by the average latency and the maximum 
number of outstanding requests (128). Since the latency is the dominant factor 
in average read throughput for these small accesses, we would really like to 
understand the source of the additional latency. 

Thanks, 
Phil 




___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] Ceph counters decrementing after changing pg_num

2017-01-20 Thread Alexandre DERUMIER
If you change the pg_num value, ceph will reshuffle almost all data, so depending 
on the size of your storage, it can take some time ...
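
For reference, the change itself and a way to watch the reshuffle (pool name and
PG count here are just examples):

ceph osd pool set rbd pg_num 1024
ceph osd pool set rbd pgp_num 1024
ceph -s    # watch the misplaced/degraded object counts drain as data moves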

- Mail original -
De: "Kai Storbeck" 
À: "ceph-users" 
Envoyé: Vendredi 20 Janvier 2017 17:17:08
Objet: [ceph-users] Ceph counters decrementing after changing pg_num

Hello ceph users, 

My graphs of several counters in our Ceph cluster are showing abnormal 
behaviour after changing the pg_num and pgp_num respectively. 

We're using "http://eu.ceph.com/debian-hammer/ jessie/main". 


Is this a bug, or will the counters stabilize at some time in the near 
future? Or, is this otherwise fixable by "turning it off and on again"? 


Regards, 
Kai 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] 答复: Does this indicate a "CPU bottleneck"?

2017-01-19 Thread Alexandre DERUMIER
Have you checked cpu usage on the clients?


Also, when you increase the number of OSDs, do you increase pg_num?


Can you provide your fio job config?
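
(i.e. the full job file, something of this shape; the target device, sizes and
job count below are only placeholders:)

[randread-4k]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=4
runtime=60
time_based
group_reporting
filename=/dev/rbd0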

- Mail original -
De: "许雪寒" 
À: "John Spray" 
Cc: "ceph-users" 
Envoyé: Vendredi 20 Janvier 2017 07:25:35
Objet: [ceph-users]答复:  Does this indicate a "CPU bottleneck"?

The network is only about 10% utilized, and we tested the performance with 
different numbers of clients; it turned out that no matter how we increased 
the number of clients, the result was the same. 

-邮件原件- 
发件人: John Spray [mailto:jsp...@redhat.com] 
发送时间: 2017年1月19日 16:11 
收件人: 许雪寒 
抄送: ceph-users@lists.ceph.com 
主题: Re: [ceph-users] Does this indicate a "CPU bottleneck"? 

On Thu, Jan 19, 2017 at 8:51 AM, 许雪寒  wrote: 
> Hi, everyone. 
> 
> 
> 
> Recently, we did some stress tests on ceph using three machines. We 
> tested the IOPS of the whole small cluster with 1~8 OSDs per 
> machine and the results are as follows: 
> 
> 
> 
> OSD num per machine    fio iops 
> 1                      10k 
> 2                      16.5k 
> 3                      22k 
> 4                      23.5k 
> 5                      26k 
> 6                      27k 
> 7                      27k 
> 8                      28k 
> 
> 
> 
> As shown above, it seems that there is some kind of bottleneck when 
> there are more than 4 OSDs per machine. Meanwhile, we observed that 
> the CPU %idle during the test, shown below, also correlates with 
> the number of OSDs per machine. 
> 
> 
> 
> OSD num per machine    CPU idle 
> 1                      74% 
> 2                      52% 
> 3                      30% 
> 4                      25% 
> 5                      24% 
> 6                      17% 
> 7                      14% 
> 8                      11% 
> 
> 
> 
> It seems that as the number of OSDs per machine increases, the CPU 
> idle time decreases and the rate of decrease also shrinks. Can we 
> conclude that the CPU is the performance bottleneck in this test? 

Impossible to say without looking at what else was bottlenecked, such as the 
network or the client. 

John 

> 
> 
> Thank you :) 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 



Re: [ceph-users] Calamari or Alternative

2017-01-13 Thread Alexandre DERUMIER
Another tool :

http://openattic.org/

- Mail original -
De: "Marko Stojanovic" 
À: "Tu Holmes" , "John Petrini" 
Cc: "ceph-users" 
Envoyé: Vendredi 13 Janvier 2017 09:30:16
Objet: Re: [ceph-users] Calamari or Alternative



There is another nice tool for ceph monitoring: 


https://github.com/inkscope/inkscope 
It's a little hard to set up, but besides just monitoring you can also manage 
some items using it. 


regards 
Marko 


On 1/13/17 07:30, Tu Holmes wrote: 


I'll give ceph-dash a look. 

Thanks! 
On Thu, Jan 12, 2017 at 9:19 PM John Petrini <jpetr...@coredial.com> wrote: 


I used Calamari before making the move to Ubuntu 16.04 and upgrading to Jewel. 
At the time I tried to install it on 16.04 but couldn't get it working. 

I'm now using ceph-dash (https://github.com/Crapworks/ceph-dash) along with 
the nagios plugin check_ceph_dash (https://github.com/Crapworks/check_ceph_dash), 
and I've found that this gets me everything I need. A nice 
looking dashboard, graphs and alerting on the most important stats. 

Another plus is that it's incredibly easy to set up; you can have the dashboard 
up and running in five minutes. 




On Fri, Jan 13, 2017 at 12:06 AM, Tu Holmes <tu.hol...@gmail.com> wrote: 

Hey Cephers. 

Question for you. 

Do you guys use Calamari or an alternative? 

If so, why has the installation of Calamari not really gotten much better 
recently? 

Are you still building the vagrant installers and building packages? 

Just wondering what you are all doing. 

Thanks. 

//Tu 







___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 


