Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Robert LeBlanc
On Tue, Sep 24, 2019 at 4:33 AM Thomas <74cmo...@gmail.com> wrote:
>
> Hi,
>
> I'm experiencing the same issue with this setting in ceph.conf:
> osd op queue = wpq
> osd op queue cut off = high
>
> Furthermore I cannot read any old data in the relevant pool that is
> serving CephFS.
> However, I can write new data and read this new data.

If you restarted all the OSDs with this setting, it won't necessarily
prevent all blocked IO; it mainly prevents the really long blocked IO
and makes sure that IO is eventually completed in a fairer manner.
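
Concretely, the change I'm talking about is roughly this (a sketch, assuming
systemd-managed OSDs; the options only take effect once the OSDs restart):

[osd]
osd op queue = wpq
osd op queue cut off = high

# then restart the OSDs, e.g. one host at a time:
systemctl restart ceph-osd.target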

It sounds like you may have some MDS issues that are deeper than my
understanding. First thing I'd try is to bounce the MDS service.

> > If I want to add this to my ceph-ansible playbook parameters, in which file
> > should I add it, and what is the best way to do it?
> >
> > Should I add those 3 lines to all.yml or osds.yml?
> >
> > ceph_conf_overrides:
> >   global:
> >     osd_op_queue_cut_off: high
> >
> > Is there another (better?) way to do that?

I can't speak to either of those approaches. I wanted all my config in
a single file, so I put it in my inventory file, but it looks like you
have the right idea.
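
For reference, an untested sketch of what that override could look like in
ceph-ansible vars (the group_vars/all.yml placement and the extra
osd_op_queue line are my assumptions, not something from your playbook):

ceph_conf_overrides:
  osd:
    osd_op_queue: wpq
    osd_op_queue_cut_off: high

Scoping it under 'osd' rather than 'global' limits it to the OSDs, but either
should end up in the rendered ceph.conf.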


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Thomas
Hi,

I'm experiencing the same issue with this setting in ceph.conf:
    osd op queue = wpq
    osd op queue cut off = high

Furthermore I cannot read any old data in the relevant pool that is
serving CephFS.
However, I can write new data and read this new data.

Regards
Thomas

On 24.09.2019 at 10:24, Yoann Moulin wrote:
> Hello,
>
>>> I have a Ceph Nautilus Cluster 14.2.1 for cephfs only on 40x 1.8T SAS disk 
>>> (no SSD) in 20 servers.
>>>
>>> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow 
>>> requests, 0 included below; oldest blocked for > 60281.199503 secs"
>>>
>>> After a few investigations, I saw that ALL ceph-osd processes eat a lot of
>>> memory, up to 130GB RSS each. Is this value normal? May this be related to
>>> slow requests? Does having only disks (no SSD) increase the probability of slow requests?
>> If you haven't set:
>>
>> osd op queue cut off = high
>>
>> in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should
>> help quite a bit with pure HDD clusters.
> OK I'll try this, thanks.
>
> If I want to add this to my ceph-ansible playbook parameters, in which file
> should I add it, and what is the best way to do it?
>
> Should I add those 3 lines to all.yml or osds.yml?
>
> ceph_conf_overrides:
>   global:
>     osd_op_queue_cut_off: high
>
> Is there another (better?) way to do that?
>
> Thanks for your help.
>
> Best regards,
>



Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Yoann Moulin
Hello,

>> I have a Ceph Nautilus Cluster 14.2.1 for cephfs only on 40x 1.8T SAS disk 
>> (no SSD) in 20 servers.
>>
>> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow 
>> requests, 0 included below; oldest blocked for > 60281.199503 secs"
>>
>> After a few investigations, I saw that ALL ceph-osd processes eat a lot of
>> memory, up to 130GB RSS each. Is this value normal? May this be related to
>> slow requests? Does having only disks (no SSD) increase the probability of slow requests?
>
> If you haven't set:
> 
> osd op queue cut off = high
> 
> in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should
> help quite a bit with pure HDD clusters.

OK I'll try this, thanks.

If I want to add this to my ceph-ansible playbook parameters, in which file
should I add it, and what is the best way to do it?

Should I add those 3 lines to all.yml or osds.yml?

ceph_conf_overrides:
  global:
    osd_op_queue_cut_off: high

Is there another (better?) way to do that?

Thanks for your help.

Best regards,

-- 
Yoann Moulin
EPFL IC-IT


Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-23 Thread Robert LeBlanc
On Thu, Sep 19, 2019 at 2:36 AM Yoann Moulin  wrote:
>
> Hello,
>
> I have a Ceph Nautilus Cluster 14.2.1 for cephfs only on 40x 1.8T SAS disk 
> (no SSD) in 20 servers.
>
> >   cluster:
> > id: 778234df-5784-4021-b983-0ee1814891be
> > health: HEALTH_WARN
> > 2 MDSs report slow requests
> >
> >   services:
> > mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 5d)
> > mgr: icadmin008(active, since 18h), standbys: icadmin007, icadmin006
> > mds: cephfs:3 
> > {0=icadmin006=up:active,1=icadmin007=up:active,2=icadmin008=up:active}
> > osd: 40 osds: 40 up (since 2w), 40 in (since 3w)
> >
> >   data:
> > pools:   3 pools, 672 pgs
> > objects: 36.08M objects, 19 TiB
> > usage:   51 TiB used, 15 TiB / 65 TiB avail
> > pgs: 670 active+clean
> >  2   active+clean+scrubbing
>
> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow requests, 
> 0 included below; oldest blocked for > 60281.199503 secs"
>
> > HEALTH_WARN 2 MDSs report slow requests
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> > mdsicadmin007(mds.1): 3 slow requests are blocked > 30 secs
> > mdsicadmin006(mds.0): 10 slow requests are blocked > 30 secs
>
> After a few investigations, I saw that ALL ceph-osd processes eat a lot of
> memory, up to 130GB RSS each. Is this value normal? May this be related to
> slow requests? Does having only disks (no SSD) increase the probability of slow requests?
>
> > USER PID %CPU %MEM   VSZ   RSS TTY STAT STAR   TIME COMMAND
> > ceph   34196  3.6 35.0 156247524 138521572 ? Ssl  Jul01 4173:18 
> > /usr/bin/ceph-osd -f --cluster apollo --id 1 --setuser ceph --setgroup ceph
> > ceph   34394  3.6 35.0 160001436 138487776 ? Ssl  Jul01 4178:37 
> > /usr/bin/ceph-osd -f --cluster apollo --id 32 --setuser ceph --setgroup ceph
> > ceph   34709  3.5 35.1 156369636 138752044 ? Ssl  Jul01 4088:57 
> > /usr/bin/ceph-osd -f --cluster apollo --id 29 --setuser ceph --setgroup ceph
> > ceph   34915  3.4 35.1 158976936 138715900 ? Ssl  Jul01 3950:45 
> > /usr/bin/ceph-osd -f --cluster apollo --id 3 --setuser ceph --setgroup ceph
> > ceph   34156  3.4 35.1 158280768 138714484 ? Ssl  Jul01 3984:11 
> > /usr/bin/ceph-osd -f --cluster apollo --id 30 --setuser ceph --setgroup ceph
> > ceph   34378  3.7 35.1 155162420 138708096 ? Ssl  Jul01 4312:12 
> > /usr/bin/ceph-osd -f --cluster apollo --id 8 --setuser ceph --setgroup ceph
> > ceph   34161  3.5 35.0 159606788 138523652 ? Ssl  Jul01 4128:17 
> > /usr/bin/ceph-osd -f --cluster apollo --id 16 --setuser ceph --setgroup ceph
> > ceph   34380  3.6 35.1 161465372 138670168 ? Ssl  Jul01 4238:20 
> > /usr/bin/ceph-osd -f --cluster apollo --id 35 --setuser ceph --setgroup ceph
> > ceph   33822  3.7 35.1 163456644 138734036 ? Ssl  Jul01 4342:05 
> > /usr/bin/ceph-osd -f --cluster apollo --id 15 --setuser ceph --setgroup ceph
> > ceph   34003  3.8 35.0 161868584 138531208 ? Ssl  Jul01 4427:32 
> > /usr/bin/ceph-osd -f --cluster apollo --id 38 --setuser ceph --setgroup ceph
> > ceph9753  2.8 24.2 96923856 95580776 ?   Ssl  Sep02 700:25 
> > /usr/bin/ceph-osd -f --cluster apollo --id 31 --setuser ceph --setgroup ceph
> > ceph   10120  2.5 24.0 96130340 94856244 ?   Ssl  Sep02 644:50 
> > /usr/bin/ceph-osd -f --cluster apollo --id 7 --setuser ceph --setgroup ceph
> > ceph   36204  3.6 35.0 159394476 138592124 ? Ssl  Jul01 4185:36 
> > /usr/bin/ceph-osd -f --cluster apollo --id 18 --setuser ceph --setgroup ceph
> > ceph   36427  3.7 34.4 155699060 136076432 ? Ssl  Jul01 4298:26 
> > /usr/bin/ceph-osd -f --cluster apollo --id 36 --setuser ceph --setgroup ceph
> > ceph   36622  4.1 35.1 158219408 138724688 ? Ssl  Jul01 4779:14 
> > /usr/bin/ceph-osd -f --cluster apollo --id 19 --setuser ceph --setgroup ceph
> > ceph   36881  4.0 35.1 157748752 138719064 ? Ssl  Jul01 4669:54 
> > /usr/bin/ceph-osd -f --cluster apollo --id 37 --setuser ceph --setgroup ceph
> > ceph   34649  3.7 35.1 159601580 138652012 ? Ssl  Jul01 4337:20 
> > /usr/bin/ceph-osd -f --cluster apollo --id 14 --setuser ceph --setgroup ceph
> > ceph   34881  3.8 35.1 158632412 138764376 ? Ssl  Jul01 4433:50 
> > /usr/bin/ceph-osd -f --cluster apollo --id 33 --setuser ceph --setgroup ceph
> > ceph   34646  4.2 35.1 155029328 138732376 ? Ssl  Jul01 4831:24 
> > /usr/bin/ceph-osd -f --cluster apollo --id 17 --setuser ceph --setgroup ceph
> > ceph   34881  4.1 35.1 156801676 138763588 ? Ssl  Jul01 4710:19 
> > /usr/bin/ceph-osd -f --cluster apollo --id 39 --setuser ceph --setgroup ceph
> > ceph   36766  3.7 35.1 158070740 138703240 ? Ssl  Jul01 4341:42 
> > /usr/bin/ceph-osd -f --cluster apollo --id 13 --setuser ceph --setgroup ceph
> > ceph   37013  3.5 35.0 157767668 138272248 ? Ssl  Jul01 4094:12 
> > /usr/bin/ceph-osd -f --cluster apollo --id 34 --setuser ceph --setgroup ceph
> > ceph   35007  3.4 35.1 160318780 138756404 ? Ssl  Jul01 3963:21 

[ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-19 Thread Yoann Moulin
Hello,

I have a Ceph Nautilus cluster (14.2.1), used for CephFS only, on 40x 1.8T SAS disks (no
SSD) in 20 servers.

>   cluster:
> id: 778234df-5784-4021-b983-0ee1814891be
> health: HEALTH_WARN
> 2 MDSs report slow requests
>  
>   services:
> mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 5d)
> mgr: icadmin008(active, since 18h), standbys: icadmin007, icadmin006
> mds: cephfs:3 
> {0=icadmin006=up:active,1=icadmin007=up:active,2=icadmin008=up:active}
> osd: 40 osds: 40 up (since 2w), 40 in (since 3w)
>  
>   data:
> pools:   3 pools, 672 pgs
> objects: 36.08M objects, 19 TiB
> usage:   51 TiB used, 15 TiB / 65 TiB avail
> pgs: 670 active+clean
>  2   active+clean+scrubbing

I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow requests, 0 
included below; oldest blocked for > 60281.199503 secs"

> HEALTH_WARN 2 MDSs report slow requests
> MDS_SLOW_REQUEST 2 MDSs report slow requests
> mdsicadmin007(mds.1): 3 slow requests are blocked > 30 secs
> mdsicadmin006(mds.0): 10 slow requests are blocked > 30 secs

After a few investigations, I saw that ALL ceph-osd processes eat a lot of
memory, up to 130GB RSS each. Is this value normal? May this be related to
slow requests? Does having only disks (no SSD) increase the probability of slow requests?

> USER PID %CPU %MEM   VSZ   RSS TTY STAT STAR   TIME COMMAND
> ceph   34196  3.6 35.0 156247524 138521572 ? Ssl  Jul01 4173:18 
> /usr/bin/ceph-osd -f --cluster apollo --id 1 --setuser ceph --setgroup ceph
> ceph   34394  3.6 35.0 160001436 138487776 ? Ssl  Jul01 4178:37 
> /usr/bin/ceph-osd -f --cluster apollo --id 32 --setuser ceph --setgroup ceph
> ceph   34709  3.5 35.1 156369636 138752044 ? Ssl  Jul01 4088:57 
> /usr/bin/ceph-osd -f --cluster apollo --id 29 --setuser ceph --setgroup ceph
> ceph   34915  3.4 35.1 158976936 138715900 ? Ssl  Jul01 3950:45 
> /usr/bin/ceph-osd -f --cluster apollo --id 3 --setuser ceph --setgroup ceph
> ceph   34156  3.4 35.1 158280768 138714484 ? Ssl  Jul01 3984:11 
> /usr/bin/ceph-osd -f --cluster apollo --id 30 --setuser ceph --setgroup ceph
> ceph   34378  3.7 35.1 155162420 138708096 ? Ssl  Jul01 4312:12 
> /usr/bin/ceph-osd -f --cluster apollo --id 8 --setuser ceph --setgroup ceph
> ceph   34161  3.5 35.0 159606788 138523652 ? Ssl  Jul01 4128:17 
> /usr/bin/ceph-osd -f --cluster apollo --id 16 --setuser ceph --setgroup ceph
> ceph   34380  3.6 35.1 161465372 138670168 ? Ssl  Jul01 4238:20 
> /usr/bin/ceph-osd -f --cluster apollo --id 35 --setuser ceph --setgroup ceph
> ceph   33822  3.7 35.1 163456644 138734036 ? Ssl  Jul01 4342:05 
> /usr/bin/ceph-osd -f --cluster apollo --id 15 --setuser ceph --setgroup ceph
> ceph   34003  3.8 35.0 161868584 138531208 ? Ssl  Jul01 4427:32 
> /usr/bin/ceph-osd -f --cluster apollo --id 38 --setuser ceph --setgroup ceph
> ceph9753  2.8 24.2 96923856 95580776 ?   Ssl  Sep02 700:25 
> /usr/bin/ceph-osd -f --cluster apollo --id 31 --setuser ceph --setgroup ceph
> ceph   10120  2.5 24.0 96130340 94856244 ?   Ssl  Sep02 644:50 
> /usr/bin/ceph-osd -f --cluster apollo --id 7 --setuser ceph --setgroup ceph
> ceph   36204  3.6 35.0 159394476 138592124 ? Ssl  Jul01 4185:36 
> /usr/bin/ceph-osd -f --cluster apollo --id 18 --setuser ceph --setgroup ceph
> ceph   36427  3.7 34.4 155699060 136076432 ? Ssl  Jul01 4298:26 
> /usr/bin/ceph-osd -f --cluster apollo --id 36 --setuser ceph --setgroup ceph
> ceph   36622  4.1 35.1 158219408 138724688 ? Ssl  Jul01 4779:14 
> /usr/bin/ceph-osd -f --cluster apollo --id 19 --setuser ceph --setgroup ceph
> ceph   36881  4.0 35.1 157748752 138719064 ? Ssl  Jul01 4669:54 
> /usr/bin/ceph-osd -f --cluster apollo --id 37 --setuser ceph --setgroup ceph
> ceph   34649  3.7 35.1 159601580 138652012 ? Ssl  Jul01 4337:20 
> /usr/bin/ceph-osd -f --cluster apollo --id 14 --setuser ceph --setgroup ceph
> ceph   34881  3.8 35.1 158632412 138764376 ? Ssl  Jul01 4433:50 
> /usr/bin/ceph-osd -f --cluster apollo --id 33 --setuser ceph --setgroup ceph
> ceph   34646  4.2 35.1 155029328 138732376 ? Ssl  Jul01 4831:24 
> /usr/bin/ceph-osd -f --cluster apollo --id 17 --setuser ceph --setgroup ceph
> ceph   34881  4.1 35.1 156801676 138763588 ? Ssl  Jul01 4710:19 
> /usr/bin/ceph-osd -f --cluster apollo --id 39 --setuser ceph --setgroup ceph
> ceph   36766  3.7 35.1 158070740 138703240 ? Ssl  Jul01 4341:42 
> /usr/bin/ceph-osd -f --cluster apollo --id 13 --setuser ceph --setgroup ceph
> ceph   37013  3.5 35.0 157767668 138272248 ? Ssl  Jul01 4094:12 
> /usr/bin/ceph-osd -f --cluster apollo --id 34 --setuser ceph --setgroup ceph
> ceph   35007  3.4 35.1 160318780 138756404 ? Ssl  Jul01 3963:21 
> /usr/bin/ceph-osd -f --cluster apollo --id 2 --setuser ceph --setgroup ceph
> ceph   35217  3.5 35.1 159023744 138626680 ? Ssl  Jul01 4041:50 
> /usr/bin/ceph-osd -f --cluster apollo --id 22 --setuser 

[ceph-users] CephFS performance improved in 13.2.5?

2019-03-20 Thread Sergey Malinin
Hello,
Yesterday I upgraded from 13.2.2 to 13.2.5 and so far I have only seen 
significant improvements in MDS operations. Needless to say I'm happy, but I 
didn't notice anything related in release notes. Am I missing something, 
possibly new configuration settings?

Screenshots below:
https://prnt.sc/n0qzfp
https://prnt.sc/n0qzd5

And yes, the Ceph nodes and clients had their kernels upgraded to v5.0.3.


Re: [ceph-users] CephFS performance vs. underlying storage

2019-01-30 Thread Marc Roos

I was wondering the same, from a 'default' setup I get this performance,
no idea if this is bad, good or normal.

                        Cephfs ssd rep. 3      Cephfs ssd rep. 1      Samsung MZK7KM480 480GB
 size / pattern  unit   lat    iops   rate     lat    iops   rate     lat    iops    rate
 4k    r ran.    kB/s   2.78   1781   7297     0.54   1809   7412     0.09   10.2k   41600
 4k    w ran.    kB/s   1.42   700    2871     0.8    1238   5071     0.05   17.9k   73200
 4k    r seq.    MB/s   0.29   3314   13.6     0.29   3325   13.6     0.05   18k     77.6
 4k    w seq.    MB/s   0.04   889    3.64     0.56   1761   7.21     0.05   18.3k   75.1
 1024k r ran.    MB/s   4.3    231    243      4.27   233    245      2.06   482     506
 1024k w ran.    MB/s   0.08   132    139      4.34   229    241      2.16   460     483
 1024k r seq.    MB/s   4.23   235    247      4.21   236    248      1.98   502     527
 1024k w seq.    MB/s   6.99   142    150      4.34   229    241      2.13   466     489


(4 nodes, CentOS7, luminous) 

PS: not sure why you test with only one node. If you expand to a 2nd node,
you might get an unpleasant surprise with a drop in performance, because
you will be adding network latency that decreases your IOPS.



-Original Message-
From: Hector Martin [mailto:hec...@marcansoft.com]
Sent: 30 January 2019 19:43
To: ceph-users@lists.ceph.com
Subject: [ceph-users] CephFS performance vs. underlying storage

Hi list,

I'm experimentally running single-host CephFS as a replacement for
"traditional" filesystems.

My setup is 8×8TB HDDs using dm-crypt, with CephFS on a 5+2 EC pool. All
of the components are running on the same host (mon/osd/mds/kernel
CephFS client). I've set the stripe_unit/object_size to a relatively
high 80MB (up from the default 4MB). I figure I want individual reads on
the disks to be several megabytes per object for good sequential
performance, and since this is an EC pool 4MB objects would be split
into 800kB chunks, which is clearly not ideal. With 80MB objects, chunks
are 16MB, which sounds more like a healthy read size for sequential
access (e.g. something like 10 IOPS per disk during seq reads).

With this config, I get about 270MB/s sequential from CephFS. On the
same disks, an ext4 on dm-crypt on dm-raid6 yields ~680MB/s. So it seems
Ceph achieves less than half of the raw performance that the underlying
storage is capable of (with similar RAID redundancy). *

Obviously there will be some overhead with a stack as deep as Ceph
compared to more traditional setups, but I'm wondering if there are
improvements to be had here. While reading from CephFS I do not have
significant CPU usage, so I don't think I'm CPU limited. Could the issue
perhaps be latency through the stack / lack of read-ahead? Reading two
files in parallel doesn't really get me more than 300MB/s in total, so
parallelism doesn't seem to help much.

I'm curious as to whether there are any knobs I can play with to try to
improve performance, or whether this level of overhead is pretty much
inherent to Ceph. Even though this is an unusual single-host setup, I
imagine proper clusters might also have similar results when comparing
raw storage performance.

* Ceph has a slight disadvantage here because its chunk of the drives is
logically after the traditional RAID, and HDDs get slower towards higher
logical addresses, but this should be on the order of a 15-20% hit at
most.

--
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


[ceph-users] CephFS performance vs. underlying storage

2019-01-30 Thread Hector Martin
Hi list,

I'm experimentally running single-host CephFS as a replacement for
"traditional" filesystems.

My setup is 8×8TB HDDs using dm-crypt, with CephFS on a 5+2 EC pool. All
of the components are running on the same host (mon/osd/mds/kernel
CephFS client). I've set the stripe_unit/object_size to a relatively
high 80MB (up from the default 4MB). I figure I want individual reads on
the disks to be several megabytes per object for good sequential
performance, and since this is an EC pool 4MB objects would be split
into 800kB chunks, which is clearly not ideal. With 80MB objects, chunks
are 16MB, which sounds more like a healthy read size for sequential
access (e.g. something like 10 IOPS per disk during seq reads).
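
(In case anyone wants to reproduce this: such a layout can be set per
directory with the CephFS layout vxattrs before any files are created in it.
The mount point and directory below are just examples, and 83886080 bytes = 80MiB.)

setfattr -n ceph.dir.layout.object_size -v 83886080 /mnt/cephfs/data
setfattr -n ceph.dir.layout.stripe_unit -v 83886080 /mnt/cephfs/data
getfattr -n ceph.dir.layout /mnt/cephfs/data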

With this config, I get about 270MB/s sequential from CephFS. On the
same disks, an ext4 on dm-crypt on dm-raid6 yields ~680MB/s. So it seems
Ceph achieves less than half of the raw performance that the underlying
storage is capable of (with similar RAID redundancy). *

Obviously there will be some overhead with a stack as deep as Ceph
compared to more traditional setups, but I'm wondering if there are
improvements to be had here. While reading from CephFS I do not have
significant CPU usage, so I don't think I'm CPU limited. Could the issue
perhaps be latency through the stack / lack of read-ahead? Reading two
files in parallel doesn't really get me more than 300MB/s in total, so
parallelism doesn't seem to help much.

I'm curious as to whether there are any knobs I can play with to try to
improve performance, or whether this level of overhead is pretty much
inherent to Ceph. Even though this is an unusual single-host setup, I
imagine proper clusters might also have similar results when comparing
raw storage performance.

* Ceph has a slight disadvantage here because its chunk of the drives is
logically after the traditional RAID, and HDDs get slower towards higher
logical addresses, but this should be on the order of a 15-20% hit at most.

-- 
Hector Martin (hec...@marcansoft.com)
Public Key: https://mrcn.st/pub


Re: [ceph-users] cephfs performance degraded very fast

2019-01-22 Thread Yan, Zheng
On Tue, Jan 22, 2019 at 8:24 PM renjianxinlover  wrote:
>
> hi,
>    at times, due to cache pressure or caps release failure, client apps' mounts
> get stuck.
>    my use case is a kubernetes cluster with automatic kernel client mounts on
> the nodes.
>    has anyone faced the same issue or found a related solution?
> Brs
>
>

If you mean "client.xxx failing to respond to capability release".
you'd better to make sure all clients are uptodate (newest version of
ceph-fuse, recent kernel)



[ceph-users] cephfs performance degraded very fast

2019-01-22 Thread renjianxinlover
hi, 
   at times, due to cache pressure or caps release failure, client apps' mounts
get stuck.
   my use case is a kubernetes cluster with automatic kernel client mounts on
the nodes.
   has anyone faced the same issue or found a related solution?
Brs


Re: [ceph-users] CephFS performance.

2018-10-04 Thread Patrick Donnelly
On Thu, Oct 4, 2018 at 2:10 AM Ronny Aasen  wrote:
> in rbd there is a fancy striping solution, by using --stripe-unit and
> --stripe-count. This would get more spindles running ; perhaps consider
> using rbd instead of cephfs if it fits the workload.

CephFS also supports custom striping via layouts:
http://docs.ceph.com/docs/master/cephfs/file-layouts/
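
For example (paths are hypothetical; these vxattrs are what the page above
documents):

getfattr -n ceph.file.layout /mnt/cephfs/some/file
setfattr -n ceph.dir.layout.stripe_count -v 4 /mnt/cephfs/some/dir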

-- 
Patrick Donnelly


Re: [ceph-users] CephFS performance.

2018-10-04 Thread Ronny Aasen

On 10/4/18 7:04 AM, jes...@krogh.cc wrote:

Hi All.

First, thanks for the good discussion and strong answers I've gotten so far.

Current cluster setup is 4 x 10 x 12TB 7.2K RPM drives, all on 10GbitE, with
metadata on rotating drives - 3x replication - 256GB memory in the OSD hosts
and 32+ cores. Behind a Perc controller with each disk as a single-drive
RAID0 and BBWC.

Planned changes:
- get 1-2 more OSD hosts
- experiment with EC pools for CephFS
- move the MDS onto a separate host and the metadata onto SSDs

I'm still struggling to get "non-cached" performance up to "hardware"
speed - whatever that means. I run a "fio" benchmark using 10GB files, 16
threads and a 4M block size -- at which I can "almost" fill the 10GbitE NIC
on a sustained basis. In this configuration I would have expected the backend
to be capable of "way above" 10Gbit, and thus the NIC to be not "almost"
filled but fully filled - could that be the metadata activity? On big files
and reads that should not be much - right?

Above is actually ok for production, thus .. not a big issue, just
information.

Single-threaded performance is still struggling.

Cold HDD (read from disk on the NFS-server end) / NFS performance:

jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   15.86 GB in 00h00m27.53s:  589.88 MB/second


Local page cache (just to say it isn't the profiling tool delivering
limitations):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   29.24 GB in 00h00m09.15s:3.19 GB/second
jk@zebra03:~$

Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file> /dev/null
Summary:
Piped   36.79 GB in 00h03m47.66s:  165.49 MB/second

Can block/stripe-size be tuned? Does it make sense?
Does read-ahead on the CephFS kernel-client need tuning?
What performance are other people seeing?
Other thoughts - recommendations?

On some of the shares we're storing pretty large files (GB size) and
need the backup to move them to tape - so it is preferred to be capable
of filling an LTO6 drive's write speed to capacity with a single thread.

40'ish 7.2K RPM drives - should - add up to more than above.. right?
This is the only current load being put on the cluster - + 100MB/s
recovery traffic.




The problem with single-threaded performance in Ceph is that it reads
the spindles serially, so you are practically reading one drive at a
time and see a single disk's performance, minus all the overheads
from Ceph, network, MDS, etc.
So you do not get the combined performance of the drives, only one drive
at a time. The trick for Ceph performance is to get more spindles
working for you at the same time.



There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disk/network/cpu/memory
- larger pre-fetching/read-ahead; with a large enough read-ahead more
OSDs will participate in reading simultaneously. [1] shows a table of
benchmarks with different read-ahead sizes (see the mount example after
this list).
- erasure coding. While erasure coding does add latency vs replicated
pools, you will get more spindles involved in reading in parallel, so
for large sequential loads erasure coding can have a benefit.
- some sort of extra caching scheme; I have not looked at cachefiles,
but it may provide some benefit.
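
for the read-ahead point above, on the cephfs kernel client this is typically
set at mount time; a rough example (monitor address, credentials and the
128MiB rasize value are only placeholders):

mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret,rasize=134217728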



You can also play with different CephFS client implementations; there is a FUSE
client where you can play with different cache solutions, but generally
the kernel client is faster.


In RBD there is a fancy striping solution using --stripe-unit and
--stripe-count. This would get more spindles running; perhaps consider
using RBD instead of CephFS if it fits the workload.



[1] 
https://tracker.ceph.com/projects/ceph/wiki/Kernel_client_read_ahead_optimization


good luck
Ronny Aasen


[ceph-users] CephFS performance.

2018-10-03 Thread jesper
Hi All.

First, thanks for the good discussion and strong answers I've gotten so far.

Current cluster setup is 4 x 10 x 12TB 7.2K RPM drives, all on 10GbitE, with
metadata on rotating drives - 3x replication - 256GB memory in the OSD hosts
and 32+ cores. Behind a Perc controller with each disk as a single-drive
RAID0 and BBWC.

Planned changes:
- get 1-2 more OSD hosts
- experiment with EC pools for CephFS
- move the MDS onto a separate host and the metadata onto SSDs

I'm still struggling to get "non-cached" performance up to "hardware"
speed - whatever that means. I run a "fio" benchmark using 10GB files, 16
threads and a 4M block size -- at which I can "almost" fill the 10GbitE NIC
on a sustained basis. In this configuration I would have expected the backend
to be capable of "way above" 10Gbit, and thus the NIC to be not "almost"
filled but fully filled - could that be the metadata activity? On big files
and reads that should not be much - right?

Above is actually ok for production, thus .. not a big issue, just
information.

Single-threaded performance is still struggling.

Cold HDD (read from disk on the NFS-server end) / NFS performance:

jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   15.86 GB in 00h00m27.53s:  589.88 MB/second


Local page cache (just to say it isn't the profiling tool delivering
limitations):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   29.24 GB in 00h00m09.15s:3.19 GB/second
jk@zebra03:~$

Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file> /dev/null
Summary:
Piped   36.79 GB in 00h03m47.66s:  165.49 MB/second

Can block/stripe-size be tuned? Does it make sense?
Does read-ahead on the CephFS kernel-client need tuning?
What performance are other people seeing?
Other thoughts - recommendations?

On some of the shares we're storing pretty large files (GB size) and
need the backup to move them to tape - so it is preferred to be capable
of filling an LTO6 drive's write speed to capacity with a single thread.

40'ish 7.2K RPM drives - should - add up to more than above.. right?
This is the only current load being put on the cluster - + 100MB/s
recovery traffic.


Thanks.

Jesper



Re: [ceph-users] cephfs performance issue

2018-03-29 Thread Ouyang Xu

Hi David:

That works, thank you very much!

Best regards,

Steven

On 2018年03月29日 18:30, David C wrote:

Pretty sure you're getting stung by: http://tracker.ceph.com/issues/17563

Consider using an elrepo kernel, 4.14 works well for me.



On Thu, 29 Mar 2018, 09:46 Dan van der Ster, > wrote:


On Thu, Mar 29, 2018 at 10:31 AM, Robert Sander
> wrote:
> On 29.03.2018 09:50, ouyangxu wrote:
>
>> I'm using Ceph 12.2.4 with CentOS 7.4, and tring to use cephfs for
>> MariaDB deployment,
>
> Don't do this.
> As the old saying goes: If it hurts, stop doing it.

Why not? Let's find out where and why the perf is lacking, then
fix it!

-- dan


Re: [ceph-users] cephfs performance issue

2018-03-29 Thread David C
Pretty sure you're getting stung by: http://tracker.ceph.com/issues/17563

Consider using an elrepo kernel, 4.14 works well for me.
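
(Roughly: install the elrepo-release package for EL7 from elrepo.org, then
pull a newer kernel from the elrepo-kernel repo and boot into it. Something
like the following -- package name from memory, so check the elrepo docs.)

yum --enablerepo=elrepo-kernel install kernel-ml
# then set the new kernel as the grub default and reboot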



On Thu, 29 Mar 2018, 09:46 Dan van der Ster,  wrote:

> On Thu, Mar 29, 2018 at 10:31 AM, Robert Sander
>  wrote:
> > On 29.03.2018 09:50, ouyangxu wrote:
> >
> >> I'm using Ceph 12.2.4 with CentOS 7.4, and tring to use cephfs for
> >> MariaDB deployment,
> >
> > Don't do this.
> > As the old saying goes: If it hurts, stop doing it.
>
> Why not? Let's find out where and why the perf is lacking, then fix it!
>
> -- dan


Re: [ceph-users] cephfs performance issue

2018-03-29 Thread Dan van der Ster
On Thu, Mar 29, 2018 at 10:31 AM, Robert Sander
 wrote:
> On 29.03.2018 09:50, ouyangxu wrote:
>
>> I'm using Ceph 12.2.4 with CentOS 7.4, and tring to use cephfs for
>> MariaDB deployment,
>
> Don't do this.
> As the old saying goes: If it hurts, stop doing it.

Why not? Let's find out where and why the perf is lacking, then fix it!

-- dan


Re: [ceph-users] cephfs performance issue

2018-03-29 Thread Robert Sander
On 29.03.2018 09:50, ouyangxu wrote:

> I'm using Ceph 12.2.4 with CentOS 7.4, and tring to use cephfs for
> MariaDB deployment,

Don't do this.
As the old saying goes: If it hurts, stop doing it.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory information per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein -- Registered office: Berlin





[ceph-users] cephfs performance issue

2018-03-29 Thread ouyangxu
Hi Ceph users:

I'm using Ceph 12.2.4 with CentOS 7.4, and am trying to use CephFS for a MariaDB
deployment. The configuration is default, but I got very poor performance during
table creation; with a local file system there is no such issue.

Here is the sql scripts I used:
[root@cmv01cn01]$ cat mysql_test.sql
CREATE TABLE test.t001 (col INT)\g
CREATE TABLE test.t002 (col INT)\g
CREATE TABLE test.t003 (col INT)\g
CREATE TABLE test.t004 (col INT)\g
CREATE TABLE test.t005 (col INT)\g
CREATE TABLE test.t006 (col INT)\g
CREATE TABLE test.t007 (col INT)\g
CREATE TABLE test.t008 (col INT)\g
CREATE TABLE test.t009 (col INT)\g
DROP TABLE test.t001\g
DROP TABLE test.t002\g
DROP TABLE test.t003\g
DROP TABLE test.t004\g
DROP TABLE test.t005\g
DROP TABLE test.t006\g
DROP TABLE test.t007\g
DROP TABLE test.t008\g
DROP TABLE test.t009\g

The following is the running result:
[root@cmv01cn01]$ mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 6522
Server version: 10.1.20-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> source mysql_test.sql
Query OK, 0 rows affected (3.26 sec)
Query OK, 0 rows affected (5.32 sec)
Query OK, 0 rows affected (4.53 sec)
Query OK, 0 rows affected (5.09 sec)
Query OK, 0 rows affected (4.96 sec)
Query OK, 0 rows affected (4.94 sec)
Query OK, 0 rows affected (4.96 sec)
Query OK, 0 rows affected (5.02 sec)
Query OK, 0 rows affected (5.08 sec)
Query OK, 0 rows affected (0.11 sec)
Query OK, 0 rows affected (0.07 sec)
Query OK, 0 rows affected (0.07 sec)
Query OK, 0 rows affected (0.10 sec)
Query OK, 0 rows affected (0.06 sec)
Query OK, 0 rows affected (0.10 sec)
Query OK, 0 rows affected (0.02 sec)
Query OK, 0 rows affected (0.02 sec)
Query OK, 0 rows affected (0.02 sec)
MariaDB [(none)]>

As you can see, the average time for creating a table is around 5s; for
dropping a table it is acceptable.

I also dumped the ops for mds as below:

[root@cmv01sn01 ceph]# ceph daemon mds.cmv01sn01 dump_ops_in_flight
{
"ops": [
{
"description": "client_request(client.854369:4659 create 
#0x1003994/t2_0.frm 2018-03-28 17:24:53.744090 caller_uid=27, 
caller_gid=27{})",
"initiated_at": "2018-03-28 17:24:53.744827",
"age": 4.567939,
"duration": 4.567955,
"type_data": {
"flag_point": "submit entry: journal_and_reply",
"reqid": "client.854369:4659",
"op_type": "client_request",
"client_info": {
"client": "client.854369",
"tid": 4659
},
"events": [
{
"time": "2018-03-28 17:24:53.744827",
"event": "initiated"
},
{
"time": "2018-03-28 17:24:53.745226",
"event": "acquired locks"
},
{
"time": "2018-03-28 17:24:53.745364",
"event": "early_replied"
},
{
"time": "2018-03-28 17:24:53.745367",
"event": "submit entry: journal_and_reply"
}
]
}
},
{
"description": "client_request(client.854369:4660 create 
#0x1003994/t2_0.ibd 2018-03-28 17:24:53.751090 caller_uid=27, 
caller_gid=27{})",
"initiated_at": "2018-03-28 17:24:53.752039",
"age": 4.560727,
"duration": 4.560763,
"type_data": {
"flag_point": "submit entry: journal_and_reply",
"reqid": "client.854369:4660",
"op_type": "client_request",
"client_info": {
"client": "client.854369",
"tid": 4660
},
"events": [
{
"time": "2018-03-28 17:24:53.752039",
"event": "initiated"
},
{
"time": "2018-03-28 17:24:53.752358",
"event": "acquired locks"
},
{
"time": "2018-03-28 17:24:53.752480",
"event": "early_replied"
},
{
"time": "2018-03-28 17:24:53.752483",
"event": "submit entry: journal_and_reply"
}
]
}
}
],
"num_ops": 2
}

It seems like it is stuck at journal_and_reply.

So, has anyone faced this situation? Any thoughts are appreciated.

Thanks,

Steven

Re: [ceph-users] CephFS Performance

2017-05-10 Thread Webert de Souza Lima
On Tue, May 9, 2017 at 9:07 PM, Brady Deetz  wrote:

> So with email, you're talking about lots of small reads and writes. In my
> experience with dicom data (thousands of 20KB files per directory), cephfs
> doesn't perform very well at all on platter drivers. I haven't experimented
> with pure ssd configurations, so I can't comment on that.
>

Yes, that's pretty much why I'm using cache tiering on SSDs.


> Somebody may correct me here, but small block io on writes just makes
> latency all that much more important due to the need to wait for your
> replicas to be written before moving on to the next block.
>

I think that is correct. Smaller blocks = more I/O, so SSDs benefit a lot.


> Without know exact hardware details, my brain is immediately jumping to
> networking constraints. 2 or 3 spindle drives can pretty much saturate a
> 1gbps link. As soon as you create contention for that resource, you create
> system load for iowait and latency.
>
You mentioned you don't control the network. Maybe you can scale down and
> out.
>

 I'm constrained to the topology I showed you for now. I did plan
another (see
https://creately.com/diagram/j1eyig9i/7wloXLNOAYjeregBGkvelMXL50%3D) but it
won't be possible at this time.
 That setup would have a 10 gig interconnection link.

On Wed, May 10, 2017 at 3:55 AM, John Spray  wrote:

>
> Hmm, to understand this better I would start by taking cache tiering
> out of the mix, it adds significant complexity.
>
> The "-direct=1" part could be significant here: when you're using RBD,
> that's getting handled by ext4, and then ext4 is potentially still
> benefiting from some caching at the ceph layer.  With CephFS on the
> other hand, it's getting handled by CephFS, and CephFS will be
> laboriously doing direct access to OSD.
>
> John


I won't be able to change that for now; I would need another testing cluster.
The point of direct=1 was to remove any caching possibility in the middle.
That fio suite was suggested by user peetaur on the IRC channel (thanks :)
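
(For context, the runs were along these lines -- an illustrative sketch only,
since the exact parameters are in the justpaste link above and the directory
here is a placeholder:)

fio --name=randwrite --directory=/mnt/cephfs/bench --rw=randwrite --bs=4k \
    --size=1G --numjobs=4 --iodepth=16 --direct=1 --ioengine=libaio --group_reporting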

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Brady Deetz
Re-adding the list:

So with email, you're talking about lots of small reads and writes. In my
experience with dicom data (thousands of 20KB files per directory), cephfs
doesn't perform very well at all on platter drives. I haven't experimented
with pure ssd configurations, so I can't comment on that.

Somebody may correct me here, but small block io on writes just makes
latency all that much more important due to the need to wait for your
replicas to be written before moving on to the next block.

Without know exact hardware details, my brain is immediately jumping to
networking constraints. 2 or 3 spindle drives can pretty much saturate a
1gbps link. As soon as you create contention for that resource, you create
system load for iowait and latency.

You mentioned you don't control the network. Maybe you can scale down and
out.


On May 9, 2017 5:38 PM, "Webert de Souza Lima" 
wrote:


On Tue, May 9, 2017 at 4:40 PM, Brett Niver  wrote:

> What is your workload like?  Do you have a single or multiple active
> MDS ranks configured?


User traffic is heavy. I can't really say in terms of mb/s or iops but it's
an email server with 25k+ users, usually about 6k simultaneously connected
receiving and reading emails.
I have only one active MDS configured. The others are Stand-by.

On Tue, May 9, 2017 at 7:18 PM, Wido den Hollander  wrote:

>
> > On 9 May 2017 at 20:26, Brady Deetz wrote:
> >
> >
> > If I'm reading your cluster diagram correctly, I'm seeing a 1gbps
> > interconnect, presumably cat6. Due to the additional latency of
> performing
> > metadata operations, I could see cephfs performing at those speeds. Are
> you
> > using jumbo frames? Also are you routing?
> >
> > If you're routing, the router will introduce additional latency that an
> l2
> > network wouldn't experience.
> >
>
> Partially true. I am running various Ceph clusters using L3 routing and
> with a decent router the latency for routing a packet is minimal, like 0.02
> ms or so.
>
> Ceph spends much more time in the CPU then it will take the network to
> forward that IP-packet.
>
> I wouldn't be too afraid to run Ceph over a L3 network.
>
> Wido
>
> > On May 9, 2017 12:01 PM, "Webert de Souza Lima" 
> > wrote:
> >
> > > Hello all,
> > >
> > > I'm been using cephfs for a while but never really evaluated its
> > > performance.
> > > As I put up a new ceph cluster, I though that I should run a benchmark
> to
> > > see if I'm going the right way.
> > >
> > > By the results I got, I see that RBD performs *a lot* better in
> > > comparison to cephfs.
> > >
> > > The cluster is like this:
> > >  - 2 hosts with one SSD OSD each.
> > >this hosts have 2 pools: cephfs_metadata and cephfs_cache (for
> > > cache tiering).
> > >  - 3 hosts with 5 HDD OSDs each.
> > >   this hosts have 1 pool: cephfs_data.
> > >
> > > all details, cluster set up and results can be seen here:
> > > https://justpaste.it/167fr
> > >
> > > I created the RBD pools the same way as the CEPHFS pools except for the
> > > number of PGs in the data pool.
> > >
> > > I wonder why that difference or if I'm doing something wrong.
> > >
> > > Regards,
> > >
> > > Webert Lima
> > > DevOps Engineer at MAV Tecnologia
> > > *Belo Horizonte - Brasil*
> > >


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Webert de Souza Lima
On Tue, May 9, 2017 at 4:40 PM, Brett Niver  wrote:

> What is your workload like?  Do you have a single or multiple active
> MDS ranks configured?


User traffic is heavy. I can't really say in terms of mb/s or iops but it's
an email server with 25k+ users, usually about 6k simultaneously connected
receiving and reading emails.
I have only one active MDS configured. The others are Stand-by.

On Tue, May 9, 2017 at 7:18 PM, Wido den Hollander  wrote:

>
> > On 9 May 2017 at 20:26, Brady Deetz wrote:
> >
> >
> > If I'm reading your cluster diagram correctly, I'm seeing a 1gbps
> > interconnect, presumably cat6. Due to the additional latency of
> performing
> > metadata operations, I could see cephfs performing at those speeds. Are
> you
> > using jumbo frames? Also are you routing?
> >
> > If you're routing, the router will introduce additional latency that an
> l2
> > network wouldn't experience.
> >
>
> Partially true. I am running various Ceph clusters using L3 routing and
> with a decent router the latency for routing a packet is minimal, like 0.02
> ms or so.
>
> Ceph spends much more time in the CPU then it will take the network to
> forward that IP-packet.
>
> I wouldn't be too afraid to run Ceph over a L3 network.
>
> Wido
>
> > On May 9, 2017 12:01 PM, "Webert de Souza Lima" 
> > wrote:
> >
> > > Hello all,
> > >
> > > I'm been using cephfs for a while but never really evaluated its
> > > performance.
> > > As I put up a new ceph cluster, I though that I should run a benchmark
> to
> > > see if I'm going the right way.
> > >
> > > By the results I got, I see that RBD performs *a lot* better in
> > > comparison to cephfs.
> > >
> > > The cluster is like this:
> > >  - 2 hosts with one SSD OSD each.
> > >this hosts have 2 pools: cephfs_metadata and cephfs_cache (for
> > > cache tiering).
> > >  - 3 hosts with 5 HDD OSDs each.
> > >   this hosts have 1 pool: cephfs_data.
> > >
> > > all details, cluster set up and results can be seen here:
> > > https://justpaste.it/167fr
> > >
> > > I created the RBD pools the same way as the CEPHFS pools except for the
> > > number of PGs in the data pool.
> > >
> > > I wonder why that difference or if I'm doing something wrong.
> > >
> > > Regards,
> > >
> > > Webert Lima
> > > DevOps Engineer at MAV Tecnologia
> > > *Belo Horizonte - Brasil*
> > >


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Wido den Hollander

> On 9 May 2017 at 20:26, Brady Deetz wrote:
> 
> 
> If I'm reading your cluster diagram correctly, I'm seeing a 1gbps
> interconnect, presumably cat6. Due to the additional latency of performing
> metadata operations, I could see cephfs performing at those speeds. Are you
> using jumbo frames? Also are you routing?
> 
> If you're routing, the router will introduce additional latency that an l2
> network wouldn't experience.
> 

Partially true. I am running various Ceph clusters using L3 routing and with a 
decent router the latency for routing a packet is minimal, like 0.02 ms or so.

Ceph spends much more time in the CPU than it takes the network to forward
that IP packet.

I wouldn't be too afraid to run Ceph over a L3 network.

Wido

> On May 9, 2017 12:01 PM, "Webert de Souza Lima" 
> wrote:
> 
> > Hello all,
> >
> > I'm been using cephfs for a while but never really evaluated its
> > performance.
> > As I put up a new ceph cluster, I though that I should run a benchmark to
> > see if I'm going the right way.
> >
> > By the results I got, I see that RBD performs *a lot* better in
> > comparison to cephfs.
> >
> > The cluster is like this:
> >  - 2 hosts with one SSD OSD each.
> >this hosts have 2 pools: cephfs_metadata and cephfs_cache (for
> > cache tiering).
> >  - 3 hosts with 5 HDD OSDs each.
> >   this hosts have 1 pool: cephfs_data.
> >
> > all details, cluster set up and results can be seen here:
> > https://justpaste.it/167fr
> >
> > I created the RBD pools the same way as the CEPHFS pools except for the
> > number of PGs in the data pool.
> >
> > I wonder why that difference or if I'm doing something wrong.
> >
> > Regards,
> >
> > Webert Lima
> > DevOps Engineer at MAV Tecnologia
> > *Belo Horizonte - Brasil*
> >


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Brett Niver
What is your workload like?  Do you have a single or multiple active
MDS ranks configured?


On Tue, May 9, 2017 at 3:10 PM, Webert de Souza Lima
 wrote:
> That 1gbps link is the only option I have for those servers, unfortunately.
> It's all dedicated server rentals from OVH.
> I don't have information regarding the internals of the vrack.
>
> So by what you said, I understand that one should expect a performance drop
> in comparison to ceph rbd using the same architecture , right?
>
> Thanks.
>


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Webert de Souza Lima
That 1gbps link is the only option I have for those servers, unfortunately.
It's all dedicated server rentals from OVH.
I don't have information regarding the internals of the vrack.

So by what you said, I understand that one should expect a performance drop
in comparison to Ceph RBD using the same architecture, right?

Thanks.


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Brady Deetz
If I'm reading your cluster diagram correctly, I'm seeing a 1gbps
interconnect, presumably cat6. Due to the additional latency of performing
metadata operations, I could see cephfs performing at those speeds. Are you
using jumbo frames? Also are you routing?

If you're routing, the router will introduce additional latency that an l2
network wouldn't experience.

On May 9, 2017 12:01 PM, "Webert de Souza Lima" 
wrote:

> Hello all,
>
> I'm been using cephfs for a while but never really evaluated its
> performance.
> As I put up a new ceph cluster, I though that I should run a benchmark to
> see if I'm going the right way.
>
> By the results I got, I see that RBD performs *a lot* better in
> comparison to cephfs.
>
> The cluster is like this:
>  - 2 hosts with one SSD OSD each.
>this hosts have 2 pools: cephfs_metadata and cephfs_cache (for
> cache tiering).
>  - 3 hosts with 5 HDD OSDs each.
>   this hosts have 1 pool: cephfs_data.
>
> all details, cluster set up and results can be seen here:
> https://justpaste.it/167fr
>
> I created the RBD pools the same way as the CEPHFS pools except for the
> number of PGs in the data pool.
>
> I wonder why that difference or if I'm doing something wrong.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>


[ceph-users] CephFS Performance

2017-05-09 Thread Webert de Souza Lima
Hello all,

I've been using cephfs for a while but never really evaluated its
performance.
As I put up a new ceph cluster, I thought that I should run a benchmark to
see if I'm going the right way.

By the results I got, I see that RBD performs *a lot* better in comparison
to cephfs.

The cluster is like this:
 - 2 hosts with one SSD OSD each.
   these hosts have 2 pools: cephfs_metadata and cephfs_cache (for cache
tiering).
 - 3 hosts with 5 HDD OSDs each.
  these hosts have 1 pool: cephfs_data.

all details, cluster set up and results can be seen here:
https://justpaste.it/167fr

I created the RBD pools the same way as the CEPHFS pools except for the
number of PGs in the data pool.

I wonder why that difference or if I'm doing something wrong.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*


Re: [ceph-users] cephfs performance benchmark -- metadata intensive

2016-08-12 Thread John Spray
On Thu, Aug 11, 2016 at 1:24 PM, Brett Niver  wrote:
> Patrick and I had a related question yesterday, are we able to dynamically
> vary cache size to artificially manipulate cache pressure?

Yes -- at the top of MDCache::trim the max size is read straight out
of g_conf so it should pick up on any changes you do with "tell
injectargs".  Things might be a little bit funny though because the
new cache limit wouldn't be reflected in the logic in lru_adjust().
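
(i.e. something along the lines of the following, where mds_cache_size is the
Jewel-era inode-count limit and the target name/value are placeholders:)

ceph tell mds.<name> injectargs '--mds_cache_size=200000'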

John

> On Thu, Aug 11, 2016 at 6:07 AM, John Spray  wrote:
>>
>> On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen 
>> wrote:
>> > Hi ,
>> >
>> >
>> >  Here is the slide I shared yesterday on performance meeting.
>> > Thanks and hoping for inputs.
>> >
>> >
>> >
>> > http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark
>>
>> These are definitely useful results and I encourage everyone working
>> with cephfs to go and look at Xiaoxi's slides.
>>
>> The main thing that this highlighted for me was our lack of testing so
>> far on systems with full caches.  Too much of our existing testing is
>> done on freshly configured systems that never fill the MDS cache.
>>
>> Test 2.1 notes that we don't enable directory fragmentation by default
>> currently -- this is an issue, and I'm hoping we can switch it on by
>> default in Kraken (see thread "Switching on mds_bal_frag by default").
>> In the meantime we have the fix that Patrick wrote for Jewel which at
>> least prevents people creating dirfrags too large for the OSDs to
>> handle.
>>
>> Test 2.2: since a "failing to respond to cache pressure" bug is
>> affecting this, I would guess we see the performance fall off at about
>> the point where the *client* caches fill up (so they start trimming
>> things even though they're ignoring cache pressure).  It would be
>> interesting to see this chart with addition lines for some related
>> perf counters like mds_log.evtrm and mds.inodes_expired, that might
>> make it pretty obvious where the MDS is entering different stages that
>> see a decrease in the rate of handling client requests.
>>
>> We really need to sort out the "failing to respond to cache pressure"
>> issues that keep popping up, especially if they're still happening on
>> a comparatively simple test that is just creating files.  We have a
>> specific test for this[1] that is currently being run against the fuse
>> client but not the kernel client[2].  This is a good time to try and
>> push that forward so I've kicked off an experimental run here:
>>
>> http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/
>>
>> In the meantime, although there are reports of similar issues with
>> newer kernels, it would be very useful to confirm if the same issue is
>> still occurring with more recent kernels.  Issues with cache trimming
>> have occurred due to various (separate) bugs, so it's possible that
>> while some people are still seeing cache trimming issues with recent
>> kernels, the specific case you're hitting might be fixed.
>>
>> Test 2.3: restarting the MDS doesn't actually give you a completely
>> empty cache (everything in the journal gets replayed to pre-populate
>> the cache on MDS startup).  However, the results are still valid
>> because you're using a different random order in the non-caching test
>> case, and the number of inodes in your journal is probably much
>> smaller than the overall cache size so it's only a little bit
>> populated.  We don't currently have a "drop cache" command built into
>> the MDS but it would be pretty easy to add one for use in testing
>> (basically just call mds->mdcache->trim(0)).
>>
>> As one would imagine, the non-caching case is latency-dominated when
>> the working set is larger than the cache, where each client is waiting
>> for one open to finish before proceeding to the next.  The MDS is
>> probably capable of handling many more operations per second, but it
>> would need more parallel IO operations from the clients.  When a
>> single client is doing opens one by one, you're potentially seeing a
>> full network+disk latency for each one (though in practice the OSD
>> read cache will be helping a lot here).  This non-caching case would
>> be the main argument for giving the metadata pool low latency (SSD)
>> storage.
>>
>> Test 2.5: The observation that the CPU bottleneck makes using fast
>> storage for the metadata pool less useful (in sequential/cached cases)
>> is valid, although it could still be useful to isolate the metadata
>> OSDs (probably SSDs since not so much capacity is needed) to avoid
>> competing with data operations.  For random access in the non-caching
>> cases (2.3, 2.4) I think you would probably see an improvement from
>> SSDs.
>>
>> Thanks again to the team from ebay for sharing all this.
>>
>> John
>>
>>
>>
>> 1.
>> https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96
>> 2. 

Re: [ceph-users] cephfs performance benchmark -- metadata intensive

2016-08-11 Thread Brett Niver
Patrick and I had a related question yesterday, are we able to dynamically
vary cache size to artificially manipulate cache pressure?

On Thu, Aug 11, 2016 at 6:07 AM, John Spray  wrote:

> On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen 
> wrote:
> > Hi ,
> >
> >
> >  Here is the slide I shared yesterday on performance meeting.
> > Thanks and hoping for inputs.
> >
> >
> > http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-
> performance-benchmark
>
> These are definitely useful results and I encourage everyone working
> with cephfs to go and look at Xiaoxi's slides.
>
> The main thing that this highlighted for me was our lack of testing so
> far on systems with full caches.  Too much of our existing testing is
> done on freshly configured systems that never fill the MDS cache.
>
> Test 2.1 notes that we don't enable directory fragmentation by default
> currently -- this is an issue, and I'm hoping we can switch it on by
> default in Kraken (see thread "Switching on mds_bal_frag by default").
> In the meantime we have the fix that Patrick wrote for Jewel which at
> least prevents people creating dirfrags too large for the OSDs to
> handle.
>
> Test 2.2: since a "failing to respond to cache pressure" bug is
> affecting this, I would guess we see the performance fall off at about
> the point where the *client* caches fill up (so they start trimming
> things even though they're ignoring cache pressure).  It would be
> interesting to see this chart with additional lines for some related
> perf counters like mds_log.evtrm and mds.inodes_expired, that might
> make it pretty obvious where the MDS is entering different stages that
> see a decrease in the rate of handling client requests.
>
> We really need to sort out the "failing to respond to cache pressure"
> issues that keep popping up, especially if they're still happening on
> a comparatively simple test that is just creating files.  We have a
> specific test for this[1] that is currently being run against the fuse
> client but not the kernel client[2].  This is a good time to try and
> push that forward so I've kicked off an experimental run here:
> http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/
>
> In the meantime, although there are reports of similar issues with
> newer kernels, it would be very useful to confirm if the same issue is
> still occurring with more recent kernels.  Issues with cache trimming
> have occurred due to various (separate) bugs, so it's possible that
> while some people are still seeing cache trimming issues with recent
> kernels, the specific case you're hitting might be fixed.
>
> Test 2.3: restarting the MDS doesn't actually give you a completely
> empty cache (everything in the journal gets replayed to pre-populate
> the cache on MDS startup).  However, the results are still valid
> because you're using a different random order in the non-caching test
> case, and the number of inodes in your journal is probably much
> smaller than the overall cache size so it's only a little bit
> populated.  We don't currently have a "drop cache" command built into
> the MDS but it would be pretty easy to add one for use in testing
> (basically just call mds->mdcache->trim(0)).
>
> As one would imagine, the non-caching case is latency-dominated when
> the working set is larger than the cache, where each client is waiting
> for one open to finish before proceeding to the next.  The MDS is
> probably capable of handling many more operations per second, but it
> would need more parallel IO operations from the clients.  When a
> single client is doing opens one by one, you're potentially seeing a
> full network+disk latency for each one (though in practice the OSD
> read cache will be helping a lot here).  This non-caching case would
> be the main argument for giving the metadata pool low latency (SSD)
> storage.
>
> Test 2.5: The observation that the CPU bottleneck makes using fast
> storage for the metadata pool less useful (in sequential/cached cases)
> is valid, although it could still be useful to isolate the metadata
> OSDs (probably SSDs since not so much capacity is needed) to avoid
> competing with data operations.  For random access in the non-caching
> cases (2.3, 2.4) I think you would probably see an improvement from
> SSDs.
>
> Thanks again to the team from eBay for sharing all this.
>
> John
>
>
>
> 1. https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96
> 2. http://tracker.ceph.com/issues/9466
>
>
> >
> > Xiaoxi
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Re: [ceph-users] cephfs performance benchmark -- metadata intensive

2016-08-11 Thread John Spray
On Thu, Aug 11, 2016 at 8:29 AM, Xiaoxi Chen  wrote:
> Hi ,
>
>
>  Here is the slide I shared yesterday at the performance meeting.
> Thanks, and hoping for your input.
>
>
> http://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark

These are definitely useful results and I encourage everyone working
with cephfs to go and look at Xiaoxi's slides.

The main thing that this highlighted for me was our lack of testing so
far on systems with full caches.  Too much of our existing testing is
done on freshly configured systems that never fill the MDS cache.

Test 2.1 notes that we don't enable directory fragmentation by default
currently -- this is an issue, and I'm hoping we can switch it on by
default in Kraken (see thread "Switching on mds_bal_frag by default").
In the meantime we have the fix that Patrick wrote for Jewel which at
least prevents people creating dirfrags too large for the OSDs to
handle.
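
For anyone who wants to switch fragmentation on explicitly in the meantime, a
minimal sketch, assuming a Jewel cluster where the experimental allow_dirfrags
flag is still needed (that flag name, the fs name "cephfs" and the daemon name
"mds.a" are all assumptions/placeholders), might look like:

    import subprocess

    # Let the filesystem use directory fragments (experimental pre-Kraken);
    # the exact flag and whether it is required here are assumptions.
    subprocess.check_call(["ceph", "fs", "set", "cephfs", "allow_dirfrags", "true"])

    # Turn the fragmentation/balancer option on for a running MDS via its
    # admin socket; persist "mds bal frag = true" in ceph.conf for restarts.
    subprocess.check_call(["ceph", "daemon", "mds.a", "config", "set",
                           "mds_bal_frag", "true"])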

Test 2.2: since a "failing to respond to cache pressure" bug is
affecting this, I would guess we see the performance fall off at about
the point where the *client* caches fill up (so they start trimming
things even though they're ignoring cache pressure).  It would be
interesting to see this chart with additional lines for some related
perf counters like mds_log.evtrm and mds.inodes_expired, that might
make it pretty obvious where the MDS is entering different stages that
see a decrease in the rate of handling client requests.
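
A quick way to capture those counters next to the benchmark, assuming the MDS
admin socket is reachable and the counters sit under those names in the perf
dump output, would be something along these lines:

    import json
    import subprocess
    import time

    MDS = "mds.a"   # placeholder daemon name; needs the local admin socket

    # Sample a few counters once a second so they can be plotted against the
    # client request rate.  The counter paths (mds_log.evtrm, mds.inodes_expired,
    # mds.request) are assumptions about the perf dump schema.
    while True:
        dump = json.loads(
            subprocess.check_output(["ceph", "daemon", MDS, "perf", "dump"]))
        print(int(time.time()),
              dump.get("mds_log", {}).get("evtrm"),
              dump.get("mds", {}).get("inodes_expired"),
              dump.get("mds", {}).get("request"))
        time.sleep(1)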

We really need to sort out the "failing to respond to cache pressure"
issues that keep popping up, especially if they're still happening on
a comparatively simple test that is just creating files.  We have a
specific test for this[1] that is currently being run against the fuse
client but not the kernel client[2].  This is a good time to try and
push that forward so I've kicked off an experimental run here:
http://pulpito.ceph.com/jspray-2016-08-10_16:14:52-kcephfs:recovery-master-testing-basic-mira/

In the meantime, although there are reports of similar issues with
newer kernels, it would be very useful to confirm if the same issue is
still occurring with more recent kernels.  Issues with cache trimming
have occurred due to various (separate) bugs, so it's possible that
while some people are still seeing cache trimming issues with recent
kernels, the specific case you're hitting might be fixed.

Test 2.3: restarting the MDS doesn't actually give you a completely
empty cache (everything in the journal gets replayed to pre-populate
the cache on MDS startup).  However, the results are still valid
because you're using a different random order in the non-caching test
case, and the number of inodes in your journal is probably much
smaller than the overall cache size so it's only a little bit
populated.  We don't currently have a "drop cache" command built into
the MDS but it would be pretty easy to add one for use in testing
(basically just call mds->mdcache->trim(0)).

As one would imagine, the non-caching case is latency-dominated when
the working set is larger than the cache, where each client is waiting
for one open to finish before proceeding to the next.  The MDS is
probably capable of handling many more operations per second, but it
would need more parallel IO operations from the clients.  When a
single client is doing opens one by one, you're potentially seeing a
full network+disk latency for each one (though in practice the OSD
read cache will be helping a lot here).  This non-caching case would
be the main argument for giving the metadata pool low latency (SSD)
storage.
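
As a crude illustration of what "more parallel IO from the clients" could look
like, here is a sketch that pushes a batch of opens from a single client
through a thread pool (the mount point and file names are placeholders, and
the files are assumed to exist already):

    import os
    import random
    from concurrent.futures import ThreadPoolExecutor

    MOUNT = "/mnt/cephfs/bench"   # placeholder CephFS mount with pre-created files

    files = [os.path.join(MOUNT, "file.%06d" % i) for i in range(100000)]
    random.shuffle(files)

    def open_close(path):
        # Each open is one metadata round trip; os.open releases the GIL during
        # the syscall, so a thread pool keeps many requests in flight at once.
        fd = os.open(path, os.O_RDONLY)
        os.close(fd)

    with ThreadPoolExecutor(max_workers=64) as pool:
        list(pool.map(open_close, files))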

Test 2.5: The observation that the CPU bottleneck makes using fast
storage for the metadata pool less useful (in sequential/cached cases)
is valid, although it could still be useful to isolate the metadata
OSDs (probably SSDs since not so much capacity is needed) to avoid
competing with data operations.  For random access in the non-caching
cases (2.3, 2.4) I think you would probably see an improvement from
SSDs.
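
On releases that have CRUSH device classes (Luminous and later), that
isolation can be done roughly as below; the rule name "meta-ssd" and the pool
name "cephfs_metadata" are placeholders:

    import subprocess

    def ceph(*args):
        subprocess.check_call(["ceph"] + list(args))

    # Create a replicated rule restricted to SSD-class OSDs, then point the
    # CephFS metadata pool at it.
    ceph("osd", "crush", "rule", "create-replicated",
         "meta-ssd", "default", "host", "ssd")
    ceph("osd", "pool", "set", "cephfs_metadata", "crush_rule", "meta-ssd")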

Thanks again to the team from eBay for sharing all this.

John



1. https://github.com/ceph/ceph-qa-suite/blob/master/tasks/cephfs/test_client_limits.py#L96
2. http://tracker.ceph.com/issues/9466


>
> Xiaoxi
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com