[ceph-users] PG numbers don't add up?

2018-03-13 Thread Nathan Dehnel
I try to add a data pool:

OSD_STAT USED   AVAIL TOTAL HB_PEERS            PG_SUM PRIMARY_PG_SUM
9        1076M  930G  931G  [0,1,2,3,4,5,6,7,8]    128              5
8        1076M  930G  931G  [0,1,2,3,4,5,6,7,9]    128             14
7        1076M  930G  931G  [0,1,2,3,4,5,6,8,9]    128             14
6        1076M  930G  931G  [0,1,2,3,4,5,7,8,9]    128             19
5        1076M  930G  931G  [0,1,2,3,4,6,7,8,9]    128             15
4        1076M  930G  931G  [0,1,2,3,5,6,7,8,9]    128             17
0        1076M  930G  931G  [1,2,3,4,5,6,7,8,9]    128             16
1        1076M  930G  931G  [0,2,3,4,5,6,7,8,9]    128              8
2        1076M  930G  931G  [0,1,3,4,5,6,7,8,9]    128              8
3        1076M  930G  931G  [0,1,2,4,5,6,7,8,9]    128             12
sum      10765M 9304G 9315G

I try to add a metadata pool:

sum 0 0 0 0 0 0 0 0
OSD_STAT USED   AVAIL TOTAL HB_PEERS            PG_SUM PRIMARY_PG_SUM
9        1076M  930G  931G  [0,1,2,3,4,5,6,7,8]     73             73
8        1076M  930G  931G  [0,1,2,3,4,5,6,7,9]     40             40
7        1076M  930G  931G  [0,1,2,3,4,5,6,8,9]     56             56
6        1076M  930G  931G  [0,1,2,3,4,5,7,8,9]     42             42
5        1076M  930G  931G  [0,1,2,3,4,6,7,8,9]     54             54
4        1076M  930G  931G  [0,1,2,3,5,6,7,8,9]     59             59
0        1076M  930G  931G  [1,2,3,4,5,6,7,8,9]     38             38
1        1076M  930G  931G  [0,2,3,4,5,6,7,8,9]     57             57
2        1076M  930G  931G  [0,1,3,4,5,6,7,8,9]     45             45
3        1076M  930G  931G  [0,1,2,4,5,6,7,8,9]     48             48
sum      10766M 9304G 9315G

I try to add both pools:
Error ERANGE:  pg_num 128 size 10 would mean 2816 total pgs, which exceeds
max 2000 (mon_max_pg_per_osd 200 * num_in_osds 10)

That's over a thousand more PGs than both pools combined. Where are they
coming from?
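For what it's worth, the ceiling in that error is simply mon_max_pg_per_osd multiplied
by the number of "in" OSDs, and the projected total appears to count PG replicas
(pg_num * size, summed over every pool including the one being created). A rough
sketch of the arithmetic, using only the numbers from the error message, not a
confirmed breakdown of this cluster:

echo $((128 * 10))   # 1280 PG instances the new pool alone would add (pg_num 128, size 10)
echo $((200 * 10))   # 2000 = mon_max_pg_per_osd * num_in_osds, the limit being enforced
# 2816 projected - 1280 new = 1536 PG instances apparently already mapped by existing pools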


Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects

2018-03-13 Thread Graham Allan
Updated cluster now to 12.2.4 and the cycle of 
inconsistent->repair->unfound seems to continue, though possibly 
slightly differently. A pg does pass through an "active+clean" phase 
after repair, which might be new, but more likely I never observed it at 
the right time before.


I see messages like this in the logs now "attr name mismatch 
'hinfo_key'" - perhaps this might cast more light on the cause:



2018-03-02 18:55:11.583850 osd.386 osd.386 10.31.0.72:6817/4057280 401 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbc6ed68:::default.325674.85_bellplants_images%2f1055211.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:00:18.031929 osd.386 osd.386 10.31.0.72:6817/4057280 428 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbc97561:::default.325674.85_bellplants_images%2f1017818.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:04:50.058477 osd.386 osd.386 10.31.0.72:6817/4057280 452 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbcbcb34:::default.325674.85_bellplants_images%2f1049756.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:13:05.689136 osd.386 osd.386 10.31.0.72:6817/4057280 494 : 
cluster [ERR] 70.3dbs0 : soid 
70:dbcfc7c9:::default.325674.85_bellplants_images%2f1021177.jpg:head attr name 
mismatch 'hinfo_key'
2018-03-02 19:13:30.883100 osd.386 osd.386 10.31.0.72:6817/4057280 495 : 
cluster [ERR] 70.3dbs0 repair 0 missing, 161 inconsistent objects
2018-03-02 19:13:30.888259 osd.386 osd.386 10.31.0.72:6817/4057280 496 : 
cluster [ERR] 70.3db repair 161 errors, 161 fixed


The only similar-sounding issue I could find is

http://tracker.ceph.com/issues/20089

When I look at src/osd/PGBackend.cc be_compare_scrubmaps in luminous, I 
don't see the changes in the commit here:


https://github.com/ceph/ceph/pull/15368/files

Of course a lot of other things have changed, but is it possible this 
fix never made it into luminous?


Graham
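One way to check whether that PR's change is actually present in a given luminous
release is to ask git directly; a sketch (the merge commit hash is a placeholder,
not the real one):

# in a ceph.git checkout
git tag --contains <merge-sha-of-pr-15368> | grep '^v12\.'   # luminous tags containing the commit
git log v12.2.4 --oneline -- src/osd/PGBackend.cc            # history of the file as of that release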

On 02/17/2018 12:48 PM, David Zafman wrote:


The commits below came after v12.2.2 and may impact this issue. When a 
pg is active+clean+inconsistent, it means that scrub has detected issues 
with one or more replicas of one or more objects.  An unfound object is a 
potentially temporary state in which the current set of available OSDs 
doesn't allow an object to be recovered/backfilled/repaired.  When the 
primary OSD restarts, any unfound objects (an in-memory structure) are 
reset so that the new set of peered OSDs can determine again which 
objects are unfound.


I'm not clear in this scenario whether recovery failed to start, 
recovery hung earlier due to a bug, or recovery stopped (as designed) 
because of the unfound object.  The new recovery_unfound and 
backfill_unfound states indicate that recovery has stopped due to 
unfound objects.
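As a side note, the current unfound state can be inspected from the CLI; a quick
sketch (the pg id is just taken from the logs earlier in this thread, substitute
your own):

ceph health detail | grep -i unfound   # which PGs currently report unfound objects
ceph pg 70.3db query                   # peering/recovery detail, including might_have_unfound
ceph pg 70.3db list_missing            # objects the primary knows about but cannot locate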



commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman 
Date:   Wed Jan 10 13:30:41 2018 -0800

     osd: Don't start recovery for missing until active pg state set

     I was seeing recovery hang when it is started before _activate_committed()
     The state machine passes into "Active" but this transitions to activating
     pg state and after commmitted into "active" pg state.

     Signed-off-by: David Zafman 

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman 
Date:   Mon Oct 9 08:19:21 2017 -0700

     osd, mon: Add new pg states recovery_unfound and backfill_unfound

     Signed-off-by: David Zafman 



On 2/16/18 1:40 PM, Gregory Farnum wrote:

On Fri, Feb 16, 2018 at 12:17 PM Graham Allan  wrote:


On 02/16/2018 12:31 PM, Graham Allan wrote:

If I set debug rgw=1 and demug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying communicate with osds 
for

pool 96, which is .rgw.control


10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 --
osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping
cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected
e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 
osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0
uv3933745 ondisk = 0) v8  152+0+0 (2536111836 0

0) 0x7f1158003e20

con 0x7f117afd8390

Prior to that, probably more relevant, this was the only communication
logged with the primary osd of the pg:


10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
osd_op(unknown.0.0:96 70.438s0
70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head
[getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695)
v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 
osd_backoff(70.438s0 block id 1

[70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) 


e507695) v1  209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con
0x7fab79885980
210.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --

Re: [ceph-users] Cephfs MDS slow requests

2018-03-13 Thread Gregory Farnum
On Tue, Mar 13, 2018 at 2:56 PM David C  wrote:

> Thanks for the detailed response, Greg. A few follow ups inline:
>
>
> On 13 Mar 2018 20:52, "Gregory Farnum"  wrote:
>
> On Tue, Mar 13, 2018 at 12:17 PM, David C  wrote:
> > Hi All
> >
> > I have a Samba server that is exporting directories from a Cephfs Kernel
> > mount. Performance has been pretty good for the last year but users have
> > recently been complaining of short "freezes", these seem to coincide with
> > MDS related slow requests in the monitor ceph.log such as:
> >
> >> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 :
> >> cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13
> >> 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019
> >> 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently
> commit_sent
> >> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 :
> >> cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13
> >> 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020
> >> 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently
> commit_sent
> >> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 :
> >> cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13
> >> 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077
> >> 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently
> commit_sent
> >> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 :
> >> cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13
> >> 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019
> >> 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently
> commit_sent
> >> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 :
> >> cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13
> >> 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055
> >> 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently
> commit_sent
>
> Well, that means your OSDs are getting operations that commit quickly
> to a journal but are taking a while to get into the backing
> filesystem. (I assume this is on filestore based on that message
> showing up at all, but could be missing something.)
>
>
> Yep it's filestore. Journals are on Intel P3700 NVME, data and metadata
> pools both on 7200rpm SATA. Sounds like I might benefit from moving
> metadata to a dedicated SSD pool.
>
> In the meantime, are there any recommended tunables? Filestore max/min
> sync interval for example?
>

Well, you can try. I'm not sure what the most successful deployments look
like.

If you turn up the min sync interval you stand a better chance of only
doing one write to your HDD if files get overwritten, for instance. But it
may also mean that your commits end up taking so long that you get worse IO
stalls, if there's no opportunity to coalesce and the reality is just that
you're trying to push more IOs through the system than the backing HDDs can
support.
-Greg
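If you do experiment with them, a minimal sketch of how the knobs can be applied
(the values are placeholders to illustrate, not recommendations):

# at runtime on all OSDs (reverts on restart):
ceph tell osd.\* injectargs '--filestore_min_sync_interval 1 --filestore_max_sync_interval 10'
# or persistently in ceph.conf under [osd]:
#   filestore min sync interval = 1
#   filestore max sync interval = 10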


>
> >
> >
> > --
> >
> > Looking in the MDS log, with debug set to 4, it's full of
> "setfilelockrule
> > 1" and "setfilelockrule 2":
> >
> >> 2018-03-13 14:23:00.446905 7fde43e73700  4 mds.0.server
> >> handle_client_request client_request(client.9174621:141162337
> >> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120,
> >> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521
> caller_uid=1155,
> >> caller_gid=1131{}) v2
> >> 2018-03-13 14:23:00.447050 7fde43e73700  4 mds.0.server
> >> handle_client_request client_request(client.9174621:141162338
> >> setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start
> 0,
> >> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
> >> caller_gid=0{}) v2
> >> 2018-03-13 14:23:00.447258 7fde43e73700  4 mds.0.server
> >> handle_client_request client_request(client.9174621:141162339
> >> setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start
> 0,
> >> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
> >> caller_gid=0{}) v2
> >> 2018-03-13 14:23:00.447393 7fde43e73700  4 mds.0.server
> >> handle_client_request client_request(client.9174621:141162340
> >> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124,
> >> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521
> caller_uid=1155,
> >> caller_gid=1131{}) v2
>
> And that is clients setting (and releasing) advisory locks on files. I
> don't think this should directly have anything to do with the slow OSD
> requests (file locking is ephemeral state, not committed to disk), but
> if you have new applications running which are taking file locks on
> shared files that could definitely impede other clients and slow
> things down more generally.
> -Greg
>
>
> Sounds like that could be a red herring 

Re: [ceph-users] Cephfs MDS slow requests

2018-03-13 Thread David C
Thanks for the detailed response, Greg. A few follow ups inline:

On 13 Mar 2018 20:52, "Gregory Farnum"  wrote:

On Tue, Mar 13, 2018 at 12:17 PM, David C  wrote:
> Hi All
>
> I have a Samba server that is exporting directories from a Cephfs Kernel
> mount. Performance has been pretty good for the last year but users have
> recently been complaining of short "freezes", these seem to coincide with
> MDS related slow requests in the monitor ceph.log such as:
>
>> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 :
>> cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13
>> 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019
>> 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently
commit_sent
>> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 :
>> cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13
>> 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020
>> 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently
commit_sent
>> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 :
>> cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13
>> 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077
>> 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently
commit_sent
>> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 :
>> cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13
>> 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019
>> 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently
commit_sent
>> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 :
>> cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13
>> 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055
>> 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently
commit_sent

Well, that means your OSDs are getting operations that commit quickly
to a journal but are taking a while to get into the backing
filesystem. (I assume this is on filestore based on that message
showing up at all, but could be missing something.)


Yep it's filestore. Journals are on Intel P3700 NVME, data and metadata
pools both on 7200rpm SATA. Sounds like I might benefit from moving
metadata to a dedicated SSD pool.

In the meantime, are there any recommended tunables? Filestore max/min sync
interval for example?


>
>
> --
>
> Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule
> 1" and "setfilelockrule 2":
>
>> 2018-03-13 14:23:00.446905 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162337
>> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120,
>> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521
caller_uid=1155,
>> caller_gid=1131{}) v2
>> 2018-03-13 14:23:00.447050 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162338
>> setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0,
>> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
>> caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447258 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162339
>> setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0,
>> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
>> caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447393 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162340
>> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124,
>> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521
caller_uid=1155,
>> caller_gid=1131{}) v2

And that is clients setting (and releasing) advisory locks on files. I
don't think this should directly have anything to do with the slow OSD
requests (file locking is ephemeral state, not committed to disk), but
if you have new applications running which are taking file locks on
shared files that could definitely impede other clients and slow
things down more generally.
-Greg


Sounds like that could be a red herring then. It seems like my issue is
users chucking lots of small writes at the cephfs mount.


>
>
> --
>
> I don't have a particularly good monitoring set up on this cluster yet,
but
> a cursory look at a few things such as iostat doesn't seem to suggest OSDs
> are being hammered.
>
> Some questions:
>
> 1) Can anyone recommend a way of diagnosing this issue?
> 2) Are the multiple "setfilelockrule" per inode to be expected? I assume
> this is something to do with the Samba oplocks.

Hmm, you might be right about the oplocks. That's an output format
error btw, it should be "setfilelock" (the op type) and a separate
word " rule " (indicating the type of lock, 1 for shared and 2 for
exclusive).


> 3) What's the recommended 

Re: [ceph-users] Cephfs MDS slow requests

2018-03-13 Thread Gregory Farnum
On Tue, Mar 13, 2018 at 12:17 PM, David C  wrote:
> Hi All
>
> I have a Samba server that is exporting directories from a Cephfs Kernel
> mount. Performance has been pretty good for the last year but users have
> recently been complaining of short "freezes", these seem to coincide with
> MDS related slow requests in the monitor ceph.log such as:
>
>> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 :
>> cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13
>> 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019
>> 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently commit_sent
>> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 :
>> cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13
>> 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020
>> 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently commit_sent
>> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 :
>> cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13
>> 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077
>> 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently commit_sent
>> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 :
>> cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13
>> 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019
>> 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently commit_sent
>> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 :
>> cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13
>> 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055
>> 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently commit_sent

Well, that means your OSDs are getting operations that commit quickly
to a journal but are taking a while to get into the backing
filesystem. (I assume this is on filestore based on that message
showing up at all, but could be missing something.)

>
>
> --
>
> Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule
> 1" and "setfilelockrule 2":
>
>> 2018-03-13 14:23:00.446905 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162337
>> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120,
>> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155,
>> caller_gid=1131{}) v2
>> 2018-03-13 14:23:00.447050 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162338
>> setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0,
>> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
>> caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447258 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162339
>> setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0,
>> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
>> caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447393 7fde43e73700  4 mds.0.server
>> handle_client_request client_request(client.9174621:141162340
>> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124,
>> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155,
>> caller_gid=1131{}) v2

And that is clients setting (and releasing) advisory locks on files. I
don't think this should directly have anything to do with the slow OSD
requests (file locking is ephemeral state, not committed to disk), but
if you have new applications running which are taking file locks on
shared files that could definitely impede other clients and slow
things down more generally.
-Greg

>
>
> --
>
> I don't have a particularly good monitoring set up on this cluster yet, but
> a cursory look at a few things such as iostat doesn't seem to suggest OSDs
> are being hammered.
>
> Some questions:
>
> 1) Can anyone recommend a way of diagnosing this issue?
> 2) Are the multiple "setfilelockrule" per inode to be expected? I assume
> this is something to do with the Samba oplocks.

Hmm, you might be right about the oplocks. That's an output format
error btw, it should be "setfilelock" (the op type) and a separate
word " rule " (indicating the type of lock, 1 for shared and 2 for
exclusive).
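If you want to reproduce the traffic, a rough sketch from a client on the cephfs
mount (both BSD flock and POSIX fcntl locks end up as setfilelock requests to the
MDS; the path is made up):

flock -s /mnt/cephfs/shared.dat -c 'sleep 5'   # hold a shared advisory lock while the command runs
flock -x /mnt/cephfs/shared.dat -c 'sleep 5'   # hold an exclusive advisory lock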

> 3) What's the recommended highest MDS debug setting before performance
> starts to be adversely affected (I'm aware log files will get huge)?

There's not a good answer here. If you actually encounter a bug
developers will want at least 10, and probably 20, but both of those
have measurable performance impacts. :/

> 4) What's the best way of matching inodes in the MDS log to the file names
> in cephfs?

If you have an actual log then before the inode gets used it has to
get found, and that will be by path. You can also look at the xattrs
on the object in rados; one 
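A sketch of that kind of lookup, assuming the data pool is called cephfs_data and
the filesystem is mounted at /mnt/cephfs (both placeholders), using one of the
inode numbers from the MDS log above; the first data object of a file is named
<inode-hex>.00000000 and carries the backtrace in its 'parent' xattr:

rados -p cephfs_data listxattr 10001e8dc37.00000000
rados -p cephfs_data getxattr 10001e8dc37.00000000 parent > /tmp/backtrace.bin
ceph-dencoder type inode_backtrace_t import /tmp/backtrace.bin decode dump_json
# or search the mount for the inode number directly:
find /mnt/cephfs -inum $((0x10001e8dc37))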

Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-13 Thread Jason Dillaman
Can you provide the output from "rbd info /volume-80838a69-e544-47eb-b981-a4786be89736"?

On Tue, Mar 13, 2018 at 12:30 PM, Fulvio Galeazzi
 wrote:
> Hallo!
>
>> Discards appear like they are being sent to the device.  How big of a
>> temporary file did you create and then delete? Did you sync the file
>> to disk before deleting it? What version of qemu-kvm are you running?
>
>
> I made several test with commands like (issuing sync after each operation):
>
> dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct
>
> What I see is that if I repeat the command with count<=200 the size does not
> increase.
>
> Let's try now with count>200:
>
> NAMEPROVISIONED  USED
> volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2284M
>
> dd if=/dev/zero of=/tmp/fileTest bs=1M count=750 oflag=direct
> dd if=/dev/zero of=/tmp/fileTest2 bs=1M count=750 oflag=direct
> sync
>
> NAMEPROVISIONED  USED
> volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M
>
> rm /tmp/fileTest*
> sync
> sudo fstrim -v /
> /: 14.1 GiB (15145271296 bytes) trimmed
>
> NAMEPROVISIONED  USED
> volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M
>
>
>
> As for qemu-kvm, the guest OS is CentOS7, with:
>
> [centos@testcentos-deco3 tmp]$ rpm -qa | grep qemu
> qemu-guest-agent-2.8.0-2.el7.x86_64
>
> while the host is Ubuntu 16 with:
>
> root@pa1-r2-s10:/home/ubuntu# dpkg -l | grep qemu
> ii  qemu-block-extra:amd64   1:2.8+dfsg-3ubuntu2.9~cloud1
> amd64extra block backend modules for qemu-system and qemu-utils
> ii  qemu-kvm 1:2.8+dfsg-3ubuntu2.9~cloud1
> amd64QEMU Full virtualization
> ii  qemu-system-common   1:2.8+dfsg-3ubuntu2.9~cloud1
> amd64QEMU full system emulation binaries (common files)
> ii  qemu-system-x86  1:2.8+dfsg-3ubuntu2.9~cloud1
> amd64QEMU full system emulation binaries (x86)
> ii  qemu-utils   1:2.8+dfsg-3ubuntu2.9~cloud1
> amd64QEMU utilities
>
>
>   Thanks!
>
> Fulvio
>



-- 
Jason


Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-13 Thread Jason Dillaman
Discards appear like they are being sent to the device.  How big of a
temporary file did you create and then delete? Did you sync the file
to disk before deleting it? What version of qemu-kvm are you running?

On Tue, Mar 13, 2018 at 11:09 AM, Fulvio Galeazzi
 wrote:
> Hallo Jason,
> thanks for your feedback!
>
>  Original Message 
> >> * decorated a CentOS image with hw_scsi_model=virtio--scsi,hw_disk_bus=scsi
> > Is that just a typo for "hw_scsi_model"?
> Yes, it was a typo when I wrote my message. The image has virtio-scsi as it
> should.
>
>>> I see that commands:
>>> rbd --cluster cephpa1 diff cinder-ceph/${theVol} | awk '{ SUM += $2 } END
>>> {
>>> print SUM/1024/1024 " MB" }' ; rados --cluster cephpa1 -p cinder-ceph ls
>>> |
>>> grep rbd_data.{whatever} | wc -l
>>
>>
> That's pretty old-school -- you can just use 'rbd du' now to calculate
> the disk usage.
>
>
> Good to know, thanks!
>
>>>   show the size increases but does not decrease when I execute delete the
>>> temporary file and execute
>>>  sudo fstrim -v /
>>
>>
>> Have you verified that your VM is indeed using virtio-scsi? Does
>> blktrace show SCSI UNMAP operations being issued to the block device
>> when you execute "fstrim"?
>
>
> Thanks for the tip, I think I need some more help, please.
>
> Disk on my VM is indeed /dev/sda rather than /dev/vda. The XML shows:
> .
> 
>   
> .
>name='cinder-ceph/volume-80838a69-e544-47eb-b981-a4786be89736'>
> .
>   
>   80838a69-e544-47eb-b981-a4786be89736
>   
> 
> 
>function='0x0'/>
> 
>
>
> As for blktrace, blkparse shows me tons of lines, please find below the
> first ones and one of the many group of lines which I see:
>
>   8,00   11 4.333917112 24677  Q FWFSM 8406583 + 4 [fstrim]
>   8,00   12 4.333919649 24677  G FWFSM 8406583 + 4 [fstrim]
>   8,00   13 4.333920695 24677  P   N [fstrim]
>   8,00   14 4.333922965 24677  I FWFSM 8406583 + 4 [fstrim]
>   8,00   15 4.333924575 24677  U   N [fstrim] 1
>   8,00   20 4.340140041 24677  Q   D 986016 + 2097152 [fstrim]
>   8,00   21 4.340144908 24677  G   D 986016 + 2097152 [fstrim]
>   8,00   22 4.340145561 24677  P   N [fstrim]
>   8,00   24 4.340147495 24677  Q   D 3083168 + 1112672 [fstrim]
>   8,00   25 4.340149772 24677  G   D 3083168 + 1112672 [fstrim]
> .
>   8,00   50 4.340556955 24677  Q   D 665880 + 20008 [fstrim]
>   8,00   51 4.340558481 24677  G   D 665880 + 20008 [fstrim]
>   8,00   52 4.340558728 24677  P   N [fstrim]
>   8,00   53 4.340559725 24677  I   D 665880 + 20008 [fstrim]
>   8,00   54 4.340560292 24677  U   N [fstrim] 1
>   8,00   55 4.340560801 24677  D   D 665880 + 20008 [fstrim]
> .
>
> Apologies for my ignorance, is the above enough to understand whether SCSI
> UNMAP operations are being issued?
>
>   Thanks a lot!
>
> Fulvio
>



-- 
Jason


Re: [ceph-users] SSD as DB/WAL performance with/without drive write cache

2018-03-13 Thread Caspar Smit
Hi Vladimir,

Yeah, the results are pretty low compared to yours, but I think this is
because this SSD is in a fairly old server (Supermicro X8, SAS2
expander backplane).

The controller is an LSI/Broadcom 9207-8i on the latest IT firmware (same LSI 2308
chipset as yours).

Kind regards,
Caspar

2018-03-13 21:00 GMT+01:00 Дробышевский, Владимир :

> Hello, Caspar!
>
>   Would you mind to share controller model you use? I would say these
> results are pretty low.
>
>   Here are my results on Intel RMS25LB LSI2308 based SAS controller in IT
> mode:
>
> I set write_cache to write through
>
> Test command, fio 2.2.10:
>
> sudo fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=XXX --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=journal-test
>
> where XXX - number of jobs
>
> Results:
>
> numjobs: 1
>
>   write: io=5068.6MB, bw=86493KB/s, iops=21623, runt= 60001msec
> clat (usec): min=38, max=8343, avg=45.01, stdev=32.10
>
> numjobs : 5
>
>   write: io=14548MB, bw=248274KB/s, iops=62068, runt= 60001msec
> clat (usec): min=40, max=11291, avg=79.05, stdev=46.37
>
> numjobs : 10
>
>   write: io=14762MB, bw=251939KB/s, iops=62984, runt= 60001msec
> clat (usec): min=52, max=10356, avg=157.16, stdev=65.69
>
> I have got even better results on z97 integrated SATA controller, you can
> find them in comments to the post you have mentioned (
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-te
> st-if-your-ssd-is-suitable-as-a-journal-device/#comment-3273882789).
>
> Still don't know why LSI 2308 SAS performance worse than z97 SATA and
> can't find any info on why write back cache setting has slower write than
> write through.
>
> But I would offer to pay more attention to IOPS than to the sequential
> write speed, especially on the small blocks workload.
>
> 2018-03-13 21:33 GMT+05:00 Caspar Smit :
>
>> Hi all,
>>
>> I've tested some new Samsung SM863 960GB and Intel DC S4600 240GB SSD's
>> using the method described at Sebastien Han's blog:
>>
>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-tes
>> t-if-your-ssd-is-suitable-as-a-journal-device/
>>
>> The first thing stated there is to disable the drive's write cache, which
>> i did.
>>
>> For the Samsungs i got these results:
>>
>> 1 Job: 85 MB/s
>> 5 Jobs: 179 MB/s
>> 10 Jobs: 179 MB/s
>>
>> I was curious what the results would be with the drive write cache on, so
>> i turned it on.
>>
>> Now i got these results:
>>
>> 1 Job: 49 MB/s
>> 5 Jobs: 110 MB/s
>> 10 Jobs: 132 MB/s
>>
>> So i didn't expect these results to be worse because i would assume a
>> drive write cache would make it faster.
>>
>> For the Intels i got more or less the same conclusion (with different
>> figures) but the performance with drive write cache was about half the
>> performance as without drive write cache.
>>
>> Questions:
>>
>> 1) Is this expected behaviour (for all/most SSD's)? If yes, why?
>> 2) Is this only with this type of test?
>> 3) Should i always disable drive write cache for SSD's during boot?
>> 4) Is there any negative side-effect of disabling the drive's write cache?
>> 5) Are these tests still relevant for DB/WAL devices? The blog is written
>> for Filestore and states all journal writes are sequential but is that also
>> true for bluestore DB/WAL writes? Do i need to test differently for DB/WAL?
>>
>> Kind regards,
>> Caspar
>>
>
>
> --
>
> Best regards,
> Дробышевский Владимир
> "АйТи Город" company
> +7 343 192
>
> IT consulting
> Turnkey project delivery
> IT services outsourcing
> IT infrastructure outsourcing
>


Re: [ceph-users] SSD as DB/WAL performance with/without drive write cache

2018-03-13 Thread Дробышевский , Владимир
Hello, Caspar!

  Would you mind sharing the controller model you use? I would say these
results are pretty low.

  Here are my results on Intel RMS25LB LSI2308 based SAS controller in IT
mode:

I set write_cache to write through

Test command, fio 2.2.10:

sudo fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
--numjobs=XXX --iodepth=1 --runtime=60 --time_based --group_reporting
--name=journal-test

where XXX - number of jobs

Results:

numjobs: 1

  write: io=5068.6MB, bw=86493KB/s, iops=21623, runt= 60001msec
clat (usec): min=38, max=8343, avg=45.01, stdev=32.10

numjobs : 5

  write: io=14548MB, bw=248274KB/s, iops=62068, runt= 60001msec
clat (usec): min=40, max=11291, avg=79.05, stdev=46.37

numjobs : 10

  write: io=14762MB, bw=251939KB/s, iops=62984, runt= 60001msec
clat (usec): min=52, max=10356, avg=157.16, stdev=65.69

I have got even better results on z97 integrated SATA controller, you can
find them in comments to the post you have mentioned (
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-
test-if-your-ssd-is-suitable-as-a-journal-device/#comment-3273882789).

I still don't know why the LSI 2308 SAS performance is worse than the Z97 SATA and can't
find any info on why the write-back cache setting gives slower writes than
write-through.

But I would suggest paying more attention to IOPS than to sequential
write speed, especially for small-block workloads.

2018-03-13 21:33 GMT+05:00 Caspar Smit :

> Hi all,
>
> I've tested some new Samsung SM863 960GB and Intel DC S4600 240GB SSD's
> using the method described at Sebastien Han's blog:
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-
> test-if-your-ssd-is-suitable-as-a-journal-device/
>
> The first thing stated there is to disable the drive's write cache, which
> i did.
>
> For the Samsungs i got these results:
>
> 1 Job: 85 MB/s
> 5 Jobs: 179 MB/s
> 10 Jobs: 179 MB/s
>
> I was curious what the results would be with the drive write cache on, so
> i turned it on.
>
> Now i got these results:
>
> 1 Job: 49 MB/s
> 5 Jobs: 110 MB/s
> 10 Jobs: 132 MB/s
>
> So i didn't expect these results to be worse because i would assume a
> drive write cache would make it faster.
>
> For the Intels i got more or less the same conclusion (with different
> figures) but the performance with drive write cache was about half the
> performance as without drive write cache.
>
> Questions:
>
> 1) Is this expected behaviour (for all/most SSD's)? If yes, why?
> 2) Is this only with this type of test?
> 3) Should i always disable drive write cache for SSD's during boot?
> 4) Is there any negative side-effect of disabling the drive's write cache?
> 5) Are these tests still relevant for DB/WAL devices? The blog is written
> for Filestore and states all journal writes are sequential but is that also
> true for bluestore DB/WAL writes? Do i need to test differently for DB/WAL?
>
> Kind regards,
> Caspar
>
>
>


-- 

Best regards,
Дробышевский Владимир
"АйТи Город" company
+7 343 192

IT consulting
Turnkey project delivery
IT services outsourcing
IT infrastructure outsourcing


Re: [ceph-users] Civetweb log format

2018-03-13 Thread Aaron Bassett
Well, I have it mostly wrapped up and writing to Graylog; however, the ops log
has a `remote_addr` field which, as far as I can tell, is always blank. I found
this fix, but it seems to only be in v13.0.1:
https://github.com/ceph/ceph/pull/16860

Is there any chance we'd see backports of this to Jewel and/or luminous?


Aaron

On Mar 12, 2018, at 5:50 PM, Aaron Bassett 
> wrote:

Quick update:

adding the following to your config:

rgw log http headers = "http_authorization"
rgw ops log socket path = /tmp/rgw
rgw enable ops log = true
rgw enable usage log = true


and you can now

 nc -U /tmp/rgw |./jq --stream 'fromstream(1|truncate_stream(inputs))'
{
  "time": "2018-03-12 21:42:19.479037Z",
  "time_local": "2018-03-12 21:42:19.479037",
  "remote_addr": "",
  "user": "test",
  "operation": "PUT",
  "uri": "/testbucket/",
  "http_status": "200",
  "error_code": "",
  "bytes_sent": 19,
  "bytes_received": 0,
  "object_size": 0,
  "total_time": 600967,
  "user_agent": "Boto/2.46.1 Python/2.7.12 Linux/4.4.0-42-generic",
  "referrer": "",
  "http_x_headers": [
{
  "HTTP_AUTHORIZATION": "AWS : "
}
  ]
}

pretty good start on getting an audit log going!


On Mar 9, 2018, at 10:52 PM, Konstantin Shalygin 
> wrote:



Unfortunately I can't quite figure out how to use it. I've got "rgw log http 
headers = "authorization" in my ceph.conf but I'm getting no love in the rgw 
log.



I think this should have the 'http_' prefix, like:


rgw log http headers = "http_host, http_x_forwarded_for"





k



CONFIDENTIALITY NOTICE
This e-mail message and any attachments are only for the use of the intended 
recipient and may contain information that is privileged, confidential or 
exempt from disclosure under applicable law. If you are not the intended 
recipient, any disclosure, distribution or other use of this e-mail message or 
attachments is prohibited. If you have received this e-mail message in error, 
please delete and notify the sender immediately. Thank you.


[ceph-users] Cephfs MDS slow requests

2018-03-13 Thread David C
Hi All

I have a Samba server that is exporting directories from a Cephfs Kernel
mount. Performance has been pretty good for the last year but users have
recently been complaining of short "freezes", these seem to coincide with
MDS related slow requests in the monitor ceph.log such as:

2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 :
> cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13
> 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019
> 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently commit_sent
> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 :
> cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13
> 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020
> 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently commit_sent
> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 :
> cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13
> 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077
> 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently commit_sent
> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 :
> cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13
> 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019
> 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently commit_sent
> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 :
> cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13
> 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055
> 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently commit_sent


-- 

Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule
1" and "setfilelockrule 2":

2018-03-13 14:23:00.446905 7fde43e73700  4 mds.0.server
> handle_client_request client_request(client.9174621:141162337
> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120,
> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155,
> caller_gid=1131{}) v2
> 2018-03-13 14:23:00.447050 7fde43e73700  4 mds.0.server
> handle_client_request client_request(client.9174621:141162338
> setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0,
> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
> caller_gid=0{}) v2
> 2018-03-13 14:23:00.447258 7fde43e73700  4 mds.0.server
> handle_client_request client_request(client.9174621:141162339
> setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0,
> length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0,
> caller_gid=0{}) v2
> 2018-03-13 14:23:00.447393 7fde43e73700  4 mds.0.server
> handle_client_request client_request(client.9174621:141162340
> setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124,
> length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155,
> caller_gid=1131{}) v2


-- 

I don't have a particularly good monitoring set up on this cluster yet, but
a cursory look at a few things such as iostat doesn't seem to suggest OSDs
are being hammered.

Some questions:

1) Can anyone recommend a way of diagnosing this issue?
2) Are the multiple "setfilelockrule" per inode to be expected? I assume
this is something to do with the Samba oplocks.
3) What's the recommended highest MDS debug setting before performance
starts to be adversely affected (I'm aware log files will get huge)?
4) What's the best way of matching inodes in the MDS log to the file names
in cephfs?

Hardware/Versions:

Luminous 12.1.1
Cephfs client 3.10.0-514.2.2.el7.x86_64
Samba 4.4.4
4 node cluster, each node 1xIntel 3700 NVME, 12x SATA, 40Gbps networking

Thanks in advance!

Cheers,
David


Re: [ceph-users] Fwd: [ceph bad performance], can't find a bottleneck

2018-03-13 Thread Sergey Kotov
Hi, Maged

Not a big difference in either case.

Performance of a 4-node pool with 5x PM863a per node is:
4k bs - 33-37k IOPS with krbd at 128 threads and 42-51k IOPS at 1024 threads (fio
numjobs 128-256-512).
The same thing happens when we try to increase the rbd workload: 3 rbd images
get the same total IOPS.
Dead end & limit )

Thank you!
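One thing that may be worth ruling out when re-running the fio tests quoted below:
with ioengine=libaio the submission is only truly asynchronous with direct=1, so
something along these lines (parameters purely illustrative) gives a more honest
per-job queue depth:

fio --name=iops --rw=randread --bs=4k --filename=/dev/rbd2 \
    --ioengine=libaio --direct=1 --iodepth=32 --numjobs=12 \
    --runtime=60 --time_based --group_reporting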

2018-03-12 21:49 GMT+03:00 Maged Mokhtar :

> Hi,
>
> Try increasing the queue depth from default 128 to 1024:
>
> rbd map image-XX  -o queue_depth=1024
>
>
> Also if you run multiple rbd images/fio tests, do you get higher combined
> performance ?
>
> Maged
>
>
> On 2018-03-12 17:16, Sergey Kotov wrote:
>
> Dear moderator, I subscribed to the ceph list today; could you please post my
> message?
>
> -- Forwarded message --
> From: Sergey Kotov 
> Date: 2018-03-06 10:52 GMT+03:00
> Subject: [ceph bad performance], can't find a bottleneck
> To: ceph-users@lists.ceph.com
> Cc: Житенев Алексей , Anna Anikina <
> anik...@gmail.com>
>
>
> Good day.
>
> Can you please help us find the bottleneck in our Ceph installations?
> We have 3 SSD-only clusters with different HW, but the situation is the same -
> overall I/O between client & Ceph is lower than 1/6 of the combined
> performance of all the SSDs.
>
> For example -
> One of our clusters has 4 nodes with Toshiba 2TB enterprise SSD drives,
> running Ubuntu Server 16.04.
> Servers are connected to 10G switches. Latency between nodes is about
> 0.1ms. Ethernet utilisation is low.
>
> # uname -a
> Linux storage01 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> # ceph osd versions
> {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
> luminous (stable)": 55
> }
>
>
> When we map an rbd image directly on the storage nodes via krbd, performance
> is not good enough.
> We use fio for testing. Even when we run a 4k block size randwrite test in
> multi-threaded mode, our drives don't show utilisation higher than 30% and
> latency is OK.
>
> At the same time, iostat displays 100% utilisation on /dev/rbdX.
>
> Also, we can't enable rbd_cache, because we use SCST iSCSI on top of the
> mapped rbd images.
>
> How can we resolve the issue?
>
> Ceph config:
>
> [global]
> fsid = beX482fX-6a91-46dX-ad22-21a8a2696abX
> mon_initial_members = storage01, storage02, storage03
> mon_host = X.Y.Z.1,X.Y.Z.2,X.Y.Z.3
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public_network = X.Y.Z.0/24
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 1024
> osd_journal_size = 10240
> osd_mkfs_type = xfs
> filestore_op_threads = 16
> filestore_wbthrottle_enable = False
> throttler_perf_counter = False
> osd crush update on start = false
>
> [osd]
> osd_scrub_begin_hour = 1
> osd_scrub_end_hour = 6
> osd_scrub_priority = 1
>
> osd_enable_op_tracker = False
> osd_max_backfills = 1
> osd heartbeat grace = 20
> osd heartbeat interval = 5
> osd recovery max active = 1
> osd recovery max single start = 1
> osd recovery op priority = 1
> osd recovery threads = 1
> osd backfill scan max = 16
> osd backfill scan min = 4
> osd max scrubs = 1
> osd scrub interval randomize ratio = 1.0
> osd disk thread ioprio class = idle
> osd disk thread ioprio priority = 0
> osd scrub chunk max = 1
> osd scrub chunk min = 1
> osd deep scrub stride = 1048576
> osd scrub load threshold = 5.0
> osd scrub sleep = 0.1
>
> [client]
> rbd_cache = false
>
>
> Sample fio tests:
>
> root@storage04:~# fio --name iops --rw randread --bs 4k --filename
> /dev/rbd2 --numjobs 12 --ioengine=libaio --group_reporting
> iops: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
> ...
> fio-2.2.10
> Starting 12 processes
> ^Cbs: 12 (f=12): [r(12)] [1.2% done] [128.4MB/0KB/0KB /s] [32.9K/0/0 iops]
> [eta 16m:49s]
> fio: terminating on signal 2
>
> iops: (groupid=0, jobs=12): err= 0: pid=29812: Sun Feb 11 23:59:19 2018
>   read : io=1367.8MB, bw=126212KB/s, iops=31553, runt= 11097msec
> slat (usec): min=1, max=59700, avg=375.92, stdev=495.19
> clat (usec): min=0, max=377, avg= 1.12, stdev= 3.16
>  lat (usec): min=1, max=59702, avg=377.61, stdev=495.32
> clat percentiles (usec):
>  |  1.00th=[0],  5.00th=[0], 10.00th=[1], 20.00th=[1],
>  | 30.00th=[1], 40.00th=[1], 50.00th=[1], 60.00th=[1],
>  | 70.00th=[1], 80.00th=[1], 90.00th=[1], 95.00th=[2],
>  | 99.00th=[2], 99.50th=[2], 99.90th=[   73], 99.95th=[   78],
>  | 99.99th=[  115]
> bw (KB  /s): min= 8536, max=11944, per=8.33%, avg=10516.45,
> stdev=635.32
> lat (usec) : 2=91.74%, 4=7.93%, 10=0.14%, 20=0.09%, 50=0.01%
> lat (usec) : 100=0.07%, 250=0.03%, 500=0.01%
>   cpu  : usr=1.32%, sys=3.69%, ctx=329556, majf=0, minf=134
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> >=64=0.0%
> 

Re: [ceph-users] Object Gateway - Server Side Encryption

2018-03-13 Thread Casey Bodley


On 03/10/2018 12:58 AM, Amardeep Singh wrote:

On Saturday 10 March 2018 02:01 AM, Casey Bodley wrote:


On 03/08/2018 07:16 AM, Amardeep Singh wrote:

Hi,

I am trying to configure server side encryption using Key Management 
Service as per documentation 
http://docs.ceph.com/docs/master/radosgw/encryption/


I configured the Keystone/Barbican integration and it's working; I tested it 
using curl commands. After I configure RadosGW and use 
boto.s3.connection from Python or the s3cmd client, an error is thrown.

boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
encoding="UTF-8"?>AccessDeniedFailed to retrieve the actual key, kms-keyid: 616b2ce2-053a-41e3-b51e-0ff53e33cf81newbuckettx77750-005aa1274b-ac51-uk-westac51-uk-west-uk
In the server-side logs it's getting the token, and Barbican is 
authenticating the request and providing the secret URL, but it is unable to 
serve the key.

22:10:03.940091 7f056f7eb700 15 ceph_armor ret=16
 22:10:03.940111 7f056f7eb700 15 
supplied_md5=eb1a3227cdc3fedbaec2fe38bf6c044a
 22:10:03.940129 7f056f7eb700 20 reading from 
uk-west.rgw.meta:root:.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1
 22:10:03.940138 7f056f7eb700 20 get_system_obj_state: 
rctx=0x7f056f7e39f0 
obj=uk-west.rgw.meta:root:.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 
state=0x56540487a5a0 s->prefetch_data=0
 22:10:03.940145 7f056f7eb700 10 cache get: 
name=uk-west.rgw.meta+root+.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 
: hit (requested=0x16, cached=0x17)
 22:10:03.940152 7f056f7eb700 20 get_system_obj_state: s->obj_tag 
was set empty
 22:10:03.940155 7f056f7eb700 10 cache get: 
name=uk-west.rgw.meta+root+.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 
: hit (requested=0x11, cached=0x17)
 22:10:03.944015 7f056f7eb700 20 bucket quota: max_objects=1638400 
max_size=-1
 22:10:03.944030 7f056f7eb700 20 bucket quota OK: 
stats.num_objects=7 stats.size=50
 22:10:03.944176 7f056f7eb700 20 Getting KMS encryption key for 
key=616b2ce2-053a-41e3-b51e-0ff53e33cf81
 22:10:03.944225 7f056f7eb700 20 Requesting secret from barbican 
url=http://keyserver.rados:5000/v3/auth/tokens
 22:10:03.944281 7f056f7eb700 20 sending request to 
http://keyserver.rados:5000/v3/auth/tokens
 22:10:04.405974 7f056f7eb700 20 sending request to 
http://keyserver.rados:9311/v1/secrets/616b2ce2-053a-41e3-b51e-0ff53e33cf81
 22:10:05.519874 7f056f7eb700 5 Failed to retrieve secret from 
barbican:616b2ce2-053a-41e3-b51e-0ff53e33cf81



It looks like this request is being rejected by barbican. Do you have 
any logs on the barbican side that might show why?

I only get 2 lines in the Barbican logs; one shows a warning.

22:10:08.255 807 WARNING barbican.api.controllers.secrets 
[req-091413d2--46e2-be5f-a3e68a480ac9 
716dad1b8044459c99fea284dbfc47cc - - default default] Decrypted secret 
616b2ce2-053a-41e3-b51e-0ff53e33cf81 requested using deprecated API call.
22:10:08.261 807 INFO barbican.api.middleware.context 
[req-091413d2--46e2-be5f-a3e68a480ac9 
716dad1b8044459c99fea284dbfc47cc - - default default] Processed 
request: 200 OK - GET 
http://keyserver.rados:9311/v1/secrets/616b2ce2-053a-41e3-b51e-0ff53e33cf81




Okay, so barbican is returning 200 OK but radosgw is still converting 
that to EACCES. I'm guessing that's happening in 
request_key_from_barbican() here: 
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_crypt.cc#L779 - is 
it possible the key in barbican is something other than AES256?
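For comparison, this is roughly the shape of a barbican CLI call that creates the
256-bit AES key rgw expects, plus a way to check what the existing secret actually
is (a sketch; the name is a placeholder and credentials/endpoints are omitted):

openstack secret order create --name rgw-test-key \
    --algorithm aes --mode ctr --bit-length 256 \
    --payload-content-type=application/octet-stream key
openstack secret get http://keyserver.rados:9311/v1/secrets/616b2ce2-053a-41e3-b51e-0ff53e33cf81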






 22:10:05.519901 7f056f7eb700 5 ERROR: failed to retrieve actual 
key from key_id: 616b2ce2-053a-41e3-b51e-0ff53e33cf81
 22:10:05.519980 7f056f7eb700 2 req 387:1.581432:s3:PUT 
/encrypted.txt:put_obj:completing
 22:10:05.520187 7f056f7eb700 2 req 387:1.581640:s3:PUT 
/encrypted.txt:put_obj:op status=-13
 22:10:05.520193 7f056f7eb700 2 req 387:1.581645:s3:PUT 
/encrypted.txt:put_obj:http status=403
 22:10:05.520206 7f056f7eb700 1 == req done req=0x7f056f7e5190 
op status=-13 http_status=403 ==

 22:10:05.520225 7f056f7eb700 20 process_request() returned -13
 22:10:05.520280 7f056f7eb700 1 civetweb: 0x5654042a1000: 
192.168.100.200 - - [02/Mar/2018:22:10:03 +0530] "PUT /encrypted.txt 
HTTP/1.1" 1 0 - Boto/2.38.0 Python/2.7.12 Linux/4.12.1-041201-generic

 22:10:06.116527 7f056e7e9700 20 HTTP_ACCEPT=*/*

The error is thrown from this line:
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_crypt.cc#L1063


I am unable to understand why it's throwing the error.

In ceph.conf following settings are done.

[global]
rgw barbican url = http://keyserver.rados:9311
rgw keystone barbican user = rgwcrypt
rgw keystone barbican password = rgwpass
rgw keystone barbican project = service
rgw keystone barbican domain = default
rgw keystone url = http://keyserver.rados:5000
rgw keystone api version = 3
rgw crypt require ssl = false

Can someone help figure out what is missing?

Thanks,
Amar



[ceph-users] SSD as DB/WAL performance with/without drive write cache

2018-03-13 Thread Caspar Smit
Hi all,

I've tested some new Samsung SM863 960GB and Intel DC S4600 240GB SSDs
using the method described in Sebastien Han's blog:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

The first thing stated there is to disable the drive's write cache, which I
did.
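(For anyone repeating this, a sketch of how the volatile write cache is usually
toggled; /dev/sdX is a placeholder:)

hdparm -W 0 /dev/sdX          # disable the drive write cache (SATA)
hdparm -W 1 /dev/sdX          # re-enable it
hdparm -W /dev/sdX            # query the current setting
sdparm --clear=WCE /dev/sdX   # SAS/SCSI equivalent via the caching mode page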

For the Samsungs i got these results:

1 Job: 85 MB/s
5 Jobs: 179 MB/s
10 Jobs: 179 MB/s

I was curious what the results would be with the drive write cache on, so I
turned it on.

Now i got these results:

1 Job: 49 MB/s
5 Jobs: 110 MB/s
10 Jobs: 132 MB/s

So I didn't expect these results to be worse, because I would have assumed a drive
write cache would make it faster.

For the Intels i got more or less the same conclusion (with different
figures) but the performance with drive write cache was about half the
performance as without drive write cache.

Questions:

1) Is this expected behaviour (for all/most SSDs)? If yes, why?
2) Is this only with this type of test?
3) Should I always disable the drive write cache for SSDs during boot?
4) Is there any negative side-effect of disabling the drive's write cache?
5) Are these tests still relevant for DB/WAL devices? The blog is written
for Filestore and states that all journal writes are sequential, but is that also
true for BlueStore DB/WAL writes? Do I need to test differently for DB/WAL?

Kind regards,
Caspar


Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-13 Thread Fulvio Galeazzi

Hallo!


Discards appear like they are being sent to the device.  How big of a
temporary file did you create and then delete? Did you sync the file
to disk before deleting it? What version of qemu-kvm are you running?


I ran several tests with commands like this (issuing sync after each operation):

dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct

What I see is that if I repeat the command with count<=200 the size does 
not increase.


Let's try now with count>200:

NAME                                         PROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2284M

dd if=/dev/zero of=/tmp/fileTest bs=1M count=750 oflag=direct
dd if=/dev/zero of=/tmp/fileTest2 bs=1M count=750 oflag=direct
sync

NAME                                         PROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M

rm /tmp/fileTest*
sync
sudo fstrim -v /
/: 14.1 GiB (15145271296 bytes) trimmed

NAME                                         PROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736  15360M 2528M



As for qemu-kvm, the guest OS is CentOS7, with:

[centos@testcentos-deco3 tmp]$ rpm -qa | grep qemu
qemu-guest-agent-2.8.0-2.el7.x86_64

while the host is Ubuntu 16 with:

root@pa1-r2-s10:/home/ubuntu# dpkg -l | grep qemu
ii  qemu-block-extra:amd64   1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64extra block backend modules for qemu-system and 
qemu-utils
ii  qemu-kvm 1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU Full virtualization
ii  qemu-system-common   1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU full system emulation binaries (common files)
ii  qemu-system-x86  1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU full system emulation binaries (x86)
ii  qemu-utils   1:2.8+dfsg-3ubuntu2.9~cloud1 
   amd64QEMU utilities



  Thanks!

Fulvio





Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap

2018-03-13 Thread Fulvio Galeazzi

Hallo Jason,
thanks for your feedback!

 Original Message 
>> * decorated a CentOS image with hw_scsi_model=virtio--scsi,hw_disk_bus=scsi
> Is that just a typo for "hw_scsi_model"?
Yes, it was a typo when I wrote my message. The image has virtio-scsi as 
it should.



I see that commands:
rbd --cluster cephpa1 diff cinder-ceph/${theVol} | awk '{ SUM += $2 } END {
print SUM/1024/1024 " MB" }' ; rados --cluster cephpa1 -p cinder-ceph ls |
grep rbd_data.{whatever} | wc -l


That's pretty old-school -- you can just use 'rbd du' now to calculate
the disk usage.


Good to know, thanks!


  show that the size increases but does not decrease when I delete the 
temporary file and execute
 sudo fstrim -v /


Have you verified that your VM is indeed using virtio-scsi? Does
blktrace show SCSI UNMAP operations being issued to the block device
when you execute "fstrim"?


Thanks for the tip, I think I need some more help, please.

Disk on my VM is indeed /dev/sda rather than /dev/vda. The XML shows:
.

  
.
  name='cinder-ceph/volume-80838a69-e544-47eb-b981-a4786be89736'>

.
  
  80838a69-e544-47eb-b981-a4786be89736
  


  function='0x0'/>




As for blktrace, blkparse shows me tons of lines; please find below the 
first ones and one of the many groups of lines which I see:


  8,00   11 4.333917112 24677  Q FWFSM 8406583 + 4 [fstrim]
  8,00   12 4.333919649 24677  G FWFSM 8406583 + 4 [fstrim]
  8,00   13 4.333920695 24677  P   N [fstrim]
  8,00   14 4.333922965 24677  I FWFSM 8406583 + 4 [fstrim]
  8,00   15 4.333924575 24677  U   N [fstrim] 1
  8,00   20 4.340140041 24677  Q   D 986016 + 2097152 [fstrim]
  8,00   21 4.340144908 24677  G   D 986016 + 2097152 [fstrim]
  8,00   22 4.340145561 24677  P   N [fstrim]
  8,00   24 4.340147495 24677  Q   D 3083168 + 1112672 [fstrim]
  8,00   25 4.340149772 24677  G   D 3083168 + 1112672 [fstrim]
.
  8,00   50 4.340556955 24677  Q   D 665880 + 20008 [fstrim]
  8,00   51 4.340558481 24677  G   D 665880 + 20008 [fstrim]
  8,00   52 4.340558728 24677  P   N [fstrim]
  8,00   53 4.340559725 24677  I   D 665880 + 20008 [fstrim]
  8,00   54 4.340560292 24677  U   N [fstrim] 1
  8,00   55 4.340560801 24677  D   D 665880 + 20008 [fstrim]
.

Apologies for my ignorance, is the above enough to understand whether 
SCSI UNMAP operations are being issued?
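One additional sanity check inside the guest, independent of blktrace, is whether
the block device advertises discard support at all (a sketch):

lsblk -D /dev/sda                            # DISC-GRAN/DISC-MAX of 0 means no discard support
cat /sys/block/sda/queue/discard_granularity
cat /sys/block/sda/queue/discard_max_bytes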


  Thanks a lot!

Fulvio





Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-13 Thread Ilya Dryomov
On Mon, Mar 12, 2018 at 8:20 PM, Maged Mokhtar  wrote:
> On 2018-03-12 21:00, Ilya Dryomov wrote:
>
> On Mon, Mar 12, 2018 at 7:41 PM, Maged Mokhtar  wrote:
>
> On 2018-03-12 14:23, David Disseldorp wrote:
>
> On Fri, 09 Mar 2018 11:23:02 +0200, Maged Mokhtar wrote:
>
> 2) I understand that before switching the path, the initiator will send a
> TMF ABORT. Can we pass this down to the same abort_request() function
> in osd_client that is used for osd_request_timeout expiry?
>
>
> IIUC, the existing abort_request() codepath only cancels the I/O on the
> client/gw side. A TMF ABORT successful response should only be sent if
> we can guarantee that the I/O is terminated at all layers below, so I
> think this would have to be implemented via an additional OSD epoch
> barrier or similar.
>
> Cheers, David
>
> Hi David,
>
> I was thinking we would get the block request then loop down to all its osd
> requests and cancel those using the same  osd request cancel function.
>
>
> All that function does is tear down OSD client / messenger data
> structures associated with the OSD request.  Any OSD request that hit
> the TCP layer may eventually get through to the OSDs.
>
> Thanks,
>
> Ilya
>
> Hi Ilya,
>
> OK, so I guess this also applies to osd_request_timeout expiry: it
> is not guaranteed to stop all stale I/Os.

Yes.  The purpose of osd_request_timeout is to unblock the client side
by failing the I/O on the client side.  It doesn't attempt to stop any
in-flight I/O -- it simply marks it as failed.

Thanks,

Ilya
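For reference, on kernels whose libceph supports it, that timeout is set per
mapping as an rbd map option; a sketch (30 seconds is just an example value, and
0, the default, means no timeout):

rbd map image-XX -o osd_request_timeout=30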


Re: [ceph-users] New Ceph cluster design

2018-03-13 Thread Christian Balzer

Hello,

On Sat, 10 Mar 2018 16:14:53 +0100 Vincent Godin wrote:

> Hi,
> 
> As i understand it, you'll have one RAID1 of two SSDs for 12 HDDs. A
> WAL is used for all writes on your host. 

This isn't filestore, AFAIK the WAL/DB will be used for small writes only
to keep latency with Bluestore akin to filestore levels.
Large writes will go directly to the HDDs.

However each write will of course necessitate a write to the DB and thus
IOPS (much more so than bandwidth) are paramount here.

> If you have good SSDs, they
> can handle 450-550 MBpsc. Your 12 HDDs SATA can handle 12 x 100 MBps
> that is to say 1200 GBps. 

Aside from what I wrote above, I'd like to repeat myself and others here
for the umpteenth time: focusing on bandwidth is a fallacy in nearly all
use cases; IOPS tend to become the bottleneck.

Also, that's 1.2GB/s or 1200MB/s. 

The OP stated 10TB HDDs and many (but not exclusively?) small objects,
so if we're looking at lots of small writes the bandwidth of the SSDs
becomes a factor again and with the sizes involved they appear too small
as well. (going with the rough ratio of 10GB per TB).

Either a RAID1 of at least 1600GB NVMes or 2 800GB NVMes and a resulting
failure domain of 6 HDDs would be better/safer fit. 
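As a back-of-the-envelope under that rough 10GB-per-TB figure (illustrative only,
not a hard rule):

# 12 x 10TB HDDs in the node, ~10GB of DB per TB of HDD
echo $((12 * 10 * 10))GB   # ~1200GB of DB space wanted across the node
echo $((6 * 10 * 10))GB    # ~600GB per NVMe if split over two devices with 6 HDDs each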

> So your RAID 1 will be the bottleneck with
> this design. A good design would be to have one SSD for 4 or 5 HDDs. In
> your case, the best option would be to start with 3 SSDs for 12 HDDs
> to have a balanced node. Don't forget to choose SSDs with a high DWPD
> ratio (>10)
> 
More SSDs/NVMes are of course better and DWPD is important, but probably
less so than with filestore journals.
A DWPD of >10 is overkill for anything I've ever encountered, for many
things 3 will be fine, especially if one knows what is expected.

For example a filestore cache tier SSD with inline journal (800GB DC S3610,
3 DWPD) has a media wearout of 97 (3% used) after 2 years with this
constant and not insignificant load:
---
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdb   0.0383.097.07  303.24   746.64  5084.9937.59 
0.050.150.710.13   0.06   2.00
---

300 write IOPS and 5MB/s for all that time.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications