[ceph-users] PG numbers don't add up?
I try to add a data pool:

OSD_STAT USED   AVAIL TOTAL HB_PEERS            PG_SUM PRIMARY_PG_SUM
9        1076M  930G  931G  [0,1,2,3,4,5,6,7,8] 128    5
8        1076M  930G  931G  [0,1,2,3,4,5,6,7,9] 128    14
7        1076M  930G  931G  [0,1,2,3,4,5,6,8,9] 128    14
6        1076M  930G  931G  [0,1,2,3,4,5,7,8,9] 128    19
5        1076M  930G  931G  [0,1,2,3,4,6,7,8,9] 128    15
4        1076M  930G  931G  [0,1,2,3,5,6,7,8,9] 128    17
0        1076M  930G  931G  [1,2,3,4,5,6,7,8,9] 128    16
1        1076M  930G  931G  [0,2,3,4,5,6,7,8,9] 128    8
2        1076M  930G  931G  [0,1,3,4,5,6,7,8,9] 128    8
3        1076M  930G  931G  [0,1,2,4,5,6,7,8,9] 128    12
sum      10765M 9304G 9315G

I try to add a metadata pool:

sum 0 0 0 0 0 0 0 0
OSD_STAT USED   AVAIL TOTAL HB_PEERS            PG_SUM PRIMARY_PG_SUM
9        1076M  930G  931G  [0,1,2,3,4,5,6,7,8] 73     73
8        1076M  930G  931G  [0,1,2,3,4,5,6,7,9] 40     40
7        1076M  930G  931G  [0,1,2,3,4,5,6,8,9] 56     56
6        1076M  930G  931G  [0,1,2,3,4,5,7,8,9] 42     42
5        1076M  930G  931G  [0,1,2,3,4,6,7,8,9] 54     54
4        1076M  930G  931G  [0,1,2,3,5,6,7,8,9] 59     59
0        1076M  930G  931G  [1,2,3,4,5,6,7,8,9] 38     38
1        1076M  930G  931G  [0,2,3,4,5,6,7,8,9] 57     57
2        1076M  930G  931G  [0,1,3,4,5,6,7,8,9] 45     45
3        1076M  930G  931G  [0,1,2,4,5,6,7,8,9] 48     48
sum      10766M 9304G 9315G

I try to add both pools:

Error ERANGE: pg_num 128 size 10 would mean 2816 total pgs, which exceeds max 2000 (mon_max_pg_per_osd 200 * num_in_osds 10)

That's over a thousand more PGs than both pools combined. Where are they coming from?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
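For what it's worth, the numbers in the error make more sense once you notice the monitor appears to count PG replicas/shards (pg_num × pool size) rather than raw PGs — that is my reading of the check, not a quote of the mon code:

```shell
# The cap quoted in the error: mon_max_pg_per_osd * num_in_osds.
max_total=$((200 * 10))   # -> 2000, the "max 2000" in the error message
# Each pool is counted as pg_num * size (replicas/shards), not raw PGs:
new_pool=$((128 * 10))    # the pool being created: pg_num 128, size 10
echo "$max_total $new_pool"
```

So a single pg_num=128, size=10 pool already contributes 1280 of the 2816 projected shards, with the existing pools' shards making up the rest.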
Re: [ceph-users] Understanding/correcting sudden onslaught of unfound objects
Updated cluster now to 12.2.4 and the cycle of inconsistent->repair->unfound seems to continue, though possibly slightly differently. A pg does pass through an "active+clean" phase after repair, which might be new, but more likely I never observed it at the right time before.

I see messages like this in the logs now — "attr name mismatch 'hinfo_key'" — perhaps this might cast more light on the cause:

2018-03-02 18:55:11.583850 osd.386 osd.386 10.31.0.72:6817/4057280 401 : cluster [ERR] 70.3dbs0 : soid 70:dbc6ed68:::default.325674.85_bellplants_images%2f1055211.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:00:18.031929 osd.386 osd.386 10.31.0.72:6817/4057280 428 : cluster [ERR] 70.3dbs0 : soid 70:dbc97561:::default.325674.85_bellplants_images%2f1017818.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:04:50.058477 osd.386 osd.386 10.31.0.72:6817/4057280 452 : cluster [ERR] 70.3dbs0 : soid 70:dbcbcb34:::default.325674.85_bellplants_images%2f1049756.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:13:05.689136 osd.386 osd.386 10.31.0.72:6817/4057280 494 : cluster [ERR] 70.3dbs0 : soid 70:dbcfc7c9:::default.325674.85_bellplants_images%2f1021177.jpg:head attr name mismatch 'hinfo_key'
2018-03-02 19:13:30.883100 osd.386 osd.386 10.31.0.72:6817/4057280 495 : cluster [ERR] 70.3dbs0 repair 0 missing, 161 inconsistent objects
2018-03-02 19:13:30.888259 osd.386 osd.386 10.31.0.72:6817/4057280 496 : cluster [ERR] 70.3db repair 161 errors, 161 fixed

The only similar-sounding issue I could find is http://tracker.ceph.com/issues/20089

When I look at be_compare_scrubmaps in src/osd/PGBackend.cc in luminous, I don't see the changes from this commit: https://github.com/ceph/ceph/pull/15368/files

Of course a lot of other things have changed, but is it possible this fix never made it into luminous?

Graham

On 02/17/2018 12:48 PM, David Zafman wrote:
The commits below came after v12.2.2 and may impact this issue.
When a pg is active+clean+inconsistent, it means that scrub has detected issues with one or more replicas of one or more objects. An unfound object is a potentially temporary state in which the current set of available OSDs doesn't allow an object to be recovered/backfilled/repaired. When the primary OSD restarts, any unfound objects (an in-memory structure) are reset so that the new set of peered OSDs can determine again which objects are unfound.

I'm not clear in this scenario whether recovery failed to start, recovery hung due to a bug, or recovery stopped (as designed) because of the unfound object. The new recovery_unfound and backfill_unfound states indicate that recovery has stopped due to unfound objects.

commit 64047e1bac2e775a06423a03cfab69b88462538c
Author: David Zafman
Date: Wed Jan 10 13:30:41 2018 -0800

    osd: Don't start recovery for missing until active pg state set

    I was seeing recovery hang when it is started before _activate_committed().
    The state machine passes into "Active" but this transitions to the activating
    pg state and only after committed into the "active" pg state.
    Signed-off-by: David Zafman

commit 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3
Author: David Zafman
Date: Mon Oct 9 08:19:21 2017 -0700

    osd, mon: Add new pg states recovery_unfound and backfill_unfound

    Signed-off-by: David Zafman

On 2/16/18 1:40 PM, Gregory Farnum wrote:
On Fri, Feb 16, 2018 at 12:17 PM Graham Allan wrote:
On 02/16/2018 12:31 PM, Graham Allan wrote:

If I set debug rgw=1 and debug ms=1 before running the "object stat" command, it seems to stall in a loop of trying to communicate with osds for pool 96, which is .rgw.control:

10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 uv3933745 ondisk = 0) v8 152+0+0 (2536111836 0 0) 0x7f1158003e20 con 0x7f117afd8390

Prior to that, probably more relevant, this was the only communication logged with the primary osd of the pg:

10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_op(unknown.0.0:96 70.438s0 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 osd_backoff(70.438s0 block id 1 [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) e507695) v1 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con 0x7fab79885980
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 --
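To answer the "did this fix make it into luminous?" question directly, git can check whether a commit is an ancestor of a release tag. In a ceph.git clone that would be (hypothetical usage, commit hash taken from above): `git merge-base --is-ancestor 7f8b0ce9e681f727d8217e3ed74a1a3355f364f3 v12.2.4`. A self-contained sketch of the mechanism on a throwaway repo:

```shell
# Demonstrate "is commit X contained in tag Y" on a scratch repo;
# the same merge-base check applies to any commit/tag pair in ceph.git.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m "base"
git tag v1.0
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m "the fix"
fix=$(git rev-parse HEAD)

# The fix was committed after the tag, so it is NOT contained in v1.0:
if git merge-base --is-ancestor "$fix" v1.0; then
  echo "fix is in v1.0"
else
  echo "fix is NOT in v1.0"
fi
```

`git tag --contains <sha>` answers the inverse question (which tags contain the commit) in one command.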
Re: [ceph-users] Cephfs MDS slow requests
On Tue, Mar 13, 2018 at 2:56 PM David C wrote:
> Thanks for the detailed response, Greg. A few follow ups inline:
>
> On 13 Mar 2018 20:52, "Gregory Farnum" wrote:
>
> On Tue, Mar 13, 2018 at 12:17 PM, David C wrote:
> > Hi All
> >
> > I have a Samba server that is exporting directories from a Cephfs Kernel
> > mount. Performance has been pretty good for the last year but users have
> > recently been complaining of short "freezes", these seem to coincide with
> > MDS related slow requests in the monitor ceph.log such as:
> >
> >> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 : cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently commit_sent
> >> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 : cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently commit_sent
> >> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 : cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently commit_sent
> >> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 : cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently commit_sent
> >> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 : cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently commit_sent
>
> Well, that means your OSDs are getting operations that commit quickly
> to a journal but are taking a while to get into the backing
> filesystem. (I assume this is on filestore based on that message
> showing up at all, but could be missing something.)
>
> Yep it's filestore. Journals are on Intel P3700 NVME, data and metadata
> pools both on 7200rpm SATA. Sounds like I might benefit from moving
> metadata to a dedicated SSD pool.
>
> In the meantime, are there any recommended tunables? Filestore max/min
> sync interval for example?

Well, you can try. I'm not sure what the most successful deployments look like. If you turn up the min sync interval you stand a better chance of only doing one write to your HDD if files get overwritten, for instance. But it may also mean that your commits end up taking so long that you get worse IO stalls, if there's no opportunity to coalesce and the reality is just that you're trying to push more IOs through the system than the backing HDDs can support.
-Greg

> > --
> >
> > Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule
> > 1" and "setfilelockrule 2":
> >
> >> 2018-03-13 14:23:00.446905 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162337 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2
> >> 2018-03-13 14:23:00.447050 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162338 setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
> >> 2018-03-13 14:23:00.447258 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162339 setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
> >> 2018-03-13 14:23:00.447393 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162340 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2
>
> And that is clients setting (and releasing) advisory locks on files. I
> don't think this should directly have anything to do with the slow OSD
> requests (file locking is ephemeral state, not committed to disk), but
> if you have new applications running which are taking file locks on
> shared files that could definitely impede other clients and slow
> things down more generally.
> -Greg
>
> Sounds like that could be a red herring
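If you do want to experiment with the sync intervals Greg mentions, they live under the [osd] section of ceph.conf. The values below are illustrative placeholders, not recommendations (the defaults are, to my knowledge, 0.01 and 5 seconds):

```ini
[osd]
; Illustrative values only - raising the min interval gives filestore more
; chance to coalesce overwrites, at the cost of larger, burstier commits.
filestore min sync interval = 0.1
filestore max sync interval = 10
```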
Re: [ceph-users] Cephfs MDS slow requests
Thanks for the detailed response, Greg. A few follow ups inline:

On 13 Mar 2018 20:52, "Gregory Farnum" wrote:

On Tue, Mar 13, 2018 at 12:17 PM, David C wrote:
> Hi All
>
> I have a Samba server that is exporting directories from a Cephfs Kernel
> mount. Performance has been pretty good for the last year but users have
> recently been complaining of short "freezes", these seem to coincide with
> MDS related slow requests in the monitor ceph.log such as:
>
>> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 : cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently commit_sent
>> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 : cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently commit_sent
>> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 : cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently commit_sent
>> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 : cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently commit_sent
>> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 : cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently commit_sent

Well, that means your OSDs are getting operations that commit quickly to a journal but are taking a while to get into the backing filesystem. (I assume this is on filestore based on that message showing up at all, but could be missing something.)

Yep it's filestore. Journals are on Intel P3700 NVME, data and metadata pools both on 7200rpm SATA. Sounds like I might benefit from moving metadata to a dedicated SSD pool.

In the meantime, are there any recommended tunables? Filestore max/min sync interval for example?

> --
>
> Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule
> 1" and "setfilelockrule 2":
>
>> 2018-03-13 14:23:00.446905 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162337 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2
>> 2018-03-13 14:23:00.447050 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162338 setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447258 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162339 setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447393 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162340 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2

And that is clients setting (and releasing) advisory locks on files. I don't think this should directly have anything to do with the slow OSD requests (file locking is ephemeral state, not committed to disk), but if you have new applications running which are taking file locks on shared files that could definitely impede other clients and slow things down more generally.
-Greg

Sounds like that could be a red herring then. It seems like my issue is users chucking lots of small writes at the cephfs mount.

> --
>
> I don't have a particularly good monitoring set up on this cluster yet, but
> a cursory look at a few things such as iostat doesn't seem to suggest OSDs
> are being hammered.
>
> Some questions:
>
> 1) Can anyone recommend a way of diagnosing this issue?
> 2) Are the multiple "setfilelockrule" per inode to be expected? I assume
> this is something to do with the Samba oplocks.

Hmm, you might be right about the oplocks. That's an output format error btw, it should be "setfilelock" (the op type) and a separate word " rule " (indicating the type of lock, 1 for shared and 2 for exclusive).

> 3) What's the recommended
Re: [ceph-users] Cephfs MDS slow requests
On Tue, Mar 13, 2018 at 12:17 PM, David C wrote:
> Hi All
>
> I have a Samba server that is exporting directories from a Cephfs Kernel
> mount. Performance has been pretty good for the last year but users have
> recently been complaining of short "freezes", these seem to coincide with
> MDS related slow requests in the monitor ceph.log such as:
>
>> 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 : cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently commit_sent
>> 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 : cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently commit_sent
>> 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 : cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently commit_sent
>> 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 : cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently commit_sent
>> 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 : cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently commit_sent

Well, that means your OSDs are getting operations that commit quickly to a journal but are taking a while to get into the backing filesystem. (I assume this is on filestore based on that message showing up at all, but could be missing something.)

> --
>
> Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule
> 1" and "setfilelockrule 2":
>
>> 2018-03-13 14:23:00.446905 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162337 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 120, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2
>> 2018-03-13 14:23:00.447050 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162338 setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447258 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162339 setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0, length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, caller_gid=0{}) v2
>> 2018-03-13 14:23:00.447393 7fde43e73700 4 mds.0.server handle_client_request client_request(client.9174621:141162340 setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124, length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, caller_gid=1131{}) v2

And that is clients setting (and releasing) advisory locks on files. I don't think this should directly have anything to do with the slow OSD requests (file locking is ephemeral state, not committed to disk), but if you have new applications running which are taking file locks on shared files that could definitely impede other clients and slow things down more generally.
-Greg

> --
>
> I don't have a particularly good monitoring set up on this cluster yet, but
> a cursory look at a few things such as iostat doesn't seem to suggest OSDs
> are being hammered.
>
> Some questions:
>
> 1) Can anyone recommend a way of diagnosing this issue?
> 2) Are the multiple "setfilelockrule" per inode to be expected? I assume
> this is something to do with the Samba oplocks.

Hmm, you might be right about the oplocks. That's an output format error btw, it should be "setfilelock" (the op type) and a separate word " rule " (indicating the type of lock, 1 for shared and 2 for exclusive).

> 3) What's the recommended highest MDS debug setting before performance
> starts to be adversely affected (I'm aware log files will get huge)?

There's not a good answer here. If you actually encounter a bug developers will want at least 10, and probably 20, but both of those have measurable performance impacts. :/

> 4) What's the best way of matching inodes in the MDS log to the file names
> in cephfs?

If you have an actual log then before the inode gets used it has to get found, and that will be by path. You can also look at the xattrs on the object in rados; one
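For reference, the client-side operations that surface as setfilelock requests are ordinary advisory locks. A quick way to generate the same kind of lock traffic from a shell is flock(1) — sketched here on a local temp file; on a CephFS mount each acquire/release would become an MDS client_request:

```shell
# Take and release a shared, then an exclusive, advisory lock on a file.
lockfile=$(mktemp)
exec 9>"$lockfile"   # open the file on fd 9
flock -s 9           # shared (read) lock
flock -u 9           # release
flock -x 9           # exclusive (write) lock
flock -u 9
echo "locks acquired and released"
```

Samba's oplock/byte-range handling issues the equivalent fcntl() calls internally, which is consistent with the flood of setfilelock entries in the MDS log.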
Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
Can you provide the output from "rbd info /volume-80838a69-e544-47eb-b981-a4786be89736"?

On Tue, Mar 13, 2018 at 12:30 PM, Fulvio Galeazzi wrote:
> Hallo!
>
>> Discards appear like they are being sent to the device. How big of a
>> temporary file did you create and then delete? Did you sync the file
>> to disk before deleting it? What version of qemu-kvm are you running?
>
> I made several tests with commands like (issuing sync after each operation):
>
> dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct
>
> What I see is that if I repeat the command with count<=200 the size does not
> increase.
>
> Let's try now with count>200:
>
> NAME                                         PROVISIONED USED
> volume-80838a69-e544-47eb-b981-a4786be89736      15360M  2284M
>
> dd if=/dev/zero of=/tmp/fileTest bs=1M count=750 oflag=direct
> dd if=/dev/zero of=/tmp/fileTest2 bs=1M count=750 oflag=direct
> sync
>
> NAME                                         PROVISIONED USED
> volume-80838a69-e544-47eb-b981-a4786be89736      15360M  2528M
>
> rm /tmp/fileTest*
> sync
> sudo fstrim -v /
> /: 14.1 GiB (15145271296 bytes) trimmed
>
> NAME                                         PROVISIONED USED
> volume-80838a69-e544-47eb-b981-a4786be89736      15360M  2528M
>
> As for qemu-kvm, the guest OS is CentOS7, with:
>
> [centos@testcentos-deco3 tmp]$ rpm -qa | grep qemu
> qemu-guest-agent-2.8.0-2.el7.x86_64
>
> while the host is Ubuntu 16 with:
>
> root@pa1-r2-s10:/home/ubuntu# dpkg -l | grep qemu
> ii qemu-block-extra:amd64  1:2.8+dfsg-3ubuntu2.9~cloud1  amd64  extra block backend modules for qemu-system and qemu-utils
> ii qemu-kvm                1:2.8+dfsg-3ubuntu2.9~cloud1  amd64  QEMU Full virtualization
> ii qemu-system-common      1:2.8+dfsg-3ubuntu2.9~cloud1  amd64  QEMU full system emulation binaries (common files)
> ii qemu-system-x86         1:2.8+dfsg-3ubuntu2.9~cloud1  amd64  QEMU full system emulation binaries (x86)
> ii qemu-utils              1:2.8+dfsg-3ubuntu2.9~cloud1  amd64  QEMU utilities
>
> Thanks!
>
> Fulvio

--
Jason
Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
Discards appear like they are being sent to the device. How big of a temporary file did you create and then delete? Did you sync the file to disk before deleting it? What version of qemu-kvm are you running?

On Tue, Mar 13, 2018 at 11:09 AM, Fulvio Galeazzi wrote:
> Hallo Jason,
> thanks for your feedback!
>
> Original Message
>> * decorated a CentOS image with hw_scsi_model=virtio--scsi,hw_disk_bus=scsi
>> Is that just a typo for "hw_scsi_model"?
>
> Yes, it was a typo when I wrote my message. The image has virtio-scsi as it
> should.
>
>>> I see that commands:
>>> rbd --cluster cephpa1 diff cinder-ceph/${theVol} | awk '{ SUM += $2 } END
>>> { print SUM/1024/1024 " MB" }' ; rados --cluster cephpa1 -p cinder-ceph ls
>>> | grep rbd_data.{whatever} | wc -l
>>
>> That's pretty old-school -- you can just use 'rbd du' now to calculate
>> the disk usage.
>
> Good to know, thanks!
>
>>> show the size increases but does not decrease when I delete the
>>> temporary file and execute
>>> sudo fstrim -v /
>>
>> Have you verified that your VM is indeed using virtio-scsi? Does
>> blktrace show SCSI UNMAP operations being issued to the block device
>> when you execute "fstrim"?
>
> Thanks for the tip, I think I need some more help, please.
>
> Disk on my VM is indeed /dev/sda rather than /dev/vda. The XML shows:
> .
> .
> name='cinder-ceph/volume-80838a69-e544-47eb-b981-a4786be89736'
> .
> 80838a69-e544-47eb-b981-a4786be89736
> .
> function='0x0'/
>
> As for blktrace, blkparse shows me tons of lines, please find below the
> first ones and one of the many groups of lines which I see:
>
> 8,00 11 4.333917112 24677 Q FWFSM 8406583 + 4 [fstrim]
> 8,00 12 4.333919649 24677 G FWFSM 8406583 + 4 [fstrim]
> 8,00 13 4.333920695 24677 P N [fstrim]
> 8,00 14 4.333922965 24677 I FWFSM 8406583 + 4 [fstrim]
> 8,00 15 4.333924575 24677 U N [fstrim] 1
> 8,00 20 4.340140041 24677 Q D 986016 + 2097152 [fstrim]
> 8,00 21 4.340144908 24677 G D 986016 + 2097152 [fstrim]
> 8,00 22 4.340145561 24677 P N [fstrim]
> 8,00 24 4.340147495 24677 Q D 3083168 + 1112672 [fstrim]
> 8,00 25 4.340149772 24677 G D 3083168 + 1112672 [fstrim]
> .
> 8,00 50 4.340556955 24677 Q D 665880 + 20008 [fstrim]
> 8,00 51 4.340558481 24677 G D 665880 + 20008 [fstrim]
> 8,00 52 4.340558728 24677 P N [fstrim]
> 8,00 53 4.340559725 24677 I D 665880 + 20008 [fstrim]
> 8,00 54 4.340560292 24677 U N [fstrim] 1
> 8,00 55 4.340560801 24677 D D 665880 + 20008 [fstrim]
> .
>
> Apologies for my ignorance, is the above enough to understand whether SCSI
> UNMAP operations are being issued?
>
> Thanks a lot!
>
> Fulvio

--
Jason
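The per-volume usage calculation from earlier in the thread can be sanity-checked with canned input; against a real cluster the pipeline's input would come from `rbd --cluster cephpa1 diff cinder-ceph/<volume>` (or, more simply these days, `rbd du`):

```shell
# Sum the extent lengths (second column of "rbd diff" output) and report
# megabytes; input here is canned so the sketch is self-contained.
printf '%s\n' \
  '0        4194304  data' \
  '4194304  4194304  data' \
  '8388608  2097152  data' |
awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
```

With these three extents (4 MiB + 4 MiB + 2 MiB) the pipeline prints "10 MB".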
Re: [ceph-users] SSD as DB/WAL performance with/without drive write cache
Hi Vladimir,

Yeah, the results are pretty low compared to yours, but I think this is due to the fact that this SSD is in a fairly old server (Supermicro X8, SAS2 expander backplane). The controller is an LSI/Broadcom 9207-8i on the latest IT firmware (same LSI2308 chipset as yours).

Kind regards,
Caspar

2018-03-13 21:00 GMT+01:00 Дробышевский, Владимир:
> Hello, Caspar!
>
> Would you mind sharing the controller model you use? I would say these
> results are pretty low.
>
> Here are my results on an Intel RMS25LB LSI2308-based SAS controller in IT
> mode:
>
> I set write_cache to write through.
>
> Test command, fio 2.2.10:
>
> sudo fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=XXX --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=journal-test
>
> where XXX is the number of jobs.
>
> Results:
>
> numjobs: 1
> write: io=5068.6MB, bw=86493KB/s, iops=21623, runt= 60001msec
> clat (usec): min=38, max=8343, avg=45.01, stdev=32.10
>
> numjobs: 5
> write: io=14548MB, bw=248274KB/s, iops=62068, runt= 60001msec
> clat (usec): min=40, max=11291, avg=79.05, stdev=46.37
>
> numjobs: 10
> write: io=14762MB, bw=251939KB/s, iops=62984, runt= 60001msec
> clat (usec): min=52, max=10356, avg=157.16, stdev=65.69
>
> I have got even better results on a z97 integrated SATA controller; you can
> find them in the comments to the post you have mentioned (
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/#comment-3273882789 ).
>
> Still don't know why the LSI 2308 SAS performance is worse than z97 SATA, and
> I can't find any info on why the write back cache setting gives slower writes
> than write through.
>
> But I would suggest paying more attention to IOPS than to the sequential
> write speed, especially on small-block workloads.
>
> 2018-03-13 21:33 GMT+05:00 Caspar Smit:
>
>> Hi all,
>>
>> I've tested some new Samsung SM863 960GB and Intel DC S4600 240GB SSDs
>> using the method described at Sebastien Han's blog:
>>
>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>
>> The first thing stated there is to disable the drive's write cache, which
>> I did.
>>
>> For the Samsungs I got these results:
>>
>> 1 Job: 85 MB/s
>> 5 Jobs: 179 MB/s
>> 10 Jobs: 179 MB/s
>>
>> I was curious what the results would be with the drive write cache on, so
>> I turned it on.
>>
>> Now I got these results:
>>
>> 1 Job: 49 MB/s
>> 5 Jobs: 110 MB/s
>> 10 Jobs: 132 MB/s
>>
>> I didn't expect these results to be worse, because I would assume a
>> drive write cache would make it faster.
>>
>> For the Intels I got more or less the same conclusion (with different
>> figures), but the performance with drive write cache was about half the
>> performance without drive write cache.
>>
>> Questions:
>>
>> 1) Is this expected behaviour (for all/most SSDs)? If yes, why?
>> 2) Is this only with this type of test?
>> 3) Should I always disable drive write cache for SSDs during boot?
>> 4) Is there any negative side-effect of disabling the drive's write cache?
>> 5) Are these tests still relevant for DB/WAL devices? The blog is written
>> for Filestore and states all journal writes are sequential, but is that also
>> true for bluestore DB/WAL writes? Do I need to test differently for DB/WAL?
>>
>> Kind regards,
>> Caspar
>
> --
> Best regards,
> Дробышевский Владимир
> "АйТи Город"
> +7 343 192
>
> IT consulting
> Turnkey project delivery
> IT services outsourcing
> IT infrastructure outsourcing
Re: [ceph-users] SSD as DB/WAL performance with/without drive write cache
Hello, Caspar!

Would you mind sharing the controller model you use? I would say these results are pretty low.

Here are my results on an Intel RMS25LB LSI2308-based SAS controller in IT mode:

I set write_cache to write through.

Test command, fio 2.2.10:

sudo fio --filename=/dev/sdb --direct=1 --sync=1 --rw=write --bs=4k --numjobs=XXX --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

where XXX is the number of jobs.

Results:

numjobs: 1
write: io=5068.6MB, bw=86493KB/s, iops=21623, runt= 60001msec
clat (usec): min=38, max=8343, avg=45.01, stdev=32.10

numjobs: 5
write: io=14548MB, bw=248274KB/s, iops=62068, runt= 60001msec
clat (usec): min=40, max=11291, avg=79.05, stdev=46.37

numjobs: 10
write: io=14762MB, bw=251939KB/s, iops=62984, runt= 60001msec
clat (usec): min=52, max=10356, avg=157.16, stdev=65.69

I have got even better results on a z97 integrated SATA controller; you can find them in the comments to the post you have mentioned (https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/#comment-3273882789).

Still don't know why the LSI 2308 SAS performance is worse than z97 SATA, and I can't find any info on why the write back cache setting gives slower writes than write through.

But I would suggest paying more attention to IOPS than to the sequential write speed, especially on small-block workloads.

2018-03-13 21:33 GMT+05:00 Caspar Smit:
> Hi all,
>
> I've tested some new Samsung SM863 960GB and Intel DC S4600 240GB SSDs
> using the method described at Sebastien Han's blog:
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> The first thing stated there is to disable the drive's write cache, which
> I did.
>
> For the Samsungs I got these results:
>
> 1 Job: 85 MB/s
> 5 Jobs: 179 MB/s
> 10 Jobs: 179 MB/s
>
> I was curious what the results would be with the drive write cache on, so
> I turned it on.
>
> Now I got these results:
>
> 1 Job: 49 MB/s
> 5 Jobs: 110 MB/s
> 10 Jobs: 132 MB/s
>
> I didn't expect these results to be worse, because I would assume a
> drive write cache would make it faster.
>
> For the Intels I got more or less the same conclusion (with different
> figures), but the performance with drive write cache was about half the
> performance without drive write cache.
>
> Questions:
>
> 1) Is this expected behaviour (for all/most SSDs)? If yes, why?
> 2) Is this only with this type of test?
> 3) Should I always disable drive write cache for SSDs during boot?
> 4) Is there any negative side-effect of disabling the drive's write cache?
> 5) Are these tests still relevant for DB/WAL devices? The blog is written
> for Filestore and states all journal writes are sequential, but is that also
> true for bluestore DB/WAL writes? Do I need to test differently for DB/WAL?
>
> Kind regards,
> Caspar

--
Best regards,
Дробышевский Владимир
"АйТи Город"
+7 343 192

IT consulting
Turnkey project delivery
IT services outsourcing
IT infrastructure outsourcing
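As a quick sanity check on the figures above: for these 4k synchronous writes, bandwidth and IOPS are related by bw = iops × block size, so fio's two numbers should agree with each other.

```shell
# Relate fio's reported bandwidth to IOPS for a 4 KiB block size,
# using the single-job run quoted above (bw=86493KB/s).
bw_kb_s=86493
bs_kb=4
iops=$((bw_kb_s / bs_kb))
echo "$iops"   # 21623, matching fio's reported iops=21623
```

The same arithmetic applied to the 5- and 10-job runs reproduces their iops figures as well, so the bw and iops columns are internally consistent.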
Re: [ceph-users] Civetweb log format
Well, I have it mostly wrapped up and writing to Graylog; however, the ops log has a `remote_addr` field which, as far as I can tell, is always blank. I found this fix, but it seems to only be in v13.0.1: https://github.com/ceph/ceph/pull/16860

Is there any chance we'd see backports of this to Jewel and/or Luminous?

Aaron

On Mar 12, 2018, at 5:50 PM, Aaron Bassett wrote:

Quick update: add the following to your config:

rgw log http headers = "http_authorization"
rgw ops log socket path = /tmp/rgw
rgw enable ops log = true
rgw enable usage log = true

and you can now:

nc -U /tmp/rgw | ./jq --stream 'fromstream(1|truncate_stream(inputs))'
{
  "time": "2018-03-12 21:42:19.479037Z",
  "time_local": "2018-03-12 21:42:19.479037",
  "remote_addr": "",
  "user": "test",
  "operation": "PUT",
  "uri": "/testbucket/",
  "http_status": "200",
  "error_code": "",
  "bytes_sent": 19,
  "bytes_received": 0,
  "object_size": 0,
  "total_time": 600967,
  "user_agent": "Boto/2.46.1 Python/2.7.12 Linux/4.4.0-42-generic",
  "referrer": "",
  "http_x_headers": [
    { "HTTP_AUTHORIZATION": "AWS : " }
  ]
}

A pretty good start on getting an audit log going!

On Mar 9, 2018, at 10:52 PM, Konstantin Shalygin wrote:

> Unfortunately I can't quite figure out how to use it. I've got
> "rgw log http headers = "authorization"" in my ceph.conf but I'm getting
> no love in the rgw log.

I think this should have an 'http_' prefix, like:

rgw log http headers = "http_host, http_x_forwarded_for"

k
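For anyone wiring this into Graylog as Aaron describes: the ops log socket emits JSON records like the one above, so a few lines of Python are enough to flatten them into audit lines. A hedged sketch (the socket-consumption part is left as a comment; the record used here is the one captured in the thread, with some fields omitted):

```python
import json

# One record as emitted on the rgw ops log socket (captured in the thread above).
record = json.loads("""{
  "time": "2018-03-12 21:42:19.479037Z",
  "remote_addr": "",
  "user": "test",
  "operation": "PUT",
  "uri": "/testbucket/",
  "http_status": "200",
  "bytes_sent": 19,
  "http_x_headers": [{"HTTP_AUTHORIZATION": "AWS : "}]
}""")

def audit_line(rec):
    """Flatten one ops-log record into a single audit-log style line."""
    return "{time} {user} {operation} {uri} {http_status}".format(**rec)

print(audit_line(record))  # -> 2018-03-12 21:42:19.479037Z test PUT /testbucket/ 200

# To consume live records instead of a captured one, connect to the socket
# configured above, e.g. (sketch):
#   import socket
#   s = socket.socket(socket.AF_UNIX); s.connect("/tmp/rgw")
#   ... read and json-decode the stream ...
```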
[ceph-users] Cephfs MDS slow requests
Hi All I have a Samba server that is exporting directories from a Cephfs Kernel mount. Performance has been pretty good for the last year but users have recently been complaining of short "freezes", these seem to coincide with MDS related slow requests in the monitor ceph.log such as: 2018-03-13 13:34:58.461030 osd.15 osd.15 10.10.10.211:6812/13367 5752 : > cluster [WRN] slow request 31.834418 seconds old, received at 2018-03-13 > 13:34:26.626474: osd_repop(mds.0.5495:810644 3.3e e14085/14019 > 3:7cea5bac:::10001a88b8f.:head v 14085'846936) currently commit_sent > 2018-03-13 13:34:59.461270 osd.15 osd.15 10.10.10.211:6812/13367 5754 : > cluster [WRN] slow request 32.832059 seconds old, received at 2018-03-13 > 13:34:26.629151: osd_repop(mds.0.5495:810671 2.dc2 e14085/14020 > 2:43bdcc3f:::10001e91a91.:head v 14085'21394) currently commit_sent > 2018-03-13 14:23:57.409427 osd.30 osd.30 10.10.10.212:6824/14997 5708 : > cluster [WRN] slow request 30.536832 seconds old, received at 2018-03-13 > 14:23:26.872513: osd_repop(mds.0.5495:865403 2.fb6 e14085/14077 > 2:6df955ef:::10001e93542.00c4:head v 14085'21296) currently commit_sent > 2018-03-13 14:23:57.409449 osd.30 osd.30 10.10.10.212:6824/14997 5709 : > cluster [WRN] slow request 30.529640 seconds old, received at 2018-03-13 > 14:23:26.879704: osd_repop(mds.0.5495:865407 2.595 e14085/14019 > 2:a9a56101:::10001e93542.00c8:head v 14085'20437) currently commit_sent > 2018-03-13 14:23:57.409453 osd.30 osd.30 10.10.10.212:6824/14997 5710 : > cluster [WRN] slow request 30.503138 seconds old, received at 2018-03-13 > 14:23:26.906207: osd_repop(mds.0.5495:865423 2.ea e14085/14055 > 2:57096bbf:::10001e93542.00d8:head v 14085'21147) currently commit_sent -- Looking in the MDS log, with debug set to 4, it's full of "setfilelockrule 1" and "setfilelockrule 2": 2018-03-13 14:23:00.446905 7fde43e73700 4 mds.0.server > handle_client_request client_request(client.9174621:141162337 > setfilelockrule 1, type 4, owner 
14971048052668053939, pid 7, start 120, > length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, > caller_gid=1131{}) v2 > 2018-03-13 14:23:00.447050 7fde43e73700 4 mds.0.server > handle_client_request client_request(client.9174621:141162338 > setfilelockrule 2, type 4, owner 14971048137043556787, pid 4632, start 0, > length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, > caller_gid=0{}) v2 > 2018-03-13 14:23:00.447258 7fde43e73700 4 mds.0.server > handle_client_request client_request(client.9174621:141162339 > setfilelockrule 2, type 4, owner 14971048137043550643, pid 4632, start 0, > length 0, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=0, > caller_gid=0{}) v2 > 2018-03-13 14:23:00.447393 7fde43e73700 4 mds.0.server > handle_client_request client_request(client.9174621:141162340 > setfilelockrule 1, type 4, owner 14971048052668053939, pid 7, start 124, > length 1, wait 0 #0x10001e8dc37 2018-03-13 14:22:58.838521 caller_uid=1155, > caller_gid=1131{}) v2 -- I don't have a particularly good monitoring set up on this cluster yet, but a cursory look at a few things such as iostat doesn't seem to suggest OSDs are being hammered. Some questions: 1) Can anyone recommend a way of diagnosing this issue? 2) Are the multiple "setfilelockrule" per inode to be expected? I assume this is something to do with the Samba oplocks. 3) What's the recommended highest MDS debug setting before performance starts to be adversely affected (I'm aware log files will get huge)? 4) What's the best way of matching inodes in the MDS log to the file names in cephfs? Hardware/Versions: Luminous 12.1.1 Cephfs client 3.10.0-514.2.2.el7.x86_64 Samba 4.4.4 4 node cluster, each node 1xIntel 3700 NVME, 12x SATA, 40Gbps networking Thanks in advance! Cheers, David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
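On question 4: the inode numbers the MDS logs in hex (e.g. "#0x10001e8dc37" above) show up as the decimal st_ino on a mounted client, so `find -inum` can map one back to a path. A small sketch of the conversion (the /mnt/cephfs mount point is a hypothetical example, and on a large tree `find` will be slow):

```python
# Inode number taken from the MDS log excerpt above; the kernel cephfs
# client exposes the same value in decimal as the file's st_ino.
ino_hex = "0x10001e8dc37"
ino_dec = int(ino_hex, 16)
print(ino_dec)  # -> 1099543665719

# Command to run on a client mount (path is a placeholder):
print("find /mnt/cephfs -inum %d" % ino_dec)
```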
Re: [ceph-users] Fwd: [ceph bad performance], can't find a bottleneck
Hi Maged,

Not a big difference in either case. Performance of a 4-node pool with 5x PM863a each is: 4k bs - 33-37k IOPS via krbd with 128 threads, and 42-51k IOPS with 1024 threads (fio numjobs 128-256-512). The same thing happens when we try to increase the rbd workload: 3 rbd images together get the same total IOPS. Dead end & limit )

Thank you!

2018-03-12 21:49 GMT+03:00 Maged Mokhtar:
> Hi,
>
> Try increasing the queue depth from the default 128 to 1024:
>
> rbd map image-XX -o queue_depth=1024
>
> Also, if you run multiple rbd images/fio tests, do you get higher combined
> performance?
>
> Maged
>
> On 2018-03-12 17:16, Sergey Kotov wrote:
>
> Dear moderator, i subscribed to the ceph list today, could you please post
> my message?
>
> -- Forwarded message --
> From: Sergey Kotov
> Date: 2018-03-06 10:52 GMT+03:00
> Subject: [ceph bad performance], can't find a bottleneck
> To: ceph-users@lists.ceph.com
> Cc: Alexey Zhitenev, Anna Anikina <anik...@gmail.com>
>
> Good day.
>
> Can you please help us find the bottleneck in our ceph installations?
> We have 3 SSD-only clusters with different HW, but the situation is the
> same - overall i/o between client & ceph is lower than 1/6 of the summary
> performance of all SSDs.
>
> For example, one of our clusters has 4 nodes with Toshiba 2Tb Enterprise
> SSD drives, installed on Ubuntu server 16.04.
> Servers are connected to 10G switches. Latency between nodes is about
> 0.1ms. Ethernet utilisation is low.
>
> # uname -a
> Linux storage01 4.4.0-101-generic #124-Ubuntu SMP Fri Nov 10 18:29:59 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> # ceph osd versions
> {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
> luminous (stable)": 55
> }
>
> When we map an rbd image directly on the storage nodes via krbd,
> performance is not good enough.
> We use fio for testing. Even when we run a randwrite test with 4k block
> size in multi-threaded mode, our drives don't reach utilisation higher
> than 30%, and latency is ok.
> > At the same time iostat tool displays 100% utilisation on /dev/rbdX. > > Also we can't enable rbd_cache, because of using scst iscsi over rbd > mapped images. > > How can we resolve the issue? > > Ceph config: > > [global] > fsid = beX482fX-6a91-46dX-ad22-21a8a2696abX > mon_initial_members = storage01, storage02, storage03 > mon_host = X.Y.Z.1,X.Y.Z.2,X.Y.Z.3 > auth_cluster_required = cephx > auth_service_required = cephx > auth_client_required = cephx > public_network = X.Y.Z.0/24 > filestore_xattr_use_omap = true > osd_pool_default_size = 2 > osd_pool_default_min_size = 1 > osd_pool_default_pg_num = 1024 > osd_journal_size = 10240 > osd_mkfs_type = xfs > filestore_op_threads = 16 > filestore_wbthrottle_enable = False > throttler_perf_counter = False > osd crush update on start = false > > [osd] > osd_scrub_begin_hour = 1 > osd_scrub_end_hour = 6 > osd_scrub_priority = 1 > > osd_enable_op_tracker = False > osd_max_backfills = 1 > osd heartbeat grace = 20 > osd heartbeat interval = 5 > osd recovery max active = 1 > osd recovery max single start = 1 > osd recovery op priority = 1 > osd recovery threads = 1 > osd backfill scan max = 16 > osd backfill scan min = 4 > osd max scrubs = 1 > osd scrub interval randomize ratio = 1.0 > osd disk thread ioprio class = idle > osd disk thread ioprio priority = 0 > osd scrub chunk max = 1 > osd scrub chunk min = 1 > osd deep scrub stride = 1048576 > osd scrub load threshold = 5.0 > osd scrub sleep = 0.1 > > [client] > rbd_cache = false > > > Sample fio tests: > > root@storage04:~# fio --name iops --rw randread --bs 4k --filename > /dev/rbd2 --numjobs 12 --ioengine=libaio --group_reporting > iops: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1 > ... 
> fio-2.2.10 > Starting 12 processes > ^Cbs: 12 (f=12): [r(12)] [1.2% done] [128.4MB/0KB/0KB /s] [32.9K/0/0 iops] > [eta 16m:49s] > fio: terminating on signal 2 > > iops: (groupid=0, jobs=12): err= 0: pid=29812: Sun Feb 11 23:59:19 2018 > read : io=1367.8MB, bw=126212KB/s, iops=31553, runt= 11097msec > slat (usec): min=1, max=59700, avg=375.92, stdev=495.19 > clat (usec): min=0, max=377, avg= 1.12, stdev= 3.16 > lat (usec): min=1, max=59702, avg=377.61, stdev=495.32 > clat percentiles (usec): > | 1.00th=[0], 5.00th=[0], 10.00th=[1], 20.00th=[1], > | 30.00th=[1], 40.00th=[1], 50.00th=[1], 60.00th=[1], > | 70.00th=[1], 80.00th=[1], 90.00th=[1], 95.00th=[2], > | 99.00th=[2], 99.50th=[2], 99.90th=[ 73], 99.95th=[ 78], > | 99.99th=[ 115] > bw (KB /s): min= 8536, max=11944, per=8.33%, avg=10516.45, > stdev=635.32 > lat (usec) : 2=91.74%, 4=7.93%, 10=0.14%, 20=0.09%, 50=0.01% > lat (usec) : 100=0.07%, 250=0.03%, 500=0.01% > cpu : usr=1.32%, sys=3.69%, ctx=329556, majf=0, minf=134 > IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, > >=64=0.0% >
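For reference, Maged's two suggestions above can be combined into one quick experiment. A hedged sketch (image and device names are placeholders, and the `queue_depth` map option requires a reasonably recent krbd):

```shell
# 1) Remap with a deeper krbd queue (default is 128).
rbd unmap /dev/rbd2
rbd map image-XX -o queue_depth=1024

# 2) Run fio against several mapped images in parallel and compare the
#    combined IOPS with a single-image run: if three images together still
#    hit the same total, the bottleneck is not per-device queueing.
for dev in /dev/rbd0 /dev/rbd1 /dev/rbd2; do
    fio --name="iops-$dev" --filename="$dev" --rw=randread --bs=4k \
        --numjobs=12 --ioengine=libaio --iodepth=32 --runtime=60 \
        --time_based --group_reporting &
done
wait
```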
Re: [ceph-users] Object Gateway - Server Side Encryption
On 03/10/2018 12:58 AM, Amardeep Singh wrote:
On Saturday 10 March 2018 02:01 AM, Casey Bodley wrote:
On 03/08/2018 07:16 AM, Amardeep Singh wrote:

Hi,

I am trying to configure server side encryption using a Key Management Service as per the documentation: http://docs.ceph.com/docs/master/radosgw/encryption/

I configured the Keystone/Barbican integration and it's working; I tested it using curl commands. After I configure RadosGW and use boto.s3.connection from Python, or the s3cmd client, an error is thrown:

boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
AccessDenied: Failed to retrieve the actual key, kms-keyid: 616b2ce2-053a-41e3-b51e-0ff53e33cf81 (bucket newbucket, request tx77750-005aa1274b-ac51-uk-west)

In the server-side logs it is getting the token, Barbican authenticates the request and provides the secret URL, but it is unable to serve the key:

22:10:03.940091 7f056f7eb700 15 ceph_armor ret=16
22:10:03.940111 7f056f7eb700 15 supplied_md5=eb1a3227cdc3fedbaec2fe38bf6c044a
22:10:03.940129 7f056f7eb700 20 reading from uk-west.rgw.meta:root:.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1
22:10:03.940138 7f056f7eb700 20 get_system_obj_state: rctx=0x7f056f7e39f0 obj=uk-west.rgw.meta:root:.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 state=0x56540487a5a0 s->prefetch_data=0
22:10:03.940145 7f056f7eb700 10 cache get: name=uk-west.rgw.meta+root+.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 : hit (requested=0x16, cached=0x17)
22:10:03.940152 7f056f7eb700 20 get_system_obj_state: s->obj_tag was set empty
22:10:03.940155 7f056f7eb700 10 cache get: name=uk-west.rgw.meta+root+.bucket.meta.newbucket:ee560b67-c330-4fd0-af50-aefff93735d2.4163.1 : hit (requested=0x11, cached=0x17)
22:10:03.944015 7f056f7eb700 20 bucket quota: max_objects=1638400 max_size=-1
22:10:03.944030 7f056f7eb700 20 bucket quota OK: stats.num_objects=7 stats.size=50
22:10:03.944176 7f056f7eb700 20 Getting KMS encryption key for
key=616b2ce2-053a-41e3-b51e-0ff53e33cf81
22:10:03.944225 7f056f7eb700 20 Requesting secret from barbican url=http://keyserver.rados:5000/v3/auth/tokens
22:10:03.944281 7f056f7eb700 20 sending request to http://keyserver.rados:5000/v3/auth/tokens
22:10:04.405974 7f056f7eb700 20 sending request to http://keyserver.rados:9311/v1/secrets/616b2ce2-053a-41e3-b51e-0ff53e33cf81
22:10:05.519874 7f056f7eb700 5 Failed to retrieve secret from barbican:616b2ce2-053a-41e3-b51e-0ff53e33cf81

It looks like this request is being rejected by barbican. Do you have any logs on the barbican side that might show why?

I only get 2 lines in the barbican logs, and one shows a warning:

22:10:08.255 807 WARNING barbican.api.controllers.secrets [req-091413d2--46e2-be5f-a3e68a480ac9 716dad1b8044459c99fea284dbfc47cc - - default default] Decrypted secret 616b2ce2-053a-41e3-b51e-0ff53e33cf81 requested using deprecated API call.
22:10:08.261 807 INFO barbican.api.middleware.context [req-091413d2--46e2-be5f-a3e68a480ac9 716dad1b8044459c99fea284dbfc47cc - - default default] Processed request: 200 OK - GET http://keyserver.rados:9311/v1/secrets/616b2ce2-053a-41e3-b51e-0ff53e33cf81

Okay, so barbican is returning 200 OK but radosgw is still converting that to EACCES. I'm guessing that's happening in request_key_from_barbican() here: https://github.com/ceph/ceph/blob/master/src/rgw/rgw_crypt.cc#L779 - is it possible the key in barbican is something other than AES256?
22:10:05.519901 7f056f7eb700 5 ERROR: failed to retrieve actual key from key_id: 616b2ce2-053a-41e3-b51e-0ff53e33cf81
22:10:05.519980 7f056f7eb700 2 req 387:1.581432:s3:PUT /encrypted.txt:put_obj:completing
22:10:05.520187 7f056f7eb700 2 req 387:1.581640:s3:PUT /encrypted.txt:put_obj:op status=-13
22:10:05.520193 7f056f7eb700 2 req 387:1.581645:s3:PUT /encrypted.txt:put_obj:http status=403
22:10:05.520206 7f056f7eb700 1 == req done req=0x7f056f7e5190 op status=-13 http_status=403 ==
22:10:05.520225 7f056f7eb700 20 process_request() returned -13
22:10:05.520280 7f056f7eb700 1 civetweb: 0x5654042a1000: 192.168.100.200 - - [02/Mar/2018:22:10:03 +0530] "PUT /encrypted.txt HTTP/1.1" 1 0 - Boto/2.38.0 Python/2.7.12 Linux/4.12.1-041201-generic
22:10:06.116527 7f056e7e9700 20 HTTP_ACCEPT=*/*

The error is thrown from this line: https://github.com/ceph/ceph/blob/master/src/rgw/rgw_crypt.cc#L1063 - I am unable to understand why it's throwing the error.

In ceph.conf the following settings are configured:

[global]
rgw barbican url = http://keyserver.rados:9311
rgw keystone barbican user = rgwcrypt
rgw keystone barbican password = rgwpass
rgw keystone barbican project = service
rgw keystone barbican domain = default
rgw keystone url = http://keyserver.rados:5000
rgw keystone api version = 3
rgw crypt require ssl = false

Can someone help figure out what is missing?

Thanks,
Amar
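For completeness, the client side of SSE-KMS is just two extra headers on the PUT. A hedged sketch (the bucket/object names echo the trace above; the commented boto calls are an illustration, not something tested against this setup):

```python
# SSE-KMS is requested per object with two headers; the key id must be a
# Barbican secret id that the rgw barbican user is allowed to read.
def sse_kms_headers(key_id):
    return {
        'x-amz-server-side-encryption': 'aws:kms',
        'x-amz-server-side-encryption-aws-kms-key-id': key_id,
    }

headers = sse_kms_headers('616b2ce2-053a-41e3-b51e-0ff53e33cf81')
print(headers['x-amz-server-side-encryption'])  # -> aws:kms

# Usage with boto2 (as in the civetweb trace above) would look roughly like:
#   import boto, boto.s3.connection
#   conn = boto.connect_s3(aws_access_key_id=..., aws_secret_access_key=...,
#                          host='rgw.example',  # placeholder endpoint
#                          calling_format=boto.s3.connection.OrdinaryCallingFormat())
#   bucket = conn.get_bucket('newbucket')
#   key = bucket.new_key('encrypted.txt')
#   key.set_contents_from_string('secret data', headers=headers)
```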
[ceph-users] SSD as DB/WAL performance with/without drive write cache
Hi all,

I've tested some new Samsung SM863 960GB and Intel DC S4600 240GB SSDs using the method described at Sebastien Han's blog:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

The first thing stated there is to disable the drive's write cache, which I did.

For the Samsungs I got these results:

1 Job: 85 MB/s
5 Jobs: 179 MB/s
10 Jobs: 179 MB/s

I was curious what the results would be with the drive write cache on, so I turned it on. Now I got these results:

1 Job: 49 MB/s
5 Jobs: 110 MB/s
10 Jobs: 132 MB/s

I didn't expect these results to be worse; I would have assumed a drive write cache would make writes faster, not slower.

For the Intels I reached more or less the same conclusion (with different figures), but the performance with drive write cache was about half the performance without it.

Questions:

1) Is this expected behaviour (for all/most SSDs)? If yes, why?
2) Is this only with this type of test?
3) Should I always disable drive write cache for SSDs during boot?
4) Is there any negative side-effect of disabling the drive's write cache?
5) Are these tests still relevant for DB/WAL devices? The blog is written for Filestore and states all journal writes are sequential, but is that also true for BlueStore DB/WAL writes? Do I need to test differently for DB/WAL?

Kind regards,
Caspar
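On question 3, the cache toggle itself can be scripted around the fio run, so both states are measured back to back. A hedged sketch using hdparm (the device name is a placeholder, and the fio run destroys data on that device):

```shell
# Show the current volatile write cache state, then disable it.
hdparm -W /dev/sdX
hdparm -W0 /dev/sdX

# Single-job O_DSYNC journal-style write test, as in the blog post.
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test

# Re-enable the cache and repeat the fio run to compare.
hdparm -W1 /dev/sdX
```

Note that hdparm's `-W` setting is not necessarily persistent across reboots, which is why some deployments toggle it from a boot-time script.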
Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
Hallo!

> Discards appear like they are being sent to the device. How big of a
> temporary file did you create and then delete? Did you sync the file to
> disk before deleting it? What version of qemu-kvm are you running?

I made several tests with commands like the following (issuing sync after each operation):

dd if=/dev/zero of=/tmp/fileTest bs=1M count=200 oflag=direct

What I see is that if I repeat the command with count<=200 the size does not increase. Let's try now with count>200:

NAME                                        PROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736      15360M 2284M

dd if=/dev/zero of=/tmp/fileTest bs=1M count=750 oflag=direct
dd if=/dev/zero of=/tmp/fileTest2 bs=1M count=750 oflag=direct
sync

NAME                                        PROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736      15360M 2528M

rm /tmp/fileTest*
sync
sudo fstrim -v /
/: 14.1 GiB (15145271296 bytes) trimmed

NAME                                        PROVISIONED  USED
volume-80838a69-e544-47eb-b981-a4786be89736      15360M 2528M

As for qemu-kvm, the guest OS is CentOS 7, with:

[centos@testcentos-deco3 tmp]$ rpm -qa | grep qemu
qemu-guest-agent-2.8.0-2.el7.x86_64

while the host is Ubuntu 16 with:

root@pa1-r2-s10:/home/ubuntu# dpkg -l | grep qemu
ii qemu-block-extra:amd64 1:2.8+dfsg-3ubuntu2.9~cloud1 amd64 extra block backend modules for qemu-system and qemu-utils
ii qemu-kvm 1:2.8+dfsg-3ubuntu2.9~cloud1 amd64 QEMU Full virtualization
ii qemu-system-common 1:2.8+dfsg-3ubuntu2.9~cloud1 amd64 QEMU full system emulation binaries (common files)
ii qemu-system-x86 1:2.8+dfsg-3ubuntu2.9~cloud1 amd64 QEMU full system emulation binaries (x86)
ii qemu-utils 1:2.8+dfsg-3ubuntu2.9~cloud1 amd64 QEMU utilities

Thanks!
Fulvio
Re: [ceph-users] Issue with fstrim and Nova hw_disk_discard=unmap
Hallo Jason,

thanks for your feedback!

> Original Message
>> * decorated a CentOS image with hw_scsi_model=virtio--scsi, hw_disk_bus=scsi
>
> Is that just a typo for "hw_scsi_model"?

Yes, it was a typo when I wrote my message. The image has virtio-scsi, as it should.

>> I see that the commands:
>>
>> rbd --cluster cephpa1 diff cinder-ceph/${theVol} | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
>> rados --cluster cephpa1 -p cinder-ceph ls | grep rbd_data.{whatever} | wc -l
>
> That's pretty old-school -- you can just use 'rbd du' now to calculate the
> disk usage.

Good to know, thanks!

>> show the size increases but does not decrease when I delete the temporary
>> file and execute "sudo fstrim -v /"
>
> Have you verified that your VM is indeed using virtio-scsi? Does blktrace
> show SCSI UNMAP operations being issued to the block device when you
> execute "fstrim"?

Thanks for the tip; I think I need some more help, please. The disk on my VM is indeed /dev/sda rather than /dev/vda. The libvirt XML (tags mangled by the list archive) shows the rbd source name='cinder-ceph/volume-80838a69-e544-47eb-b981-a4786be89736', the serial 80838a69-e544-47eb-b981-a4786be89736 and the device address (function='0x0').

As for blktrace, blkparse shows me tons of lines; please find below the first ones, and one of the many groups of lines which I see:

8,0 0 11 4.333917112 24677 Q FWFSM 8406583 + 4 [fstrim]
8,0 0 12 4.333919649 24677 G FWFSM 8406583 + 4 [fstrim]
8,0 0 13 4.333920695 24677 P N [fstrim]
8,0 0 14 4.333922965 24677 I FWFSM 8406583 + 4 [fstrim]
8,0 0 15 4.333924575 24677 U N [fstrim] 1
8,0 0 20 4.340140041 24677 Q D 986016 + 2097152 [fstrim]
8,0 0 21 4.340144908 24677 G D 986016 + 2097152 [fstrim]
8,0 0 22 4.340145561 24677 P N [fstrim]
8,0 0 24 4.340147495 24677 Q D 3083168 + 1112672 [fstrim]
8,0 0 25 4.340149772 24677 G D 3083168 + 1112672 [fstrim]
...
8,0 0 50 4.340556955 24677 Q D 665880 + 20008 [fstrim]
8,0 0 51 4.340558481 24677 G D 665880 + 20008 [fstrim]
8,0 0 52 4.340558728 24677 P N [fstrim]
8,0 0 53 4.340559725 24677 I D 665880 + 20008 [fstrim]
8,0 0 54 4.340560292 24677 U N [fstrim] 1
8,0 0 55 4.340560801 24677 D D 665880 + 20008 [fstrim]
...

Apologies for my ignorance, but is the above enough to understand whether SCSI UNMAP operations are being issued?

Thanks a lot!
Fulvio
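The excerpt above does appear to show what Jason asked for: in blkparse's default output the sixth column is the action ('D' = issued to driver) and the seventh the RWBS field ('D' = discard), so the "D D" lines are discards actually dispatched to the device. A small sketch that filters them out of a capture (field positions assume blkparse's default format):

```python
# blkparse default field order:
#   dev cpu seq time pid action rwbs sector + nsectors [process]
# Action 'D' means "issued to driver"; RWBS containing 'D' means discard.
def issued_discards(blkparse_output):
    """Return (start_sector, n_sectors) for each discard issued to the device."""
    hits = []
    for line in blkparse_output.splitlines():
        f = line.split()
        if len(f) >= 10 and f[5] == "D" and "D" in f[6]:
            hits.append((int(f[7]), int(f[9])))
    return hits

sample = """8,0 0 54 4.340560292 24677 U N [fstrim] 1
8,0 0 55 4.340560801 24677 D D 665880 + 20008 [fstrim]"""
print(issued_discards(sample))  # -> [(665880, 20008)]
```

An empty result from a full fstrim capture would mean discards were queued (Q/G/I) but never dispatched.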
Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
On Mon, Mar 12, 2018 at 8:20 PM, Maged Mokhtar wrote:
> On 2018-03-12 21:00, Ilya Dryomov wrote:
> On Mon, Mar 12, 2018 at 7:41 PM, Maged Mokhtar wrote:
> On 2018-03-12 14:23, David Disseldorp wrote:
> On Fri, 09 Mar 2018 11:23:02 +0200, Maged Mokhtar wrote:
>
> 2) I understand that before switching the path, the initiator will send a
> TMF ABORT; can we pass this down to the same abort_request() function
> in osd_client that is used for osd_request_timeout expiry?
>
> IIUC, the existing abort_request() codepath only cancels the I/O on the
> client/gw side. A TMF ABORT successful response should only be sent if
> we can guarantee that the I/O is terminated at all layers below, so I
> think this would have to be implemented via an additional OSD epoch
> barrier or similar.
>
> Cheers, David
>
> Hi David,
>
> I was thinking we would get the block request, then loop down to all its
> osd requests and cancel those using the same osd request cancel function.
>
> All that function does is tear down OSD client / messenger data
> structures associated with the OSD request. Any OSD request that hit
> the TCP layer may eventually get through to the OSDs.
>
> Thanks,
> Ilya
>
> Hi Ilya,
>
> OK.. so I guess this also applies to osd_request_timeout expiry: it is
> not guaranteed to stop all stale I/Os.

Yes. The purpose of osd_request_timeout is to unblock the client side by failing the I/O on the client side. It doesn't attempt to stop any in-flight I/O -- it simply marks it as failed.

Thanks,
Ilya
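For readers who want the client-side failing behaviour Ilya describes, it is set per mapping. A sketch (image name and timeout value are placeholders; availability of the option depends on the kernel version):

```shell
# Fail (not abort) OSD requests client-side after 30 seconds, so a stuck
# path unblocks the initiator. In-flight requests that already reached the
# TCP layer may still land on the OSDs eventually, as discussed above.
rbd map image-XX -o osd_request_timeout=30
```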
Re: [ceph-users] New Ceph cluster design
Hello,

On Sat, 10 Mar 2018 16:14:53 +0100 Vincent Godin wrote:

> Hi,
>
> As i understand it, you'll have one RAID1 of two SSDs for 12 HDDs. A
> WAL is used for all writes on your host.

This isn't filestore; AFAIK the WAL/DB will be used for small writes only, to keep latency with Bluestore akin to filestore levels. Large writes will go directly to the HDDs. However, each write will of course necessitate a write to the DB, and thus IOPS (much more so than bandwidth) are paramount here.

> If you have good SSDs, they
> can handle 450-550 MBps. Your 12 SATA HDDs can handle 12 x 100 MBps,
> that is to say 1200 GBps.

Aside from what I wrote above, I'd like to repeat myself and others here for the umpteenth time: focusing on bandwidth is a fallacy in nearly all use cases; IOPS tend to become the bottleneck. Also, that's 1.2GB/s or 1200MB/s.

The OP stated 10TB HDDs and many (but not exclusively?) small objects, so if we're looking at lots of small writes, the bandwidth of the SSDs becomes a factor again, and with the sizes involved they appear too small as well (going with the rough ratio of 10GB per TB). Either a RAID1 of at least 1600GB NVMes, or 2x 800GB NVMes with a resulting failure domain of 6 HDDs, would be a better/safer fit.

> So your RAID 1 will be the bottleneck with
> this design. A good design would be to have one SSD for 4 or 5 HDDs. In
> your case, the best option would be to start with 3 SSDs for 12 HDDs
> to have a balanced node. Don't forget to choose SSDs with a high DWPD
> rating (>10)

More SSDs/NVMes are of course better, and DWPD is important, but probably less so than with filestore journals. A DWPD of >10 is overkill for anything I've ever encountered; for many things 3 will be fine, especially if one knows what is expected.
For example, a filestore cache tier SSD with inline journal (800GB DC S3610, 3 DWPD) has a media wearout of 97 (3% used) after 2 years with this constant and not insignificant load:

---
Device:  rrqm/s  wrqm/s  r/s   w/s     rkB/s   wkB/s    avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb      0.03    83.09   7.07  303.24  746.64  5084.99  37.59     0.05      0.15   0.71     0.13     0.06   2.00
---

300 write IOPS and 5MB/s for all that time.

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
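Christian's rough 10 GB-of-DB-per-TB-of-data ratio is easy to turn into a sizing check. A small sketch (the ratio is the rule of thumb from this thread, not an official BlueStore figure):

```python
def db_space_needed_gb(hdd_count, hdd_tb, gb_per_tb=10):
    """Rough BlueStore DB sizing: ~10 GB of DB space per TB of HDD."""
    return hdd_count * hdd_tb * gb_per_tb

need = db_space_needed_gb(12, 10)  # the OP's 12 x 10 TB HDDs
print(need)  # -> 1200 (GB of DB space)

# A single mirrored pair of ~960 GB SSDs falls short of 1200 GB; two 800 GB
# NVMes (6 HDDs each) or a mirrored pair of 1600 GB+ devices would fit,
# matching the failure-domain suggestion above.
```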