Re: [ceph-users] WAL/DB size
Thanks, interesting reading. Distilling the discussion there, below are my takeaways:

1) The spillover phenomenon, and thus the small number of discrete sizes that are effective without being wasteful, is recognized.
2) "I don't think we should plan the block.db size based on the rocksdb stairstep pattern. A better solution would be to tweak the rocksdb level sizes at mkfs time based on the block.db size!"
3) Neither 1) nor 2) was actually acted upon, so we got arbitrary guidance based on a calculation of the number of metadata objects, with no input from, or action upon, how the DB actually behaves?

Am I interpreting correctly?

> Btw, the original discussion leading to the 4% recommendation is here:
> https://github.com/ceph/ceph/pull/23210
>
> --
> Paul Emmerich
>
>> 30gb already includes WAL, see
>> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>>
>> On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>>>
>>> Good points in both posts, but I think there's still some unclarity.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
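The stair-step pattern behind points 1) and 2) can be sketched numerically: bluefs only keeps a rocksdb level on the fast device if the whole level fits, so usable block.db capacity climbs in steps. This is a rough illustration under assumed defaults (max_bytes_for_level_base = 256 MB, level multiplier = 10), not the exact sizing logic; the real values come from the rocksdb options the OSD is configured with.

```python
# Cumulative space needed to hold rocksdb levels L1..Ln entirely on the
# fast device, in GiB. Any block.db size between two steps is wasted,
# because the next level spills to the slow device as a whole.
def useful_db_sizes_gb(base_mb=256, multiplier=10, levels=4):
    sizes, total, level = [], 0, base_mb
    for _ in range(levels):
        total += level
        sizes.append(total / 1024)  # MB -> GiB
        level *= multiplier
    return sizes

# Steps land near the oft-quoted ~3 / ~30 / ~300 GB guidance.
print(useful_db_sizes_gb())
```

With these assumed defaults the steps come out at roughly 0.25, 2.75, 28 and 278 GiB, which is where the "only a few discrete sizes are effective" observation comes from.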
Re: [ceph-users] WAL/DB size
Btw, the original discussion leading to the 4% recommendation is here:
https://github.com/ceph/ceph/pull/23210

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Aug 15, 2019 at 11:23 AM Vitaliy Filippov wrote:
>
> 30gb already includes WAL, see
> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>
> On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>>
>> Good points in both posts, but I think there's still some unclarity.
>>
>> Absolutely let's talk about DB and WAL together. By "bluestore goes on
>> flash" I assume you mean WAL+DB?
>>
>> "Simply allocate DB and WAL will appear there automatically"
>>
>> Forgive me please if this is obvious, but I'd like to see a holistic
>> explanation of WAL and DB sizing *together*, which I think would help
>> folks put these concepts together and plan deployments with some sense
>> of confidence.
>>
>> We've seen good explanations on the list of why only specific DB sizes,
>> say 30GB, are actually used _for the DB_.
>> If the WAL goes along with the DB, shouldn't we also explicitly
>> determine an appropriate size N for the WAL and make the partition
>> (30+N) GB? If so, how do we derive N? Or is it a constant?
>>
>> Filestore was so much simpler: 10GB, set and forget, for the journal.
>> Not that I miss XFS, mind you.
>>
>> Actually, a standalone WAL is required when you have either a very
>> small fast device (and don't want the DB to use it) or three devices
>> of different performance behind an OSD (e.g. HDD, SSD, NVMe), in which
>> case the WAL should be located on the fastest one. For the given use
>> case you just have HDD and NVMe, and DB and WAL can safely co-locate.
>> This means you don't need to allocate a specific volume for the WAL,
>> and hence no need to answer the question of how much space is needed
>> for it. Simply allocate the DB and the WAL will appear there
>> automatically.
>>> Yes, I'm surprised how often people talk about the DB and WAL
>>> separately for no good reason. In common setups bluestore goes on
>>> flash and the storage goes on the HDDs; simple.
>>>
>>> In the event the flash is 100s of GB and would otherwise be wasted,
>>> is there anything that needs to be done to get rocksdb to use the
>>> highest level? 600 GB, I believe.

> --
> With best regards,
> Vitaliy Filippov
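One rough way to verify on a live OSD that the DB (and co-located WAL) actually fit on the fast device is to look at the bluefs counters in `ceph daemon osd.N perf dump`. A minimal sketch, assuming the Nautilus-era counter names `db_used_bytes` and `slow_used_bytes` (check the names against your version's output):

```python
import json

# Sketch (not authoritative): detect block.db spillover onto the slow
# device from `ceph daemon osd.N perf dump` JSON. Non-zero
# slow_used_bytes means bluefs has overflowed onto the main device.
def db_spillover(perf_dump_json):
    bluefs = json.loads(perf_dump_json)["bluefs"]
    return {
        "db_used_gib": bluefs["db_used_bytes"] / 2**30,
        "spilled_gib": bluefs["slow_used_bytes"] / 2**30,
        "spillover": bluefs["slow_used_bytes"] > 0,
    }

# Hypothetical sample: 3 GiB of DB in use, nothing spilled to slow.
sample = '{"bluefs": {"db_used_bytes": 3221225472, "slow_used_bytes": 0}}'
print(db_spillover(sample))
```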
Re: [ceph-users] MDSs report damaged metadata
Hi Paul,

thank you for your help. But I get the following error:

# ceph tell mds.mds3 scrub start "~mds0/stray7/15161f7/dovecot.index.backup" repair
2019-08-16 13:29:40.208 7f7e927fc700  0 client.881878 ms_handle_reset on v2:192.168.16.23:6800/176704036
2019-08-16 13:29:40.240 7f7e937fe700  0 client.867786 ms_handle_reset on v2:192.168.16.23:6800/176704036
{
    "return_code": -116
}

Lars

Fri, 16 Aug 2019 13:17:08 +0200, Paul Emmerich ==> Lars Täuber:
> Hi,
>
> damage_type backtrace is rather harmless and can indeed be repaired
> with the repair command, but it's called scrub_path.
> Also, you need to pass the name and not the rank of the MDS as the id;
> it should be
>
> # (on the server where the MDS is actually running)
> ceph daemon mds.mds3 scrub_path ...
>
> But you should also be able to use ceph tell since Nautilus, which is
> a little easier because it can be run from any node:
>
> ceph tell mds.mds3 scrub start 'PATH' repair
>
> Paul

--
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23
10117 Berlin
Tel.: +49 30 20370-352
http://www.bbaw.de
Re: [ceph-users] MDSs report damaged metadata
Hi,

damage_type backtrace is rather harmless and can indeed be repaired
with the repair command, but it's called scrub_path.
Also, you need to pass the name and not the rank of the MDS as the id;
it should be

# (on the server where the MDS is actually running)
ceph daemon mds.mds3 scrub_path ...

But you should also be able to use ceph tell since Nautilus, which is a
little easier because it can be run from any node:

ceph tell mds.mds3 scrub start 'PATH' repair

Paul

--
Paul Emmerich

On Fri, Aug 16, 2019 at 8:40 AM Lars Täuber wrote:
>
> Hi all!
>
> The MDS of our Ceph cluster produces a HEALTH_ERR state.
> It is Nautilus 14.2.2 on Debian Buster, installed from the repo made
> by croit.io, with OSDs on bluestore.
>
> The symptom:
> # ceph -s
>   cluster:
>     health: HEALTH_ERR
>             1 MDSs report damaged metadata
>
>   services:
>     mon: 3 daemons, quorum mon1,mon2,mon3 (age 2d)
>     mgr: mon3(active, since 2d), standbys: mon2, mon1
>     mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
>     osd: 30 osds: 30 up (since 17h), 29 in (since 19h)
>
>   data:
>     pools:   3 pools, 1153 pgs
>     objects: 435.21k objects, 806 GiB
>     usage:   4.7 TiB used, 162 TiB / 167 TiB avail
>     pgs:     1153 active+clean
>
> # ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata
> MDS_DAMAGE 1 MDSs report damaged metadata
>     mdsmds3(mds.0): Metadata damage detected
>
> # ceph tell mds.0 damage ls
> 2019-08-16 07:20:09.415 7f1254ff9700  0 client.840758 ms_handle_reset on v2:192.168.16.23:6800/176704036
> 2019-08-16 07:20:09.431 7f1255ffb700  0 client.840764 ms_handle_reset on v2:192.168.16.23:6800/176704036
> [
>     {
>         "damage_type": "backtrace",
>         "id": 3760765989,
>         "ino": 1099518115802,
>         "path": "~mds0/stray7/15161f7/dovecot.index.backup"
>     }
> ]
>
> I tried this without much luck:
> # ceph daemon mds.0 "~mds0/stray7/15161f7/dovecot.index.backup" recursive repair
> admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
>
> Is there a way out of this error?
>
> Thanks and best regards,
> Lars
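When there are many damaged entries, the JSON from `damage ls` can be filtered mechanically before feeding each path to `ceph tell mds.<name> scrub start "<path>" repair`. A sketch, not authoritative, assuming the output format shown in this thread:

```python
import json

# Extract paths with repairable "backtrace" damage from
# `ceph tell mds.<name> damage ls` JSON output.
def backtrace_paths(damage_ls_json):
    return [d["path"] for d in json.loads(damage_ls_json)
            if d["damage_type"] == "backtrace"]

# Sample taken from the `damage ls` output earlier in this thread.
sample = '''[{"damage_type": "backtrace", "id": 3760765989,
              "ino": 1099518115802,
              "path": "~mds0/stray7/15161f7/dovecot.index.backup"}]'''
print(backtrace_paths(sample))
```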
Re: [ceph-users] pgs inconsistent
On 15.08.2019 16:38, huxia...@horebdata.cn wrote:
> Dear folks,
>
> I had a Ceph cluster with replication 2: 3 nodes, each node with 3
> OSDs, on Luminous 12.2.12. Some days ago I had one OSD down (the disk
> is still fine) due to a rocksdb crash. I tried to restart that OSD but
> failed. So I tried to rebalance, but encountered inconsistent PGs.
> What can I do to make the cluster work again?
>
> Thanks a lot for helping me out,
> Samuel
>
> # ceph -s
>   cluster:
>     id: 289e3afa-f188-49b0-9bea-1ab57cc2beb8
>     health: HEALTH_ERR
>             pauserd,pausewr,noout flag(s) set
>             191444 scrub errors
>             Possible data damage: 376 pgs inconsistent
>
>   services:
>     mon: 3 daemons, quorum horeb71,horeb72,horeb73
>     mgr: horeb73(active), standbys: horeb71, horeb72
>     osd: 9 osds: 8 up, 8 in
>          flags pauserd,pausewr,noout
>
>   data:
>     pools:   1 pools, 1024 pgs
>     objects: 524.29k objects, 1.99TiB
>     usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
>     pgs:     645 active+clean
>              376 active+clean+inconsistent
>              3   active+clean+scrubbing+deep

That is a lot of inconsistent PGs.

When you say replication = 2, do you mean you have 2 copies as in
size=3 min_size=2, or that you have size=2 min_size=1? The reason I ask
is that min_size=1 is a well-known way to get into lots of problems:
one disk can accept a write alone, and before it is
recovered/backfilled that drive can die.

If you have min_size=1, I would recommend you set min_size=2 as the
first step, to avoid creating more inconsistency while troubleshooting.
If you have the space for it in the cluster, you should also set
size=3.

If you run "# ceph health detail" you will get a list of the PGs that
are inconsistent. Check whether there is a repeat-offender OSD in that
list of PGs, and check that disk for issues: look at dmesg and the logs
of the OSD, and check whether there are SMART errors.

You can try to repair the inconsistent PGs automagically by running the
command "# ceph pg repair [pg id]", but make sure the hardware is good
first.
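The repeat-offender check above can be mechanized: gather the acting OSD set for each inconsistent PG (e.g. from "ceph health detail" plus "ceph pg map <pgid>") and count how often each OSD appears. A sketch with hypothetical data:

```python
from collections import Counter

# Given a mapping of inconsistent PG -> acting OSDs, count how often
# each OSD appears. A single OSD dominating the list suggests its disk
# is the one to check for errors.
def repeat_offenders(pg_acting):
    counts = Counter(osd for osds in pg_acting.values() for osd in osds)
    return counts.most_common()

# Hypothetical mapping of inconsistent PGs to their acting OSDs;
# here osd.3 appears in every inconsistent PG.
inconsistent = {"1.a": [3, 7], "1.1f": [3, 5], "1.2c": [3, 1]}
print(repeat_offenders(inconsistent))
```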
Good luck,
Ronny