Re: [ceph-users] WAL/DB size

2019-08-16 Thread Anthony D'Atri
Thanks — interesting reading.

Distilling the discussion there, below are my takeaways.

1) The spillover phenomenon, and thus the small number of discrete block.db sizes that 
are effective without being wasteful, is recognized.

2) "I don't think we should plan the block.db size based on the rocksdb 
stairstep pattern. A better solution would be to tweak the rocksdb level sizes 
at mkfs time based on the block.db size!"

3) Neither 1) nor 2) was actually acted upon, so we got arbitrary guidance 
based on a calculation of the number of metadata objects, with no input from or 
action upon how the DB actually behaves?


Am I interpreting correctly?
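
For anyone who wants to confirm whether spillover is actually happening on a given 
OSD, a quick check (a sketch; osd.0 is just a placeholder, and the BLUEFS_SPILLOVER 
health warning assumes Nautilus or later):

# ceph health detail
(reports BLUEFS_SPILLOVER once DB metadata overflows onto the slow device)

# ceph daemon osd.0 perf dump bluefs
(compare db_used_bytes with slow_used_bytes; a non-zero slow_used_bytes means spillover)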


> Btw, the original discussion leading to the 4% recommendation is here:
> https://github.com/ceph/ceph/pull/23210
> 
> 
> -- 
> Paul Emmerich
> 
> 
>> 30 GB already includes the WAL, see 
>> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>> 
>> On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>>> 
>>> Good points in both posts, but I think there’s still some unclarity.



Re: [ceph-users] WAL/DB size

2019-08-16 Thread Paul Emmerich
Btw, the original discussion leading to the 4% recommendation is here:
https://github.com/ceph/ceph/pull/23210


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Aug 15, 2019 at 11:23 AM Виталий Филиппов  wrote:
>
> 30 GB already includes the WAL, see 
> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>
> On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>>
>> Good points in both posts, but I think there’s still some unclarity.
>>
>> Absolutely let’s talk about DB and WAL together.  By “bluestore goes on 
>> flash” I assume you mean WAL+DB?
>>
>> “Simply allocate DB and WAL will appear there automatically”
>>
>> Forgive me please if this is obvious, but I’d like to see a holistic 
>> explanation of WAL and DB sizing *together*, which I think would help folks 
>> put these concepts together and plan deployments with some sense of 
>> confidence.
>>
>> We’ve seen good explanations on the list of why only specific DB sizes, say 
>> 30GB, are actually used _for the DB_.
>> If the WAL goes along with the DB, shouldn’t we also explicitly determine an 
>> appropriate size N for the WAL, and make the partition (30+N) GB?
>> If so, how do we derive N?  Or is it a constant?
>>
>> Filestore was so much simpler, 10GB set+forget for the journal.  Not that I 
>> miss XFS, mind you.
>>
>>
>>>> Actually a standalone WAL is required when you have either a very small fast
>>>> device (and don't want the DB to use it) or three devices of different
>>>> performance behind the OSD (e.g. HDD, SSD, NVMe). In that case the WAL should
>>>> be located on the fastest one.
>>>>
>>>> For the given use case you just have HDD and NVMe, so DB and WAL can
>>>> safely collocate. That means you don't need to allocate a specific volume
>>>> for the WAL, and hence no need to answer the question of how much space is
>>>> needed for it. Simply allocate the DB and the WAL will land there automatically.
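
To illustrate that point, a minimal provisioning sketch (device paths are 
hypothetical; when only --block.db is given, ceph-volume keeps the WAL inside 
the DB volume):

# ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
(no --block.wal argument: the WAL lives in block.db, so a single ~30 GB 
partition covers both)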


>>> Yes, I'm surprised how often people talk about the DB and WAL separately
>>> for no good reason.  In common setups bluestore goes on flash and the
>>> storage goes on the HDDs; simple.
>>>
>>> In the event the flash is hundreds of GB and would otherwise be wasted, is
>>> there anything that needs to be done to get rocksdb to use the highest
>>> level?  600 GB, I believe.
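
On the "highest level" question, one way to see what a given OSD's rocksdb is 
actually running with (a sketch; osd.0 is a placeholder):

# ceph daemon osd.0 config get bluestore_rocksdb_options

If max_bytes_for_level_base / max_bytes_for_level_multiplier are not set in that 
string, rocksdb falls back to its defaults (256 MB base, 10x multiplier), which is 
what produces the often-cited ~3/30/300 GB useful block.db sizes. Note that 
overriding bluestore_rocksdb_options replaces the entire default string, not just 
the keys you add.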
>>
>> 
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> With best regards,
> Vitaliy Filippov


Re: [ceph-users] MDSs report damaged metadata

2019-08-16 Thread Lars Täuber
Hi Paul,

thank you for your help. But I get the following error:

# ceph tell mds.mds3 scrub start 
"~mds0/stray7/15161f7/dovecot.index.backup" repair
2019-08-16 13:29:40.208 7f7e927fc700  0 client.881878 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
2019-08-16 13:29:40.240 7f7e937fe700  0 client.867786 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
{
"return_code": -116
}
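
(For anyone searching the archives later: return code -116 corresponds to ESTALE, 
"Stale file handle", on Linux. A quick way to decode such values, as a sketch:

# python3 -c 'import errno, os; print(errno.errorcode[116], os.strerror(116))'
ESTALE Stale file handle
)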



Lars


Fri, 16 Aug 2019 13:17:08 +0200
Paul Emmerich  ==> Lars Täuber  :
> Hi,
> 
> damage_type backtrace is rather harmless and can indeed be repaired
> with the repair command, but it's called scrub_path.
> Also, you need to pass the name and not the rank of the MDS as the id; it should be:
> 
> # (on the server where the MDS is actually running)
> ceph daemon mds.mds3 scrub_path ...
> 
> Since Nautilus you should also be able to use ceph tell, which is a
> little easier because it can be run from any node:
> 
> ceph tell mds.mds3 scrub start 'PATH' repair
> 
> 
> Paul
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de


Re: [ceph-users] MDSs report damaged metadata

2019-08-16 Thread Paul Emmerich
Hi,

damage_type backtrace is rather harmless and can indeed be repaired
with the repair command, but it's called scrub_path.
Also, you need to pass the name and not the rank of the MDS as the id; it should be:

# (on the server where the MDS is actually running)
ceph daemon mds.mds3 scrub_path ...

Since Nautilus you should also be able to use ceph tell, which is a
little easier because it can be run from any node:

ceph tell mds.mds3 scrub start 'PATH' repair
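
As a follow-up check (a sketch, assuming the repair goes through): the damaged 
entry should disappear from the damage list, and a lingering entry can be cleared 
by the id shown in damage ls (3760765989 in the listing below):

# ceph tell mds.mds3 damage ls
# ceph tell mds.mds3 damage rm 3760765989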


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Aug 16, 2019 at 8:40 AM Lars Täuber  wrote:
>
> Hi all!
>
> The mds of our ceph cluster produces a health_err state.
> It is Nautilus 14.2.2 on Debian Buster, installed from the repo made by 
> croit.io, with OSDs on BlueStore.
>
> The symptom:
> # ceph -s
>   cluster:
> health: HEALTH_ERR
> 1 MDSs report damaged metadata
>
>   services:
> mon: 3 daemons, quorum mon1,mon2,mon3 (age 2d)
> mgr: mon3(active, since 2d), standbys: mon2, mon1
> mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
> osd: 30 osds: 30 up (since 17h), 29 in (since 19h)
>
>   data:
> pools:   3 pools, 1153 pgs
> objects: 435.21k objects, 806 GiB
> usage:   4.7 TiB used, 162 TiB / 167 TiB avail
> pgs: 1153 active+clean
>
>
> # ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsmds3(mds.0): Metadata damage detected
>
> # ceph tell mds.0 damage ls
> 2019-08-16 07:20:09.415 7f1254ff9700  0 client.840758 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> 2019-08-16 07:20:09.431 7f1255ffb700  0 client.840764 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> [
> {
> "damage_type": "backtrace",
> "id": 3760765989,
> "ino": 1099518115802,
> "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> }
> ]
>
>
>
> I tried this without much luck:
> # ceph daemon mds.0 "~mds0/stray7/15161f7/dovecot.index.backup" recursive 
> repair
> admin_socket: exception getting command descriptions: [Errno 2] No such file 
> or directory
>
>
> Is there a way out of this error?
>
> Thanks and best regards,
> Lars


Re: [ceph-users] pgs inconsistent

2019-08-16 Thread Ronny Aasen

On 15.08.2019 16:38, huxia...@horebdata.cn wrote:

Dear folks,

I had a Ceph cluster with replication 2, 3 nodes, each node with 3 OSDs, 
on Luminous 12.2.12. Some days ago I had one OSD go down (the disk itself is 
still fine) due to a rocksdb crash. I tried to restart that OSD but failed. 
So I tried to rebalance, but encountered inconsistent PGs.


what can i do to make the cluster working again?

thanks a lot for helping me out

Samuel

**
# ceph -s
   cluster:
     id:     289e3afa-f188-49b0-9bea-1ab57cc2beb8
     health: HEALTH_ERR
             pauserd,pausewr,noout flag(s) set
             191444 scrub errors
             Possible data damage: 376 pgs inconsistent
   services:
     mon: 3 daemons, quorum horeb71,horeb72,horeb73
     mgr: horeb73(active), standbys: horeb71, horeb72
     osd: 9 osds: 8 up, 8 in
          flags pauserd,pausewr,noout
   data:
     pools:   1 pools, 1024 pgs
     objects: 524.29k objects, 1.99TiB
     usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
     pgs:     645 active+clean
              376 active+clean+inconsistent
              3   active+clean+scrubbing+deep



That is a lot of inconsistent PGs. When you say replication = 2, do you 
mean you require 2 copies as in size=3 min_size=2, or that you have size=2 
min_size=1?


The reason I ask is that min_size=1 is a well-known way to get into lots 
of problems: a single disk can accept a write alone, and before that write is 
recovered/backfilled the drive can die.


If you have min_size=1 I would recommend you set min_size=2 as the first 
step, to avoid creating more inconsistency while troubleshooting. If you 
have the space for it in the cluster you should also set size=3. The 
relevant commands are sketched below.
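
A sketch of the relevant commands (the pool name is a placeholder; check the 
current values before changing anything):

# ceph osd pool get <poolname> size
# ceph osd pool get <poolname> min_size
# ceph osd pool set <poolname> min_size 2
# ceph osd pool set <poolname> size 3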


If you run "# ceph health detail" you will get a list of the PGs that 
are inconsistent. Check whether there is a repeat-offender OSD in that list 
of PGs, and check that disk for issues: check dmesg, the logs of the OSD, 
and whether there are SMART errors. A sketch of useful commands follows.
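
For example (a sketch; pool name, PG id and device are placeholders):

# ceph health detail | grep inconsistent
# rados list-inconsistent-pg <poolname>
# rados list-inconsistent-obj <pgid> --format=json-pretty
# dmesg -T | grep -i error
# smartctl -a /dev/sdX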


You can try to repair the inconsistent PGs automagically by running the 
command "# ceph pg repair [pg id]", but make sure the hardware is good 
first. A sketch for doing this in bulk follows below.
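
A minimal bulk-repair sketch (it assumes the "pg ... is active+clean+inconsistent" 
lines from ceph health detail; with size=2 repair has historically favored the 
primary copy, so verify the primary OSD's disk is healthy first):

# ceph health detail | awk '/is active\+clean\+inconsistent/ {print $2}' | \
    while read pg; do ceph pg repair "$pg"; done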



good luck
Ronny

