Re: [ceph-users] Consumer-grade SSD in Ceph

2020-01-03 Thread Reed Dier
Also, just for more diversity, Samsung has the 883 DCT and the 860 DCT models as well.
Both are rated at less than 1 DWPD, but they are enterprise-rated drives.
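
As an aside, if you are weighing sub-1-DWPD drives, it is worth checking how much your workload actually writes; reading the drives' lifetime-write/wear SMART attributes is a quick sanity check (attribute names vary by vendor and model, so treat this as a sketch):

$ sudo smartctl -A /dev/sda | egrep -i 'Total_LBAs_Written|Wear_Leveling|Percent_Lifetime|Media_Wearout'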

Reed

> On Jan 3, 2020, at 2:10 AM, Eneko Lacunza  wrote:
> 
> I'm sure you know also the following, but just in case:
> - Intel SATA D3-S4610 (I think they're out of stock right now)
> - Intel SATA D3-S4510 (I see stock of these right now)
> 
> El 27/12/19 a las 17:56, vita...@yourcmc.ru escribió:
>> SATA: Micron 5100-5200-5300, Seagate Nytro 1351/1551 (don't forget to 
>> disable their cache with hdparm -W 0)
>> 
>> NVMe: Intel P4500, Micron 9300
>> 
>>> Thanks for all the replies. In summary; consumer grade SSD is a no go.
>>> 
>>> What is an alternative to SM863a? Since it is quite hard to get these
>>> due to them being non-stock.
> 
> 
> -- 
> Zuzendari Teknikoa / Director Técnico
> Binovo IT Human Project, S.L.
> Telf. 943569206
> Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
> www.binovo.es
> 





Re: [ceph-users] Local Device Health PG inconsistent

2019-10-02 Thread Reed Dier
And now to fill in the full circle.

Sadly my solution was to run
> $ ceph pg repair 33.0
which returned
> 2019-10-02 15:38:54.499318 osd.12 (osd.12) 181 : cluster [DBG] 33.0 repair 
> starts
> 2019-10-02 15:38:55.502606 osd.12 (osd.12) 182 : cluster [ERR] 33.0 repair : 
> stat mismatch, got 264/265 objects, 0/0 clones, 264/265 dirty, 264/265 omap, 
> 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 manifest 
> objects, 0/0 hit_set_archive bytes.
> 2019-10-02 15:38:55.503066 osd.12 (osd.12) 183 : cluster [ERR] 33.0 repair 1 
> errors, 1 fixed
And now my cluster is happy once more.

So, in case anyone else runs into this issue and doesn't think to run pg repair on the PG in question: in this case, go for it.
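
If you want to double-check that the repair actually cleared things, the same commands from earlier in the thread should now come back clean (the PG id is obviously specific to my cluster):

$ ceph health detail
$ rados list-inconsistent-obj 33.0 | jq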

Reed

> On Sep 23, 2019, at 9:07 AM, Reed Dier  wrote:
> 
> And to come full circle,
> 
> After this whole saga, I now have a scrub error on the new device health 
> metrics pool/PG in what looks to be the exact same way.
> So I am at a loss for what ever it is that I am doing incorrectly, as a scrub 
> error obviously makes the monitoring suite very happy.
> 
>> $ ceph health detail
> 
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 33.0 is active+clean+inconsistent, acting [12,138,15]
>> $ rados list-inconsistent-pg device_health_metrics
>> ["33.0"]
>> $ rados list-inconsistent-obj 33.0 | jq
>> {
>>   "epoch": 176348,
>>   "inconsistents": []
>> }
> 
> I assume that this is the root cause:
>> ceph.log.5.gz:2019-09-18 11:12:16.466118 osd.138 (osd.138) 154 : cluster 
>> [WRN] bad locator @33 on object @33 op osd_op(client.1769585636.0:466 33.0 
>> 33:b08b92bdhead [omap-set-vals] snapc 0=[] 
>> ondisk+write+known_if_redirected e176327) v8
>> ceph.log.1.gz:2019-09-22 20:41:44.937841 osd.12 (osd.12) 53 : cluster [DBG] 
>> 33.0 scrub starts
>> ceph.log.1.gz:2019-09-22 20:41:45.000638 osd.12 (osd.12) 54 : cluster [ERR] 
>> 33.0 scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 
>> 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 
>> manifest objects, 0/0 hit_set_archive bytes.
>> ceph.log.1.gz:2019-09-22 20:41:45.000643 osd.12 (osd.12) 55 : cluster [ERR] 
>> 33.0 scrub 1 errors
> 
> Nothing fancy set for the plugin:
>> $ ceph config dump | grep device
>> global  basicdevice_failure_prediction_mode local
>>   mgr   advanced mgr/devicehealth/enable_monitoring true
> 
> 
> Reed
> 
>> On Sep 18, 2019, at 11:33 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>> And to provide some further updates,
>> 
>> I was able to get OSDs to boot by updating from 14.2.2 to 14.2.4.
>> Unclear why this would improve things, but it at least got me running again.
>> 
>>> $ ceph versions
>>> {
>>> "mon": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 3
>>> },
>>> "mgr": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 3
>>> },
>>> "osd": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 199,
>>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
>>> nautilus (stable)": 5
>>> },
>>> "mds": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 1
>>> },
>>> "overall": {
>>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>>> nautilus (stable)": 206,
>>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
>>> nautilus (stable)": 5
>>> }
>>> }
>> 
>> 
>> Reed
>> 
>>> On Sep 18, 2019, at 10:12 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>>> 
>>> To answer the question, if it is safe to disable the module and delete the 
>>> pool, the answer is no.
>>> 
>>> After disabling the diskprediction_local module, I then proceeded to remove 
>>> the pool created by the module, device_health_metrics.
>>> 
>>> This is where things went south quickly,
>>> 
>>> Ceph health showed: 
>>>> Module 'devicehealth' has failed: [errno 2] Failed to operate write

Re: [ceph-users] Local Device Health PG inconsistent

2019-09-23 Thread Reed Dier
And to come full circle,

After this whole saga, I now have a scrub error on the new device health 
metrics pool/PG in what looks to be the exact same way.
So I am at a loss as to whatever it is that I am doing incorrectly, as a scrub 
error obviously makes the monitoring suite very happy.

> $ ceph health detail

> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 33.0 is active+clean+inconsistent, acting [12,138,15]
> $ rados list-inconsistent-pg device_health_metrics
> ["33.0"]
> $ rados list-inconsistent-obj 33.0 | jq
> {
>   "epoch": 176348,
>   "inconsistents": []
> }

I assume that this is the root cause:
> ceph.log.5.gz:2019-09-18 11:12:16.466118 osd.138 (osd.138) 154 : cluster 
> [WRN] bad locator @33 on object @33 op osd_op(client.1769585636.0:466 33.0 
> 33:b08b92bdhead [omap-set-vals] snapc 0=[] 
> ondisk+write+known_if_redirected e176327) v8
> ceph.log.1.gz:2019-09-22 20:41:44.937841 osd.12 (osd.12) 53 : cluster [DBG] 
> 33.0 scrub starts
> ceph.log.1.gz:2019-09-22 20:41:45.000638 osd.12 (osd.12) 54 : cluster [ERR] 
> 33.0 scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 
> 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 
> manifest objects, 0/0 hit_set_archive bytes.
> ceph.log.1.gz:2019-09-22 20:41:45.000643 osd.12 (osd.12) 55 : cluster [ERR] 
> 33.0 scrub 1 errors

Nothing fancy set for the plugin:
> $ ceph config dump | grep device
> global  basicdevice_failure_prediction_mode local
>   mgr   advanced mgr/devicehealth/enable_monitoring true


Reed

> On Sep 18, 2019, at 11:33 AM, Reed Dier  wrote:
> 
> And to provide some further updates,
> 
> I was able to get OSDs to boot by updating from 14.2.2 to 14.2.4.
> Unclear why this would improve things, but it at least got me running again.
> 
>> $ ceph versions
>> {
>> "mon": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>> nautilus (stable)": 3
>> },
>> "mgr": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>> nautilus (stable)": 3
>> },
>> "osd": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>> nautilus (stable)": 199,
>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
>> nautilus (stable)": 5
>> },
>> "mds": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>> nautilus (stable)": 1
>> },
>> "overall": {
>> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
>> nautilus (stable)": 206,
>> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
>> nautilus (stable)": 5
>> }
>> }
> 
> 
> Reed
> 
>> On Sep 18, 2019, at 10:12 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>> To answer the question, if it is safe to disable the module and delete the 
>> pool, the answer is no.
>> 
>> After disabling the diskprediction_local module, I then proceeded to remove 
>> the pool created by the module, device_health_metrics.
>> 
>> This is where things went south quickly,
>> 
>> Ceph health showed: 
>>> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for 
>>> oid SAMSUNG_$MODEL_$SERIAL
>> 
>> That module apparently can't be disabled:
>>> $ ceph mgr module disable devicehealth
>>> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
>> 
>> Then 5 osd's went down, crashing with:
>>>-12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 
>>> pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] 
>>> local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 
>>> les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 
>>> lpr=176304 pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 
>>> peering m=17 mbc={}] enter Started/Primary/Peering/WaitUpThru
>>>-11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 
>>> ms_handle_reset con 0x564078474d00 session 0x56407878ea00
>>>-10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: 
>>> handle_auth_request added challenge on 0x564077ac1b00
>>> -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
>>> handle_auth_request added challenge on 0x564077ac3180
>>> -8&g

Re: [ceph-users] Local Device Health PG inconsistent

2019-09-18 Thread Reed Dier
And to provide some further updates,

I was able to get OSDs to boot by updating from 14.2.2 to 14.2.4.
Unclear why this would improve things, but it at least got me running again.

> $ ceph versions
> {
> "mon": {
> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
> nautilus (stable)": 3
> },
> "mgr": {
> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
> nautilus (stable)": 3
> },
> "osd": {
> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
> nautilus (stable)": 199,
> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
> nautilus (stable)": 5
> },
> "mds": {
> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
> nautilus (stable)": 1
> },
> "overall": {
> "ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) 
> nautilus (stable)": 206,
> "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) 
> nautilus (stable)": 5
> }
> }


Reed

> On Sep 18, 2019, at 10:12 AM, Reed Dier  wrote:
> 
> To answer the question, if it is safe to disable the module and delete the 
> pool, the answer is no.
> 
> After disabling the diskprediction_local module, I then proceeded to remove 
> the pool created by the module, device_health_metrics.
> 
> This is where things went south quickly,
> 
> Ceph health showed: 
>> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for 
>> oid SAMSUNG_$MODEL_$SERIAL
> 
> That module apparently can't be disabled:
>> $ ceph mgr module disable devicehealth
>> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)
> 
> Then 5 osd's went down, crashing with:
>>-12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 
>> pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] 
>> local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 
>> les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 lpr=176304 
>> pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 peering m=17 
>> mbc={}] enter Started/Primary/Peering/WaitUpThru
>>-11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 ms_handle_reset 
>> con 0x564078474d00 session 0x56407878ea00
>>-10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: 
>> handle_auth_request added challenge on 0x564077ac1b00
>> -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
>> handle_auth_request added challenge on 0x564077ac3180
>> -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
>> handle_auth_request added challenge on 0x564077ac3600
>> -7> 2019-09-18 10:53:00.307 7f95950ae700 -1 
>> bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) 
>> Directory not empty not handled on operation 21 (op 1, counting from 0)
>> -6> 2019-09-18 10:53:00.307 7f95950ae700  0 _dump_transaction 
>> transaction dump:
>> {
>> "ops": [
>> {
>> "op_num": 0,
>> "op_name": "remove",
>> "collection": "30.0_head",
>> "oid": "#30:head#"
>> },
>> {
>> "op_num": 1,
>> "op_name": "rmcoll",
>> "collection": "30.0_head"
>> }
>> ]
>> }
>> -5> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 
>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering 
>> m=32 mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
>> -4> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
>> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
>> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
>> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 
>> lpr=176304 pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering 
>> m=32 mbc={}] enter Started/Primary/Peering/GetMissing
>> -3> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
>> pg[17.353( v 176300'586919 lc 17

Re: [ceph-users] Local Device Health PG inconsistent

2019-09-18 Thread Reed Dier
To answer the question of whether it is safe to disable the module and delete the pool: the answer is no.

After disabling the diskprediction_local module, I then proceeded to remove the 
pool created by the module, device_health_metrics.

This is where things went south quickly,

Ceph health showed: 
> Module 'devicehealth' has failed: [errno 2] Failed to operate write op for 
> oid SAMSUNG_$MODEL_$SERIAL

That module apparently can't be disabled:
> $ ceph mgr module disable devicehealth
> Error EINVAL: module 'devicehealth' cannot be disabled (always-on)

Then 5 OSDs went down, crashing with:
>-12> 2019-09-18 10:53:00.299 7f95940ac700  5 osd.5 pg_epoch: 176304 
> pg[17.3d1( v 176297'568491 lc 176269'568471 (175914'565388,176297'568491] 
> local-lis/les=176302/176303 n=107092 ec=11397/11397 lis/c 176302/172990 
> les/c/f 176303/172991/107766 176304/176304/176304) [5,81,162] r=0 lpr=176304 
> pi=[172990,176304)/1 crt=176297'568491 lcod 0'0 mlcod 0'0 peering m=17 
> mbc={}] enter Started/Primary/Peering/WaitUpThru
>-11> 2019-09-18 10:53:00.303 7f959fd6f700  2 osd.5 176304 ms_handle_reset 
> con 0x564078474d00 session 0x56407878ea00
>-10> 2019-09-18 10:53:00.303 7f95b10e6700 10 monclient: 
> handle_auth_request added challenge on 0x564077ac1b00
> -9> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
> handle_auth_request added challenge on 0x564077ac3180
> -8> 2019-09-18 10:53:00.307 7f95b10e6700 10 monclient: 
> handle_auth_request added challenge on 0x564077ac3600
> -7> 2019-09-18 10:53:00.307 7f95950ae700 -1 
> bluestore(/var/lib/ceph/osd/ceph-5) _txc_add_transaction error (39) Directory 
> not empty not handled on operation 21 (op 1, counting from 0)
> -6> 2019-09-18 10:53:00.307 7f95950ae700  0 _dump_transaction transaction 
> dump:
> {
> "ops": [
> {
> "op_num": 0,
> "op_name": "remove",
> "collection": "30.0_head",
> "oid": "#30:head#"
> },
> {
> "op_num": 1,
> "op_name": "rmcoll",
> "collection": "30.0_head"
> }
> ]
> }
> -5> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] exit Started/Primary/Peering/GetLog 0.023847 2 0.000123
> -4> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] enter Started/Primary/Peering/GetMissing
> -3> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] exit Started/Primary/Peering/GetMissing 0.19 0 0.00
> -2> 2019-09-18 10:53:00.311 7f95948ad700  5 osd.5 pg_epoch: 176304 
> pg[17.353( v 176300'586919 lc 176207'586887 (175912'583861,176300'586919] 
> local-lis/les=176302/176303 n=107009 ec=11397/11397 lis/c 176302/176285 
> les/c/f 176303/176286/107766 176304/176304/176304) [5,167,137] r=0 lpr=176304 
> pi=[176285,176304)/1 crt=176300'586919 lcod 0'0 mlcod 0'0 peering m=32 
> mbc={}] enter Started/Primary/Peering/WaitUpThru
> -1> 2019-09-18 10:53:00.315 7f95950ae700 -1 
> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In function 'void 
> BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
> ObjectStore::Transaction*)' thread 7f95950ae700 time 2019-09-18 
> 10:53:00.312755
> /build/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: 
> ceph_abort_msg("unexpected error")


Of the 5 OSDs now down, 3 of them are the serving OSDs for pg 30.0 (which has now been erased):

> OSD_DOWN 5 osds down
> osd.5 is down
> osd.12 is down
> osd.128 is down
> osd.183 is down
> osd.190 is down


But osd.190 and osd.5 were never acting members for that PG, so I have no clue why they are implicated.
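
For reference, checking which PGs a given down OSD was actually serving looks something like this (though with the pool already deleted, the mapping for 30.0 itself is gone):

$ ceph pg ls-by-osd 5 | head
$ ceph pg ls-by-osd 190 | head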


I re-enabled

[ceph-users] Local Device Health PG inconsistent

2019-09-12 Thread Reed Dier
Trying to narrow down a strange issue with the single PG for the device_health_metrics pool, which was created when I enabled the 'diskprediction_local' module in the ceph-mgr: the PG keeps getting flagged inconsistent, but I never see any inconsistent objects in the PG.

> $ ceph health detail
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 30.0 is active+clean+inconsistent, acting [128,12,183]

> $ rados list-inconsistent-pg device_health_metrics
> ["30.0"]

> $ rados list-inconsistent-obj 30.0 | jq
> {
>   "epoch": 172979,
>   "inconsistents": []
> }

This is the most recent log message from osd.128, from the last deep scrub:
> 2019-09-12 18:07:19.436 7f977744a700 -1 log_channel(cluster) log [ERR] : 30.0 
> deep-scrub : stat mismatch, got 237/238 objects, 0/0 clones, 237/238 dirty, 
> 237/238 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 
> manifest objects, 0/0 hit_set_archive bytes.

Here is a pg query on the one PG:
https://pastebin.com/bnzVKd6t 

The data I have collected hasn't been useful at all, and I don't particularly 
care if I lose it, so would it be feasible (i.e. no bad effects) to just disable 
the disk prediction module, delete the pool, and then start over, letting it 
create a new pool for itself?
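
Concretely, the steps I am asking about would be something like the following (assuming mon_allow_pool_delete is enabled; I am not running anything until I hear back):

$ ceph mgr module disable diskprediction_local
$ ceph osd pool rm device_health_metrics device_health_metrics --yes-i-really-really-mean-it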

Thanks,

Reed



Re: [ceph-users] iostat and dashboard freezing

2019-09-12 Thread Reed Dier
> 1. Multi-root. You should deprecate your 'ssd' root and move your osds of 
> this root to 'default' root.
> 
I would love to deprecate the multi-root layout, and may try to do just that in my 
next OSD add; I am just worried about data shuffling unnecessarily.
Would this, in theory, help my distribution across disparate OSD topologies?

> 2. Some of your OSD's are reweighted, for example 
> osd.44,osd.50,osd.57,osd.60,osd.102,osd.107. For proper upmap work all osds 
> should be not reweighted.
> 
This was done to reclaim space, because the balancer was not finding further 
optimizations and also would not run, due to some OSDs being marked as nearfull, 
again because of poor distribution.

Since running with the balancer turned off, I have had very few issues with my MGRs.

Reed

> On Sep 9, 2019, at 11:19 PM, Konstantin Shalygin  wrote:
> 
> On 8/29/19 9:56 PM, Reed Dier wrote:
>> "config/mgr/mgr/balancer/active",
>> "config/mgr/mgr/balancer/max_misplaced",
>> "config/mgr/mgr/balancer/mode",
>> "config/mgr/mgr/balancer/pool_ids",
> This is useless keys, you may to remove it.
> 
> 
>> https://pastebin.com/bXPs28h1
> Issues that you have:
> 
> 1. Multi-root. You should deprecate your 'ssd' root and move your osds of 
> this root to 'default' root.
> 
> 2. Some of your OSD's are reweighted, for example 
> osd.44,osd.50,osd.57,osd.60,osd.102,osd.107. For proper upmap work all osds 
> should be not reweighted.
> 
>>> $ time ceph balancer optimize newplan1
>>> Error EALREADY: Unable to find further optimization, or pool(s)' pg_num is 
>>> decreasing, or distribution is already perfect
>>> 
>>> real3m10.627s
>>> user0m0.352s
>>> sys 0m0.055s
> Set the key `mgr/balancer/upmap_max_iterations` to '2' should decrease this 
> time.
> 
> k
> 
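
For posterity, setting that key would be roughly the following (I haven't tried it on my cluster yet; on Nautilus the `ceph config set mgr ...` form may be the preferred way to do the same thing):

$ ceph config-key set mgr/balancer/upmap_max_iterations 2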





Re: [ceph-users] iostat and dashboard freezing

2019-08-29 Thread Reed Dier
See responses below.

> On Aug 28, 2019, at 11:13 PM, Konstantin Shalygin  wrote:
>> Just a follow up 24h later, and the mgr's seem to be far more stable, and 
>> have had no issues or weirdness after disabling the balancer module.
>> 
>> Which isn't great, because the balancer plays an important role, but after 
>> fighting distribution for a few weeks and getting it 'good enough' I'm 
>> taking the stability.
>> 
>> Just wanted to follow up with another 2¢.
> What is your balancer settings (`ceph config-key ls`)? Your mgr running in 
> virtual environment or on bare metal?

bare metal
>> $ ceph config-key ls | grep balance
>> "config/mgr/mgr/balancer/active",
>> "config/mgr/mgr/balancer/max_misplaced",
>> "config/mgr/mgr/balancer/mode",
>> "config/mgr/mgr/balancer/pool_ids",
>> "mgr/balancer/active",
>> "mgr/balancer/max_misplaced",
>> "mgr/balancer/mode",


> How much pools you have? Please also paste `ceph osd tree` & `ceph osd df 
> tree`. 

$ ceph osd pool ls detail
>> pool 16 replicated crush_rule 1 object_hash rjenkins pg_num 4
>> autoscale_mode warn last_change 157895 lfor 0/157895/157893 flags 
>> hashpspool,nodelete stripe_width 0 application cephfs
>> pool 17 replicated crush_rule 0 object_hash rjenkins pg_num 1024 
>> autoscale_mode warn last_change 174817 flags hashpspool,nodelete 
>> stripe_width 0 compression_algorithm snappy compression_mode aggressive 
>> application cephfs
>> pool 20 replicated crush_rule 2 object_hash rjenkins pg_num 4096 
>> autoscale_mode warn last_change 174817 flags hashpspool,nodelete 
>> stripe_width 0 application freeform
>> pool 24 replicated crush_rule 0 object_hash rjenkins pg_num 16   
>> autoscale_mode warn last_change 174817 lfor 0/157704/157702 flags hashpspool 
>> stripe_width 0 compression_algorithm snappy compression_mode none 
>> application freeform
>> pool 29 replicated crush_rule 2 object_hash rjenkins pg_num 128  
>> autoscale_mode warn last_change 174817 lfor 0/0/142604 flags 
>> hashpspool,selfmanaged_snaps stripe_width 0 application rbd
>> pool 30 replicated crush_rule 0 object_hash rjenkins pg_num 1
>> autoscale_mode warn last_change 174817 flags hashpspool stripe_width 0 
>> pg_num_min 1 application mgr_devicehealth
>> pool 31 replicated crush_rule 2 object_hash rjenkins pg_num 16   
>> autoscale_mode warn last_change 174926 flags hashpspool,selfmanaged_snaps 
>> stripe_width 0 application rbd

https://pastebin.com/bXPs28h1
> Measure time of balancer plan creation: `time ceph balancer optimize new`.
> 
I hadn't seen this optimize command yet; I was always doing balancer eval 
$plan, balancer execute $plan.
>> $ time ceph balancer optimize newplan1
>> Error EALREADY: Unable to find further optimization, or pool(s)' pg_num is 
>> decreasing, or distribution is already perfect
>> 
>> real3m10.627s
>> user0m0.352s
>> sys 0m0.055s
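
For anyone following the thread, the full manual plan workflow looks roughly like this (plan name is arbitrary; a sketch rather than a paste from my shell history):

$ ceph balancer optimize myplan
$ ceph balancer eval myplan
$ ceph balancer show myplan
$ ceph balancer execute myplan
$ ceph balancer rm myplan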

Reed





Re: [ceph-users] iostat and dashboard freezing

2019-08-28 Thread Reed Dier
Just a follow-up 24h later: the mgrs seem to be far more stable, and have 
had no issues or weirdness after disabling the balancer module.

That isn't great, because the balancer plays an important role, but after 
fighting distribution for a few weeks and getting it 'good enough', I'm taking 
the stability.

Just wanted to follow up with another 2¢.

Reed

> On Aug 27, 2019, at 11:53 AM, Reed Dier  wrote:
> 
> Just to further piggyback,
> 
> Probably the most "hard" the mgr seems to get pushed is when the balancer is 
> engaged.
> When trying to eval a pool or cluster, it takes upwards of 30-120 seconds for 
> it to score it, and then another 30-120 seconds to execute the plan, and it 
> never seems to engage automatically.
> 
>> $ time ceph balancer status
>> {
>> "active": true,
>> "plans": [],
>> "mode": "upmap"
>> }
>> 
>> real0m36.490s
>> user0m0.259s
>> sys 0m0.044s
> 
> 
> I'm going to disable mine as well, and see if I can stop waking up to 'No 
> Active MGR.'
> 
> 
> You can see when I lose mgr's because RBD image stats go to 0 until I catch 
> it.
> 
> Thanks,
> 
> Reed
> 
>> On Aug 27, 2019, at 11:24 AM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:
>> 
>> Hi Reed, Lenz, John
>> 
>> I've just tried disabling the balancer, so far ceph-mgr is keeping it's
>> CPU mostly under 20%, even with both the iostat and dashboard back on.
>> 
>> # ceph balancer off
>> 
>> was
>> [root@ceph-s1 backup]# ceph balancer status
>> {
>>"active": true,
>>"plans": [],
>>"mode": "upmap"
>> }
>> 
>> now
>> [root@ceph-s1 backup]# ceph balancer status
>> {
>>"active": false,
>>"plans": [],
>>"mode": "upmap"
>> }
>> 
>> We are using 8:2 erasure encoding across 324 12TB OSD, plus 4 NVMe OSD
>> for a replicated cephfs metadata pool.
>> 
>> let me know if the balancer is your problem too...
>> 
>> best,
>> 
>> Jake
>> 
>> On 8/27/19 3:57 PM, Jake Grimmett wrote:
>>> Yes, the problem still occurs with the dashboard disabled...
>>> 
>>> Possibly relevant, when both the dashboard and iostat plugins are
>>> disabled, I occasionally see ceph-mgr rise to 100% CPU.
>>> 
>>> as suggested by John Hearns, the output of  gstack ceph-mgr when at 100%
>>> is here:
>>> 
>>> http://p.ip.fi/52sV <http://p.ip.fi/52sV>
>>> 
>>> many thanks
>>> 
>>> Jake
>>> 
>>> On 8/27/19 3:09 PM, Reed Dier wrote:
>>>> I'm currently seeing this with the dashboard disabled.
>>>> 
>>>> My instability decreases, but isn't wholly cured, by disabling
>>>> prometheus and rbd_support, which I use in tandem, as the only thing I'm
>>>> using the prom-exporter for is the per-rbd metrics.
>>>> 
>>>>> ceph mgr module ls
>>>>> {
>>>>> "enabled_modules": [
>>>>> "diskprediction_local",
>>>>> "influx",
>>>>> "iostat",
>>>>> "prometheus",
>>>>> "rbd_support",
>>>>> "restful",
>>>>> "telemetry"
>>>>> ],
>>>> 
>>>> I'm on Ubuntu 18.04, so that doesn't corroborate with some possible OS
>>>> correlation.
>>>> 
>>>> Thanks,
>>>> 
>>>> Reed
>>>> 
>>>>> On Aug 27, 2019, at 8:37 AM, Lenz Grimmer <lgrim...@suse.com> wrote:
>>>>> 
>>>>> Hi Jake,
>>>>> 
>>>>> On 8/27/19 3:22 PM, Jake Grimmett wrote:
>>>>> 
>>>>>> That exactly matches what I'm seeing:
>>>>>> 
>>>>>> when iostat is working OK, I see ~5% CPU use by ceph-mgr
>>>>>> and when iostat freezes, ceph-mgr CPU increases to 100%
>>>>> 
>>>>> Does this also occur if the dashboard module is disabled? Just wondering
>>>>> if this is isolatable to the iostat module. Thanks!
>>>>> 
>>>>> Lenz
>>>>> 
>>>>> -- 
>>>>> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
>>>>> GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> MRC Laboratory of Molecular Biology
>> Francis Crick Avenue,
>> Cambridge CB2 0QH, UK.
>> 
> 





Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
Just to further piggyback,

Probably the hardest the mgr seems to get pushed is when the balancer is engaged.
When trying to eval a pool or cluster, it takes upwards of 30-120 seconds to 
score it, and then another 30-120 seconds to execute the plan, and it never 
seems to engage automatically.

> $ time ceph balancer status
> {
> "active": true,
> "plans": [],
> "mode": "upmap"
> }
> 
> real0m36.490s
> user0m0.259s
> sys 0m0.044s


I'm going to disable mine as well, and see if I can stop waking up to 'No 
Active MGR.'


You can see when I lose mgrs because RBD image stats go to 0 until I catch it.

Thanks,

Reed

> On Aug 27, 2019, at 11:24 AM, Jake Grimmett  wrote:
> 
> Hi Reed, Lenz, John
> 
> I've just tried disabling the balancer, so far ceph-mgr is keeping it's
> CPU mostly under 20%, even with both the iostat and dashboard back on.
> 
> # ceph balancer off
> 
> was
> [root@ceph-s1 backup]# ceph balancer status
> {
>"active": true,
>"plans": [],
>"mode": "upmap"
> }
> 
> now
> [root@ceph-s1 backup]# ceph balancer status
> {
>"active": false,
>"plans": [],
>"mode": "upmap"
> }
> 
> We are using 8:2 erasure encoding across 324 12TB OSD, plus 4 NVMe OSD
> for a replicated cephfs metadata pool.
> 
> let me know if the balancer is your problem too...
> 
> best,
> 
> Jake
> 
> On 8/27/19 3:57 PM, Jake Grimmett wrote:
>> Yes, the problem still occurs with the dashboard disabled...
>> 
>> Possibly relevant, when both the dashboard and iostat plugins are
>> disabled, I occasionally see ceph-mgr rise to 100% CPU.
>> 
>> as suggested by John Hearns, the output of  gstack ceph-mgr when at 100%
>> is here:
>> 
>> http://p.ip.fi/52sV
>> 
>> many thanks
>> 
>> Jake
>> 
>> On 8/27/19 3:09 PM, Reed Dier wrote:
>>> I'm currently seeing this with the dashboard disabled.
>>> 
>>> My instability decreases, but isn't wholly cured, by disabling
>>> prometheus and rbd_support, which I use in tandem, as the only thing I'm
>>> using the prom-exporter for is the per-rbd metrics.
>>> 
>>>> ceph mgr module ls
>>>> {
>>>> "enabled_modules": [
>>>> "diskprediction_local",
>>>> "influx",
>>>> "iostat",
>>>> "prometheus",
>>>> "rbd_support",
>>>> "restful",
>>>> "telemetry"
>>>> ],
>>> 
>>> I'm on Ubuntu 18.04, so that doesn't corroborate with some possible OS
>>> correlation.
>>> 
>>> Thanks,
>>> 
>>> Reed
>>> 
>>>> On Aug 27, 2019, at 8:37 AM, Lenz Grimmer <lgrim...@suse.com> wrote:
>>>> 
>>>> Hi Jake,
>>>> 
>>>> On 8/27/19 3:22 PM, Jake Grimmett wrote:
>>>> 
>>>>> That exactly matches what I'm seeing:
>>>>> 
>>>>> when iostat is working OK, I see ~5% CPU use by ceph-mgr
>>>>> and when iostat freezes, ceph-mgr CPU increases to 100%
>>>> 
>>>> Does this also occur if the dashboard module is disabled? Just wondering
>>>> if this is isolatable to the iostat module. Thanks!
>>>> 
>>>> Lenz
>>>> 
>>>> -- 
>>>> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
>>>> GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> 





Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
I'm currently seeing this with the dashboard disabled.

My instability decreases, but isn't wholly cured, by disabling prometheus and 
rbd_support, which I use in tandem, since the only thing I'm using the 
Prometheus exporter for is the per-RBD metrics.

> ceph mgr module ls
> {
> "enabled_modules": [
> "diskprediction_local",
> "influx",
> "iostat",
> "prometheus",
> "rbd_support",
> "restful",
> "telemetry"
> ],

I'm on Ubuntu 18.04, so that doesn't corroborate a possible OS correlation.

Thanks,

Reed

> On Aug 27, 2019, at 8:37 AM, Lenz Grimmer  wrote:
> 
> Hi Jake,
> 
> On 8/27/19 3:22 PM, Jake Grimmett wrote:
> 
>> That exactly matches what I'm seeing:
>> 
>> when iostat is working OK, I see ~5% CPU use by ceph-mgr
>> and when iostat freezes, ceph-mgr CPU increases to 100%
> 
> Does this also occur if the dashboard module is disabled? Just wondering
> if this is isolatable to the iostat module. Thanks!
> 
> Lenz
> 
> -- 
> SUSE Software Solutions Germany GmbH - Maxfeldstr. 5 - 90409 Nuernberg
> GF: Felix Imendörffer, HRB 247165 (AG Nürnberg)
> 





Re: [ceph-users] iostat and dashboard freezing

2019-08-27 Thread Reed Dier
Curious what distro you're running, as I've been having similar issues with 
instability in the mgr as well; curious if there are any similar threads to pull at.

While the iostat command is running, is the active mgr using 100% CPU in top?
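
i.e. something along these lines on the host with the active mgr (just a sketch):

$ ceph mgr dump | jq .active_name
$ top -b -n 1 -p $(pidof ceph-mgr)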

Reed

> On Aug 27, 2019, at 6:41 AM, Jake Grimmett  wrote:
> 
> Dear All,
> 
> We have a new Nautilus (14.2.2) cluster, with 328 OSDs spread over 40 nodes.
> 
> Unfortunately "ceph iostat" spends most of it's time frozen, with
> occasional periods of working normally for less than a minute, then
> freeze again for a couple of minutes, then come back to life, and so so
> on...
> 
> No errors are seen on screen, unless I press CTRL+C when iostat is stalled:
> 
> [root@ceph-s3 ~]# ceph iostat
> ^CInterrupted
> Traceback (most recent call last):
>  File "/usr/bin/ceph", line 1263, in 
>retval = main()
>  File "/usr/bin/ceph", line 1194, in main
>verbose)
>  File "/usr/bin/ceph", line 619, in new_style_command
>ret, outbuf, outs = do_command(parsed_args, target, cmdargs,
> sigdict, inbuf, verbose)
>  File "/usr/bin/ceph", line 593, in do_command
>return ret, '', ''
> UnboundLocalError: local variable 'ret' referenced before assignment
> 
> Observations:
> 
> 1) This problem does not seem to be related to load on the cluster.
> 
> 2) When iostat is stalled the dashboard is also non-responsive, if
> iostat is working, the dashboard also works.
> 
> Presumably the iostat and dashboard problems are due to the same
> underlying fault? Perhaps a problem with the mgr?
> 
> 
> 3) With iostat working, tailing /var/log/ceph/ceph-mgr.ceph-s3.log
> shows:
> 
> 2019-08-27 09:09:56.817 7f8149834700  0 log_channel(audit) log [DBG] :
> from='client.4120202 -' entity='client.admin' cmd=[{"width": 95,
> "prefix": "iostat", "poll": true, "target": ["mgr", ""], "print_header":
> false}]: dispatch
> 
> 4) When iostat isn't working, we see no obvious errors in the mgr log.
> 
> 5) When the dashboard is not working, mgr log sometimes shows:
> 
> 2019-08-27 09:18:18.810 7f813e533700  0 mgr[dashboard]
> [:::10.91.192.36:43606] [GET] [500] [2.724s] [jake] [1.6K]
> /api/health/minimal
> 2019-08-27 09:18:18.887 7f813e533700  0 mgr[dashboard] ['{"status": "500
> Internal Server Error", "version": "3.2.2", "detail": "The server
> encountered an unexpected condition which prevented it from fulfilling
> the request.", "traceback": "Traceback (most recent call last):\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/_cprequest.py\\", line 656,
> in respond\\nresponse.body = self.handler()\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/lib/encoding.py\\", line
> 188, in __call__\\nself.body = self.oldhandler(*args, **kwargs)\\n
> File \\"/usr/lib/python2.7/site-packages/cherrypy/_cptools.py\\", line
> 221, in wrap\\nreturn self.newhandler(innerfunc, *args, **kwargs)\\n
> File \\"/usr/share/ceph/mgr/dashboard/services/exception.py\\", line
> 88, in dashboard_exception_handler\\nreturn handler(*args,
> **kwargs)\\n  File
> \\"/usr/lib/python2.7/site-packages/cherrypy/_cpdispatch.py\\", line 34,
> in __call__\\nreturn self.callable(*self.args, **self.kwargs)\\n
> File \\"/usr/share/ceph/mgr/dashboard/controllers/__init__.py\\", line
> 649, in inner\\nret = func(*args, **kwargs)\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 192, in
> minimal\\nreturn self.health_minimal.all_health()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 51, in
> all_health\\nresult[\'pools\'] = self.pools()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/controllers/health.py\\", line 167, in
> pools\\npools = CephService.get_pool_list_with_stats()\\n  File
> \\"/usr/share/ceph/mgr/dashboard/services/ceph_service.py\\", line 124,
> in get_pool_list_with_stats\\n\'series\': [i for i in
> stat_series]\\nRuntimeError: deque mutated during iteration\\n"}']
> 
> 
> 6) IPV6 is normally disabled on our machines at the kernel level, via
> grubby --update-kernel=ALL --args="ipv6.disable=1"
> 
> This was done as 'disabling ipv6' interfered with the dashboard (giving
> "HEALTH_ERR Module 'dashboard' has failed: error('No socket could be
> created',) we re-enabling ipv6 on the mgr nodes only to fix this.
> 
> 
> Ideas...?
> 
> Should ipv6 be enabled, even if not configured, on all ceph nodes?
> 
> Any ideas on fixing this gratefully received!
> 
> many thanks
> 
> Jake
> 
> -- 
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> 





Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-21 Thread Reed Dier
Just chiming in to say that I too had some issues with backfill_toofull PGs, 
despite no OSDs being in a backfillfull state, albeit there were some 
nearfull OSDs.

I was able to get through it by reweighting down the OSD that was the target 
reported by ceph pg dump | grep 'backfill_toofull'.

This was on 14.2.2.
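
Roughly, the sequence was something like this (a sketch from memory; pick the actual OSD id out of the affected PG's up/acting set, and nudge the reweight down in small steps):

$ ceph pg dump pgs_brief | grep backfill_toofull
$ ceph pg map 40.155
$ ceph osd reweight 66 0.95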

Reed

> On Aug 21, 2019, at 2:50 PM, Vladimir Brik  
> wrote:
> 
> Hello
> 
> After increasing number of PGs in a pool, ceph status is reporting "Degraded 
> data redundancy (low space): 1 pg backfill_toofull", but I don't understand 
> why, because all OSDs seem to have enough space.
> 
> ceph health detail says:
> pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]
> 
> $ ceph pg map 40.155
> osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]
> 
> So I guess Ceph wants to move 40.155 from 66 to 79 (or other way around?). 
> According to "osd df", OSD 66's utilization is 71.90%, OSD 79's utilization 
> is 58.45%. The OSD with least free space in the cluster is 81.23% full, and 
> it's not any of the ones above.
> 
> OSD backfillfull_ratio is 90% (is there a better way to determine this?):
> $ ceph osd dump | grep ratio
> full_ratio 0.95
> backfillfull_ratio 0.9
> nearfull_ratio 0.7
> 
> Does anybody know why a PG could be in the backfill_toofull state if no OSD 
> is in the backfillfull state?
> 
> 
> Vlad





[ceph-users] compat weight reset

2019-08-02 Thread Reed Dier
Hi all,

I am trying to find a simple way that might help me better distribute my data, 
as I wrap up my Nautilus upgrades.

I am currently rebuilding some OSDs with a bigger block.db to prevent BlueFS 
spillover where it isn't difficult to do so, and I'm once again struggling with 
unbalanced distribution, despite having used the upmap balancer.

I recently discovered that my previous usage of the balancer module with 
crush-compat mode before the upmap mode has left some lingering compat weight 
sets, which I believe may account for my less than stellar distribution, as I 
now have 2-3 weightings fighting against each other (upmap balancer, compat 
weight set, reweight). Below is a snippet showing the compat differing.

> $ ceph osd crush tree
> ID  CLASS WEIGHT(compat)  TYPE NAME
> -5543.70700  42.70894 chassis node2425
>  -221.85399  20.90097 host node24
>   0   hdd   7.28499   7.75699 osd.0
>   8   hdd   7.28499   6.85500 osd.8
>  16   hdd   7.28499   6.28899 osd.16
>  -321.85399  21.80797 host node25
>   1   hdd   7.28499   7.32899 osd.1
>   9   hdd   7.28499   7.24399 osd.9
>  17   hdd   7.28499   7.23499 osd.17

So my main question is how do I [re]set the compat value, to match the weight, 
so that the upmap balancer can more precisely balance the data?

It looks like I may have two options, with 
> ceph osd crush weight-set reweight-compat {name} {weight}
or
> ceph osd crush weight-set rm-compat

I assume the first would be to manage a single device/host/chassis/etc and the 
latter would nuke all compat values across the board?
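
If nuking the compat weight set wholesale is the right call, I assume the sequence would be something like this (untested on my side as of yet):

$ ceph osd crush weight-set ls
$ ceph osd crush weight-set rm-compat
$ ceph balancer eval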

And in looking at this, I started poking at my tunables, and I have no clue how 
to interpret the values, nor what I believe they should be.

> $ ceph osd crush show-tunables
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 1,
> "chooseleaf_stable": 0,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 22,
> "profile": "firefly",
> "optimal_tunables": 0,
> "legacy_tunables": 0,
> "minimum_required_version": "hammer",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 0,
> "require_feature_tunables3": 1,
> "has_v3_rules": 0,
> "has_v4_buckets": 1,
> "require_feature_tunables5": 0,
> "has_v5_rules": 0
> }

This is a Jewel -> Luminous -> Mimic -> Nautilus cluster, and pretty much all 
the clients support Jewel/Luminous+ feature sets (the jewel clients are 
kernel cephfs clients, even though they are on recent (4.15-4.18) kernels).
> $ ceph features | grep release
> "release": "luminous",
> "release": "luminous",
> "release": "luminous",
> "release": "jewel",
> "release": "jewel",
> "release": "luminous",
> "release": "luminous",
> "release": "luminous",
> "release": "luminous",

I feel like I should be running optimal tunables, but I believe I am running 
default?
Not sure how much of a difference exists there, or if that will trigger a bunch 
of data movement either.

Hopefully someone will be able to steer me in a positive direction here, and I 
can mostly trigger a single, large data movement and return to a happy, 
balanced cluster once again.

Thanks,

Reed



Re: [ceph-users] How to add 100 new OSDs...

2019-07-24 Thread Reed Dier
Just chiming in to say that this too has been my preferred method for adding 
[large numbers of] OSDs.

Set the norebalance nobackfill flags.
Create all the OSDs, and verify everything looks good.
Make sure my max_backfills, recovery_max_active are as expected.
Make sure everything has peered.
Unset flags and let it run.

One crush map change, one data movement.
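
As a concrete sketch of that flag workflow:

$ ceph osd set norebalance
$ ceph osd set nobackfill
(create and activate all the new OSDs, then verify ceph osd tree / ceph -s and that the PGs have peered)
$ ceph osd unset nobackfill
$ ceph osd unset norebalance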

Reed

> 
> That works, but with newer releases I've been doing this:
> 
> - Make sure cluster is HEALTH_OK
> - Set the 'norebalance' flag (and usually nobackfill)
> - Add all the OSDs
> - Wait for the PGs to peer. I usually wait a few minutes
> - Remove the norebalance and nobackfill flag
> - Wait for HEALTH_OK
> 
> Wido
> 
> 




Re: [ceph-users] Need to replace OSD. How do I find physical disk

2019-07-18 Thread Reed Dier
You can use ceph-volume to get the LV ID

> # ceph-volume lvm list
> 
> == osd.24 ==
> 
>   [block]
> /dev/ceph-edeb727e-c6d3-4347-bfbb-b9ce7f60514b/osd-block-1da5910e-136a-48a7-8cf1-1c265b7b612a
> 
>   type  block
>   osd id24
>   osd fsid  1da5910e-136a-48a7-8cf1-1c265b7b612a
>   db device /dev/nvme0n1p4
>   db uuid   c4939e17-c787-4630-9ec7-b44565ecf845
>   block uuidn8mCnv-PW4n-43R6-I4uN-P1E0-7qDh-I5dslh
>   block device  
> /dev/ceph-edeb727e-c6d3-4347-bfbb-b9ce7f60514b/osd-block-1da5910e-136a-48a7-8cf1-1c265b7b612a
>   devices   /dev/sda
> 
>   [  db]/dev/nvme0n1p4
> 
>   PARTUUID  c4939e17-c787-4630-9ec7-b44565ecf845

And you can then match this against lsblk which should give you the LV

> $ lsblk -a
> NAME  
> MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda   
>   8:00   1.8T  0 disk
> └─ceph--edeb727e--c6d3--4347--bfbb--b9ce7f60514b-osd--block--1da5910e--136a--48a7--8cf1--1c265b7b612a
>  253:60   1.8T  0 lvm
> nvme0n1   
> 259:00 372.6G  0 disk
> ├─nvme0n1p4   
> 259:40  14.9G  0 part

And if the device has just dropped off, which I have seen before, you should be 
able to see that in dmesg

> [Sat May 11 22:56:27 2019] sd 1:0:17:0: attempting task abort! 
> scmd(2d043ad6)
> [Sat May 11 22:56:27 2019] sd 1:0:17:0: [sdr] tag#0 CDB: Inquiry 12 00 00 00 
> 24 00
> [Sat May 11 22:56:27 2019] scsi target1:0:17: handle(0x001b), 
> sas_address(0x500304801f12eca1), phy(33)
> [Sat May 11 22:56:27 2019] scsi target1:0:17: enclosure logical 
> id(0x500304801f12ecbf), slot(17)
> [Sat May 11 22:56:27 2019] scsi target1:0:17: enclosure level(0x), 
> connector name( )
> [Sat May 11 22:56:28 2019] sd 1:0:17:0: device_block, handle(0x001b)
> [Sat May 11 22:56:30 2019] sd 1:0:17:0: device_unblock and setting to 
> running, handle(0x001b)
> [Sat May 11 22:56:30 2019] sd 1:0:17:0: [sdr] Synchronizing SCSI cache
> [Sat May 11 22:56:30 2019] sd 1:0:17:0: [sdr] Synchronize Cache(10) failed: 
> Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
> [Sat May 11 22:56:31 2019] scsi 1:0:17:0: task abort: SUCCESS 
> scmd(2d043ad6)
> [Sat May 11 22:56:31 2019] mpt3sas_cm0: removing handle(0x001b), 
> sas_addr(0x500304801f12eca1)
> [Sat May 11 22:56:31 2019] mpt3sas_cm0: enclosure logical 
> id(0x500304801f12ecbf), slot(17)
> [Sat May 11 22:56:31 2019] mpt3sas_cm0: enclosure level(0x), connector 
> name( )
> [Sat May 11 23:00:57 2019] Buffer I/O error on dev dm-20, logical block 
> 488378352, async page read
> [Sat May 11 23:00:57 2019] Buffer I/O error on dev dm-20, logical block 1, 
> async page read
> [Sat May 11 23:00:58 2019] Buffer I/O error on dev dm-20, logical block 
> 488378352, async page read
> [Sat May 11 23:00:58 2019] Buffer I/O error on dev dm-20, logical block 1, 
> async page read
> 
> # smartctl -a /dev/sdr
> smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-46-generic] (local build)
> Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> Smartctl open device: /dev/sdr failed: No such device

Hopefully that helps.
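
If the drive is still visible to the controller, blinking the slot LED is another way to find it physically (ledctl is from the ledmon package; the device name is whatever ceph-volume/lsblk reported):

$ sudo ledctl locate=/dev/sdr
$ sudo ledctl locate_off=/dev/sdr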

Reed

> On Jul 18, 2019, at 1:11 PM, Paul Emmerich  wrote:
> 
> 
> 
> On Thu, Jul 18, 2019 at 8:10 PM John Petrini wrote:
> Try ceph-disk list
> 
> no, this system is running ceph-volume not ceph-disk because the mountpoints 
> are in tmpfs
> 
> ceph-volume lvm list
> 
> But it looks like the disk is just completely broken and disappeared from the 
> system.
> 
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io 
> 
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90
>  
> 





Re: [ceph-users] Ubuntu 18.04 - Mimic - Nautilus

2019-07-10 Thread Reed Dier
It would appear, by looking at the Ceph repo (https://download.ceph.com/), 
that Nautilus and Mimic are being built for Xenial and Bionic, whereas Luminous 
and Jewel are being built for Trusty and Xenial.

Then from Ubuntu in their main repos, they are publishing Jewel for Xenial, 
Luminous for Bionic, Mimic for Cosmic and Disco, and Nautilus for Eoan.
These packages are not updated at the bleeding edge mind you.

So there should be no issues with package availability between Nautilus and 
Xenial; however, there should also be no impediment to moving from Xenial to 
Bionic.
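
A quick way to check what your configured repos will actually offer on a given Ubuntu release (just a sketch):

$ apt-cache policy ceph-common ceph-osd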

I would expect to see Octopus (and likely P as well) target Bionic and whatever 
the 20.04 F release will be called.

Hope that clarifies things.

Reed

> On Jul 10, 2019, at 2:36 PM, Edward Kalk  wrote:
> 
> Interesting. So is it not good that I am running Ubuntu 16.04 and 14.2.1. ?
> -Ed
> 
>> On Jul 10, 2019, at 1:46 PM, Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>> It does not appear that that page has been updated in a while.
>> 
>> The official Ceph deb repos only include Mimic and Nautilus packages for 
>> 18.04,
>> While the Ubuntu-bionic repos include a Luminous build.
>> 
>> Hope that helps.
>> 
>> Reed
>> 
>>> On Jul 10, 2019, at 1:20 PM, Edward Kalk <ek...@socket.net> wrote:
>>> 
>>> When reviewing: http://docs.ceph.com/docs/master/start/os-recommendations/ I see there is 
>>> no mention of “mimic” or “nautilus”.
>>> What are the OS recommendations for them, specifically nautilus which is 
>>> the one I’m running?
>>> 
>>> Is 18.04 advisable at all?
>>> 
>>> -Ed
>> 
> 





Re: [ceph-users] Ubuntu 18.04 - Mimic - Nautilus

2019-07-10 Thread Reed Dier
It does not appear that that page has been updated in a while.

The official Ceph deb repos only include Mimic and Nautilus packages for 18.04, 
while the Ubuntu bionic repos include a Luminous build.

Hope that helps.

Reed

> On Jul 10, 2019, at 1:20 PM, Edward Kalk  wrote:
> 
> When reviewing: http://docs.ceph.com/docs/master/start/os-recommendations/ 
>  I see there is 
> no mention of “mimic” or “nautilus”.
> What are the OS recommendations for them, specifically nautilus which is the 
> one I’m running?
> 
> Is 18.04 advisable at all?
> 
> -Ed





[ceph-users] Faux-Jewel Client Features

2019-07-02 Thread Reed Dier
Hi all,

Starting to make preparations for Nautilus upgrades from Mimic, and I'm looking 
over my client/session features and trying to fully grasp the situation.

> $ ceph versions
> {
> "mon": {
> "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
> (stable)": 3},
> "mgr": {
> "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
> (stable)": 3},
> "osd": {
> "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
> (stable)": 204},
> "mds": {
> "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
> (stable)": 2},
> "overall": {
> "ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
> (stable)": 212}
> }


> $ ceph features
> {
> "mon": [
> {  "features": "0x3ffddff8ffacfffb",   "release": 
> "luminous", "num": 3   }],
> "mds": [
> {  "features": "0x3ffddff8ffacfffb",   "release": 
> "luminous"  "num": 2   }],
> "osd": [
> {  "features": "0x3ffddff8ffacfffb",   "num": 204 
> }],
> "client": [
> {  "features": "0x7010fb86aa42ada","release": 
> "jewel","num": 4},
> {  "features": "0x7018fb86aa42ada","release": 
> "jewel","num": 1},
> {  "features": "0x3ffddff8eea4fffb",   "release": 
> "luminous", "num": 344  },
> {  "features": "0x3ffddff8eeacfffb",   "release": 
> "luminous", "num": 200  },
> {  "features": "0x3ffddff8ffa4fffb",   "release": 
> "luminous", "num": 49   },
> {  "features": "0x3ffddff8ffacfffb",   "release": 
> "luminous", "num": 213  }],
> "mgr": [
> {  "features": "0x3ffddff8ffacfffb",   "release": 
> "luminous", "num": 3}]
> }

> $ ceph osd dump | grep compat
> require_min_compat_client luminous
> min_compat_client luminous


I flattened the output to make it a bit more vertical scrolling friendly.

Diving into the actual clients with those features:
> # ceph daemon mon.mon1 sessions | grep jewel
> "MonSession(client.1649789192 ip.2:0/3697083337 is open allow *, features 
> 0x7010fb86aa42ada (jewel))",
> "MonSession(client.1656508179 ip.202:0/2664244117 is open allow *, 
> features 0x7018fb86aa42ada (jewel))",
> "MonSession(client.1637479106 ip.250:0/1882319989 is open allow *, 
> features 0x7010fb86aa42ada (jewel))",
> "MonSession(client.1662023903 ip.249:0/3198281565 is open allow *, 
> features 0x7010fb86aa42ada (jewel))",
> "MonSession(client.1658312940 ip.251:0/3538168209 is open allow *, 
> features 0x7010fb86aa42ada (jewel))",

ip.2 is a cephfs kernel client with 4.15.0-51-generic
ip.202 is a krbd client with kernel 4.18.0-22-generic
ip.250 is a krbd client with kernel 4.15.0-43-generic
ip.249 is a krbd client with kernel 4.15.0-45-generic
ip.251 is a krbd client with kernel 4.15.0-45-generic

For the krbd clients, the features are "layering, exclusive-lock".

My min_compat_client and require_min_compat_client are already set to luminous; 
however, I would love some reassurance that I'm not going to run into issues 
with the krbd/kcephfs clients when trying to make use of new features like the 
PG autoscaler, for instance.
I should have full upmap compatibility, as the balancer in upmap mode has been 
functioning and these are relatively recent kernels.

Just looking for some sanity checks to make sure I don't have any surprises for 
these 'jewel' clients come a nautilus rollout.

Appreciate any help.
Thanks,
Reed


Re: [ceph-users] obj_size_info_mismatch error handling

2019-06-06 Thread Reed Dier
Sadly I never discovered anything more.

It ended up clearing up on its own, which was disconcerting, but I resigned 
myself to not making things worse in an attempt to make them better.

I assume someone touched the file in CephFS, which triggered the metadata to be 
updated, and everyone was able to reach consensus.

Wish I had more for you.

Reed

> On Jun 3, 2019, at 7:43 AM, Dan van der Ster  wrote:
> 
> Hi Reed and Brad,
> 
> Did you ever learn more about this problem?
> We currently have a few inconsistencies arriving with the same env
> (cephfs, v13.2.5) and symptoms.
> 
> PG Repair doesn't fix the inconsistency, nor does Brad's omap
> workaround earlier in the thread.
> In our case, we can fix by cp'ing the file to a new inode, deleting
> the inconsistent file, then scrubbing the PG.
> 
> -- Dan
> 
> 
> On Fri, May 3, 2019 at 3:18 PM Reed Dier  wrote:
>> 
>> Just to follow up for the sake of the mailing list,
>> 
>> I had not had a chance to attempt your steps yet, but things appear to have 
>> worked themselves out on their own.
>> 
>> Both scrub errors cleared without intervention, and I'm not sure if it is 
>> the results of that object getting touched in CephFS that triggered the 
>> update of the size info, or if something else was able to clear it.
>> 
>> Didn't see anything relating to the clearing in mon, mgr, or osd logs.
>> 
>> So, not entirely sure what fixed it, but it is resolved on its own.
>> 
>> Thanks,
>> 
>> Reed
>> 
>> On Apr 30, 2019, at 8:01 PM, Brad Hubbard  wrote:
>> 
>> On Wed, May 1, 2019 at 10:54 AM Brad Hubbard  wrote:
>> 
>> 
>> Which size is correct?
>> 
>> 
>> Sorry, accidental discharge =D
>> 
>> If the object info size is *incorrect* try forcing a write to the OI
>> with something like the following.
>> 
>> 1. rados -p [name_of_pool_17] setomapval 10008536718.
>> temporary-key anything
>> 2. ceph pg deep-scrub 17.2b9
>> 3. Wait for the scrub to finish
>> 4. rados -p [name_of_pool_2] rmomapkey 10008536718. temporary-key
>> 
>> If the object info size is *correct* you could try just doing a rados
>> get followed by a rados put of the object to see if the size is
>> updated correctly.
>> 
>> It's more likely the object info size is wrong IMHO.
>> 
>> 
>> On Tue, Apr 30, 2019 at 1:06 AM Reed Dier  wrote:
>> 
>> 
>> Hi list,
>> 
>> Woke up this morning to two PG's reporting scrub errors, in a way that I 
>> haven't seen before.
>> 
>> $ ceph versions
>> {
>>   "mon": {
>>   "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
>> (stable)": 3
>>   },
>>   "mgr": {
>>   "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
>> (stable)": 3
>>   },
>>   "osd": {
>>   "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
>> (stable)": 156
>>   },
>>   "mds": {
>>   "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
>> (stable)": 2
>>   },
>>   "overall": {
>>   "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
>> (stable)": 156,
>>   "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
>> (stable)": 8
>>   }
>> }
>> 
>> 
>> OSD_SCRUB_ERRORS 8 scrub errors
>> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>>   pg 17.72 is active+clean+inconsistent, acting [3,7,153]
>>   pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
>> 
>> 
>> Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:
>> 
>> {
>>   "epoch": 134582,
>>   "inconsistents": [
>>   {
>>   "object": {
>>   "name": "10008536718.",
>>   "nspace": "",
>>   "locator": "",
>>   "snap": "head",
>>   "version": 0
>>   },
>>   "errors": [],
>>   "union_shard_errors": [
>>   "obj_size_info_mismatch"
>>   ],
>>   "shards": [
>>   {
>>   "osd": 7,
>>   "primary": false,
>>   

Re: [ceph-users] performance in a small cluster

2019-05-31 Thread Reed Dier
Is there any other evidence of this?

I have 20 5100 MAX (MTFDDAK1T9TCC) and have not experienced any real issues 
with them.
I would pick my Samsung SM863a's or any of my Intel's over the Micron's, but I 
haven't seen the Micron's cause any issues for me.
For what its worth, they are all FW D0MU027, which is likely more out of date, 
but it is working for me.

However, I would steer people away from the Micron 9100 MAX 
(MTFDHAX1T2MCF-1AN1ZABYY) as an NVMe disk to use for WAL/DB, as I have seen 
performance, and reliability issues with those.
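
If anyone wants to sanity-check a candidate WAL/DB device themselves, the pattern 
that tends to separate the good devices from the bad is single-job O_DSYNC 4k 
writes. A minimal fio sketch, assuming fio is installed; note it writes straight 
to the device, so only point it at a disk you are willing to wipe:

fio --name=waltest --filename=/dev/sdX --direct=1 --sync=1 --rw=write \
    --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting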

Just my 2¢

Reed

> On May 29, 2019, at 12:52 PM, Paul Emmerich  wrote:
> 
> 
> 
> On Wed, May 29, 2019 at 9:36 AM Robert Sander  > wrote:
> Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> > * SSD model? Lots of cheap SSDs simply can't handle more than that
> 
> The customer currently has 12 Micron 5100 1,92TB (Micron_5100_MTFDDAK1)
> SSDs and will get a batch of Micron 5200 in the next days
> 
> And there's your bottleneck ;)
> The Micron 5100 performs horribly in Ceph, I've seen similar performance in 
> another cluster with these disks.
> Basically they max out at around 1000 IOPS and report 100% utilization and 
> feel slow.
> 
> Haven't seen the 5200 yet.
> 
> 
> Paul
>  
> 
> We have identified the performance settings in the BIOS as a major
> factor. Ramping that up we got a remarkable performance increase.
> 
> Regards
> -- 
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de 
> 
> Tel: 030-405051-43
> Fax: 030-405051-19
> 
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] obj_size_info_mismatch error handling

2019-05-03 Thread Reed Dier
Just to follow up for the sake of the mailing list,

I had not had a chance to attempt your steps yet, but things appear to have 
worked themselves out on their own.

Both scrub errors cleared without intervention, and I'm not sure if it is the 
results of that object getting touched in CephFS that triggered the update of 
the size info, or if something else was able to clear it.

Didn't see anything relating to the clearing in mon, mgr, or osd logs.

So, not entirely sure what fixed it, but it is resolved on its own.

Thanks,

Reed

> On Apr 30, 2019, at 8:01 PM, Brad Hubbard  wrote:
> 
> On Wed, May 1, 2019 at 10:54 AM Brad Hubbard  <mailto:bhubb...@redhat.com>> wrote:
>> 
>> Which size is correct?
> 
> Sorry, accidental discharge =D
> 
> If the object info size is *incorrect* try forcing a write to the OI
> with something like the following.
> 
> 1. rados -p [name_of_pool_17] setomapval 10008536718.
> temporary-key anything
> 2. ceph pg deep-scrub 17.2b9
> 3. Wait for the scrub to finish
> 4. rados -p [name_of_pool_2] rmomapkey 10008536718. temporary-key
> 
> If the object info size is *correct* you could try just doing a rados
> get followed by a rados put of the object to see if the size is
> updated correctly.
> 
> It's more likely the object info size is wrong IMHO.
> 
>> 
>> On Tue, Apr 30, 2019 at 1:06 AM Reed Dier  wrote:
>>> 
>>> Hi list,
>>> 
>>> Woke up this morning to two PG's reporting scrub errors, in a way that I 
>>> haven't seen before.
>>> 
>>> $ ceph versions
>>> {
>>>"mon": {
>>>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
>>> mimic (stable)": 3
>>>},
>>>"mgr": {
>>>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
>>> mimic (stable)": 3
>>>},
>>>"osd": {
>>>"ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) 
>>> mimic (stable)": 156
>>>},
>>>"mds": {
>>>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
>>> mimic (stable)": 2
>>>},
>>>"overall": {
>>>"ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) 
>>> mimic (stable)": 156,
>>>"ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) 
>>> mimic (stable)": 8
>>>}
>>> }
>>> 
>>> 
>>> OSD_SCRUB_ERRORS 8 scrub errors
>>> PG_DAMAGED Possible data damage: 2 pgs inconsistent
>>>pg 17.72 is active+clean+inconsistent, acting [3,7,153]
>>>pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]
>>> 
>>> 
>>> Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty 
>>> yields:
>>> 
>>> {
>>>"epoch": 134582,
>>>"inconsistents": [
>>>{
>>>"object": {
>>>"name": "10008536718.",
>>>"nspace": "",
>>>"locator": "",
>>>"snap": "head",
>>>"version": 0
>>>},
>>>"errors": [],
>>>"union_shard_errors": [
>>>"obj_size_info_mismatch"
>>>],
>>>"shards": [
>>>{
>>>"osd": 7,
>>>"primary": false,
>>>"errors": [
>>>"obj_size_info_mismatch"
>>>],
>>>"size": 5883,
>>>"object_info": {
>>>"oid": {
>>>"oid": "10008536718.",
>>>"key": "",
>>>"snapid": -2,
>>>"hash": 1752643257,
>>>"max": 0,
>>>"pool": 17,
>>>"namespace": "&

[ceph-users] obj_size_info_mismatch error handling

2019-04-29 Thread Reed Dier
Hi list,

Woke up this morning to two PG's reporting scrub errors, in a way that I 
haven't seen before.
> $ ceph versions
> {
> "mon": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
> },
> "mgr": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 3
> },
> "osd": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156
> },
> "mds": {
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 2
> },
> "overall": {
> "ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic 
> (stable)": 156,
> "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic 
> (stable)": 8
> }
> }


> OSD_SCRUB_ERRORS 8 scrub errors
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 17.72 is active+clean+inconsistent, acting [3,7,153]
> pg 17.2b9 is active+clean+inconsistent, acting [19,7,16]

Here is what $rados list-inconsistent-obj 17.2b9 --format=json-pretty yields:
> {
> "epoch": 134582,
> "inconsistents": [
> {
> "object": {
> "name": "10008536718.",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 0
> },
> "errors": [],
> "union_shard_errors": [
> "obj_size_info_mismatch"
> ],
> "shards": [
> {
> "osd": 7,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0x",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> }
> },
> {
> "osd": 16,
> "primary": false,
> "errors": [
> "obj_size_info_mismatch"
> ],
> "size": 5883,
> "object_info": {
> "oid": {
> "oid": "10008536718.",
> "key": "",
> "snapid": -2,
> "hash": 1752643257,
> "max": 0,
> "pool": 17,
> "namespace": ""
> },
> "version": "134599'448331",
> "prior_version": "134599'448330",
> "last_reqid": "client.1580931080.0:671854",
> "user_version": 448331,
> "size": 3505,
> "mtime": "2019-04-28 15:32:20.003519",
> "local_mtime": "2019-04-28 15:32:25.991015",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 899,
> "truncate_size": 0,
> "data_digest": "0xf99a3bd3",
> "omap_digest": "0x",
> "expected_object_size": 0,
> 

Re: [ceph-users] SSD Recovery Settings

2019-03-20 Thread Reed Dier
Grafana  is the web frontend for creating the graphs.

InfluxDB  holds the 
time series data that Grafana pulls from.

To collect data, I am using collectd 
 daemons running on each ceph 
node (mon,mds,osd), as this was my initial way of ingesting metrics.
I am also now using the influx plugin in ceph-mgr 
 to have ceph-mgr directly 
report statistics to InfluxDB.
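
In case it saves someone a trip to the docs, enabling that plugin is roughly the 
following (config-key names per the Luminous mgr influx docs; the hostname and 
credentials are placeholders, so double-check against your release):

ceph mgr module enable influx
ceph config-key set mgr/influx/hostname influx.example.com
ceph config-key set mgr/influx/database ceph
ceph config-key set mgr/influx/username ceph
ceph config-key set mgr/influx/password secret
ceph config-key set mgr/influx/interval 30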

I know two other popular methods of collecting data are Telegraf 
 and Prometheus 
, both of which are popular, both of which have 
ceph-mgr plugins as well here  
and here .
Influx Data also has a Grafana like graphing front end Chronograf 
, which some 
prefer to Grafana.

Hopefully that's enough to get you headed in the right direction.
I would recommend not going down the CollectD path, as the project doesn't move 
as quickly as Telegraf and Prometheus, and the majority of the metrics I am 
pulling from these days are provided from the ceph-mgr plugin.

Hope that helps,
Reed

> On Mar 20, 2019, at 11:30 AM, Brent Kennedy  wrote:
> 
> Reed:  If you don’t mind me asking, what was the graphing tool you had in the 
> post?  I am using the ceph health web panel right now but it doesn’t go that 
> deep.
>  
> Regards,
> Brent



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Recovery Settings

2019-03-20 Thread Reed Dier
Not sure what your OSD config looks like.

When I was moving my SSD OSDs from filestore to bluestore (and the NVMe filestore 
journal to an NVMe bluestore block.db), I had an issue where the OSD was 
incorrectly being reported as rotational somewhere in the chain.
Once I overcame that, I had a huge boost in recovery performance (repaving 
OSDs).
Might be something useful in there.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/025039.html 
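
The quick way to spot the mis-detection described in that thread is the OSD 
metadata, for example (the osd id here is just an example):

ceph osd metadata 12 | grep -E 'rotational|bdev_type|db_type'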


Reed

> On Mar 19, 2019, at 11:29 PM, Konstantin Shalygin  wrote:
> 
> 
>> I setup an SSD Luminous 12.2.11 cluster and realized after data had been
>> added that pg_num was not set properly on the default.rgw.buckets.data pool
>> ( where all the data goes ).  I adjusted the settings up, but recovery is
>> going really slow ( like 56-110MiB/s ) ticking down at .002 per log
>> entry(ceph -w).  These are all SSDs on luminous 12.2.11 ( no journal drives
>> ) with a set of 2 10Gb fiber twinax in a bonded LACP config.  There are six
>> servers, 60 OSDs, each OSD is 2TB.  There was about 4TB of data ( 3 million
>> objects ) added to the cluster before I noticed the red blinking lights.
>> 
>>  
>> 
>> I tried adjusting the recovery to:
>> 
>> ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
>> 
>> ceph tell 'osd.*' injectargs '--osd-recovery-max-active 30'
>> 
>>  
>> 
>> Which did help a little, but didn't seem to have the impact I was looking
>> for.  I have used the settings on HDD clusters before to speed things up (
>> using 8 backfills and 4 max active though ).  Did I miss something or is
>> this part of the pg expansion process.  Should I be doing something else
>> with SSD clusters?
>> 
>>  
>> 
>> Regards,
>> 
>> -Brent
>> 
>>  
>> 
>> Existing Clusters:
>> 
>> Test: Luminous 12.2.11 with 3 osd servers, 1 mon/man, 1 gateway ( all
>> virtual on SSD )
>> 
>> US Production(HDD): Jewel 10.2.11 with 5 osd servers, 3 mons, 3 gateways
>> behind haproxy LB
>> 
>> UK Production(HDD): Luminous 12.2.11 with 15 osd servers, 3 mons/man, 3
>> gateways behind haproxy LB
>> 
>> US Production(SSD): Luminous 12.2.11 with 6 osd servers, 3 mons/man, 3
>> gateways behind haproxy LB
> 
> Try to lower `osd_recovery_sleep*` options.
> 
> You can get your current values from ceph admin socket like this:
> 
> ```
> 
> ceph daemon osd.0 config show | jq 'to_entries[] | if 
> (.key|test("^(osd_recovery_sleep)(.*)")) then (.) else empty end'
> 
> ```
> 
> 
> 
> k
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] collectd problems with pools

2019-02-28 Thread Reed Dier
I've been collecting with collectd since Jewel, and experienced the growing pains 
when moving to Luminous, when collectd-ceph needed to be reworked to support it.

It is also worth mentioning that in Luminous+ there is an Influx plugin for 
ceph-mgr that has some per pool statistics.

Reed

> On Feb 28, 2019, at 11:04 AM, Matthew Vernon  wrote:
> 
> Hi,
> 
> On 28/02/2019 17:00, Marc Roos wrote:
> 
>> Should you not be pasting that as an issue on github collectd-ceph? I
>> hope you don't mind me asking, I am also using collectd and dumping the
>> data to influx. Are you downsampling with influx? ( I am not :/ [0])
> 
> It might be "ask collectd-ceph authors nicely" is the answer, but I figured 
> I'd ask here first, since there might be a solution available already.
> 
> Also, given collectd-ceph works currently by asking the various daemons about 
> their perf data, there's not an obvious analogue for pool-related metrics, 
> since there isn't a daemon socket to poke in the same manner.
> 
> We use graphite/carbon as our data store, so no, nothing influx-related 
> (we're trying to get rid of our last few uses of influxdb here).
> 
> Regards,
> 
> Matthew
> 
> 
> 
> -- 
> The Wellcome Sanger Institute is operated by Genome Research Limited, a 
> charity registered in England with number 1021457 and a company registered in 
> England with number 2742969, whose registered office is 215 Euston Road, 
> London, NW1 2BE. ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bionic Upgrade 12.2.10

2019-01-14 Thread Reed Dier
This is because Luminous is not being built for Bionic for whatever reason.
There are some other mailing list entries detailing this.

Right now you have ceph installed from the Ubuntu bionic-updates repo, which 
has 12.2.8, but does not get regular release updates.

This is what I ended up having to do, due to the repo issues, for my ceph nodes 
that were upgraded from Xenial to Bionic, as well as for new nodes installed 
straight on Bionic. Even if you try to use the xenial packages, I imagine you will 
run into issues with libcurl4 vs. libcurl3.
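
If you want apt to keep preferring the Ubuntu packages over anything left behind 
from the ceph.com xenial repo, a pin along these lines should do it (the file name 
and package list are just examples):

cat <<'EOF' | sudo tee /etc/apt/preferences.d/ceph-ubuntu.pref
Package: ceph* librados* librbd* libcephfs* python-rados python-rbd python-cephfs
Pin: release o=Ubuntu
Pin-Priority: 1001
EOF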

Reed

> On Jan 14, 2019, at 12:21 PM, Scottix  wrote:
> 
> https://download.ceph.com/debian-luminous/ 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.3?

2019-01-10 Thread Reed Dier
> Could I suggest building Luminous for Bionic

+1 for Luminous on Bionic.

Ran into issues with bionic upgrades, and had to eventually revert from the 
ceph repos to the Ubuntu repos where they have 12.2.8, which isn’t ideal.

Reed

> On Jan 9, 2019, at 10:27 AM, Matthew Vernon  wrote:
> 
> Hi,
> 
> On 08/01/2019 18:58, David Galloway wrote:
> 
>> The current distro matrix is:
>> 
>> Luminous: xenial centos7 trusty jessie stretch
>> Mimic: bionic xenial centos7
> 
> Thanks for clarifying :)
> 
>> This may have been different in previous point releases because, as Greg
>> mentioned in an earlier post in this thread, the release process has
>> changed hands and I'm still working on getting a solid/bulletproof
>> process documented, in place, and (more) automated.
>> 
>> I wouldn't be the final decision maker but if you think we should be
>> building Mimic packages for Debian (for example), we could consider it.
>> The build process should support it I believe.
> 
> Could I suggest building Luminous for Bionic, and Mimic for Buster, please?
> 
> Regards,
> 
> Matthew
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic 13.2.3?

2019-01-04 Thread Reed Dier
Piggybacking for a +1 on this.
I would really love it if bad packages were recalled, and if packages followed the 
release announcement rather than preceding it.

For anyone wondering, this is the likely changelog for 13.2.3 in case people 
want to know what is in it.

https://github.com/ceph/ceph/pull/25637/commits/fe854ac1729a5353fa646298a0b4550101f9c6b2
 


Reed

> On Jan 4, 2019, at 10:07 AM, Matthew Vernon  wrote:
> 
> Hi,
> 
> On 04/01/2019 15:34, Abhishek Lekshmanan wrote:
>> Ashley Merrick  writes:
>> 
>>> If this is another nasty bug like .2? Can’t you remove .3 from being
>>> available till .4 comes around?
>> 
>> This time there isn't a nasty bug, just a a couple of more fixes in .4
>> which would be better to have. We're building 12.2.4 as we speak
>>> Myself will wait for proper confirmation always but others may run an apt
>>> upgrade for any other reason and end up with .3 packages.
> 
> Without wishing to bang on about this, how is it still the case that
> packages are being pushed onto the official ceph.com repos that people
> shouldn't install? This has caused plenty of people problems on several
> occasions now, and a number of people have offered help to fix it...
> 
> Regards,
> 
> Matthew
> 
> 
> 
> -- 
> The Wellcome Sanger Institute is operated by Genome Research 
> Limited, a charity registered in England with number 1021457 and a 
> company registered in England with number 2742969, whose registered 
> office is 215 Euston Road, London, NW1 2BE. 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Should ceph build against libcurl4 for Ubuntu 18.04 and later?

2018-12-13 Thread Reed Dier
Figured I would chime in as also having this issue.

Moving from 16.04 to 18.04 on some OSD nodes.
I have been using the ceph apt repo
> deb https://download.ceph.com/debian-luminous/ xenial main

During the release-upgrade, it can’t find a candidate package, and actually 
removes the ceph-osd package.

This is what dpkg showed after the reboot:
> $ dpkg -l | grep ceph
> rc  ceph-base 12.2.10-1xenial 
> amd64common ceph daemon libraries and management tools
> rc  ceph-common   12.2.10-1xenial 
> amd64common utilities to mount and interact with a ceph 
> storage cluster
> ii  ceph-deploy   2.0.1   
> all  Ceph-deploy is an easy to use configuration tool
> ii  ceph-fuse 12.2.10-1xenial 
> amd64FUSE-based client for the Ceph distributed file system
> rc  ceph-mds  12.2.10-1xenial 
> amd64metadata server for the ceph distributed file system
> rc  ceph-mgr  12.2.10-1xenial 
> amd64manager for the ceph distributed storage system
> rc  ceph-mon  12.2.10-1xenial 
> amd64monitor server for the ceph storage system
> rc  ceph-osd  12.2.10-1xenial 
> amd64OSD server for the ceph storage system
> ii  libcephfs212.2.10-1xenial 
> amd64Ceph distributed file system client library
> ii  python-cephfs 12.2.10-1xenial 
> amd64Python 2 libraries for the Ceph libcephfs library


After the reboot to complete the upgrade, I tried to reinstall the ceph-osd 
package using the bionic repo, and got complaints about ceph-common and ceph-base, 
which turned into an endless rabbit hole of dependencies that led to libcurl3.

My solution in the end, to get things back up and running, was to remove the ceph 
repo, move to the Ubuntu repo, and manually specify the version in apt for every 
package in the chain.
> $ sudo apt install ceph-common=12.2.7-0ubuntu0.18.04.1 
> ceph-base=12.2.7-0ubuntu0.18.04.1 ceph-fuse=12.2.7-0ubuntu0.18.04.1 
> ceph-osd=12.2.7-0ubuntu0.18.04.1 libcephfs2=12.2.7-0ubuntu0.18.04.1 
> python-cephfs=12.2.7-0ubuntu0.18.04.1 python-rados=12.2.7-0ubuntu0.18.04.1 
> librbd1=12.2.7-0ubuntu0.18.04.1 python-rbd=12.2.7-0ubuntu0.18.04.1 
> librados2=12.2.7-0ubuntu0.18.04.1 librados-dev=12.2.7-0ubuntu0.18.04.1 
> libradosstriper1=12.2.7-0ubuntu0.18.04.1

Obviously not ideal, but I assume that it will hold me over until the 
luminous-bionic repos are able to handle libcurl4 dependencies, or my Mimic 
window rolls around.
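
One extra step I'd suggest on top of the above is holding the packages, so a later 
apt upgrade doesn't quietly replace the hand-picked versions; roughly:

sudo apt-mark hold ceph-base ceph-common ceph-osd ceph-fuse \
    libcephfs2 librados2 librbd1 librados-dev libradosstriper1 \
    python-cephfs python-rados python-rbd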

Just wanted to chime in that I ran into this issue, and how I worked around it.

Reed

> On Nov 26, 2018, at 11:11 AM, Ken Dreyer  wrote:
> 
> On Thu, Nov 22, 2018 at 11:47 AM Matthew Vernon  wrote:
>> 
>> On 22/11/2018 13:40, Paul Emmerich wrote:
>>> We've encountered the same problem on Debian Buster
>> 
>> It looks to me like this could be fixed simply by building the Bionic
>> packages in a Bionic chroot (ditto Buster); maybe that could be done in
>> future? Given I think the packaging process is being reviewed anyway at
>> the moment (hopefully 12.2.10 will be along at some point...)
> 
> That's how we're building it currently. We build ceph in pbuilder
> chroots that correspond to each distro.
> 
> On master, debian/control has Build-Depends: libcurl4-openssl-dev so
> I'm not sure why we'd end up with a dependency on libcurl3.
> 
> Would you please give me a minimal set of `apt-get` reproduction steps
> on Bionic for this issue? Then we can get it into tracker.ceph.com.
> 
> - Ken
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Favorite SSD

2018-09-17 Thread Reed Dier
SM863a were always good to me.
Micron 5100 MAX are fine, but felt less consistent than the Samsung’s.
Haven’t had any issues with Intel S4600.

Intel S3710’s obviously not available anymore, but those were a crowd favorite.
Sadly, the Micron 5200 line doesn't seem to have a high-endurance SKU like the 
5100 line did.

Samsung SM883 should be shipping in volume shortly, which will be a close 
replacement to the SM863a.

Reed

> On Sep 17, 2018, at 1:04 PM, Serkan Çoban  wrote:
> 
> Intel DC series also popular both nvme and ssd use case.
> https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-d3-s4610-series.html
> 
> On Mon, Sep 17, 2018 at 8:10 PM Robert Stanford  
> wrote:
>> 
>> 
>> Awhile back the favorite SSD for Ceph was the Samsung SM863a.  Are there any 
>> larger SSDs that are known to work well with Ceph?  I'd like around 1TB if 
>> possible.  Is there any better alternative to the SM863a?
>> 
>> Regards
>>   R
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client hangs

2018-08-07 Thread Reed Dier
This is the first I am hearing about this as well.

Granted, I am using ceph-fuse rather than the kernel client at this point, but 
that isn’t etched in stone.

Curious if there is more to share.

Reed

> On Aug 7, 2018, at 9:47 AM, Webert de Souza Lima  
> wrote:
> 
> 
> Yan, Zheng mailto:uker...@gmail.com>> wrote on Tue, Aug 7, 2018 at 7:51 PM:
> On Tue, Aug 7, 2018 at 7:15 PM Zhenshi Zhou  > wrote:
> this can cause memory deadlock. you should avoid doing this
> 
> > Yan, Zheng mailto:uker...@gmail.com>> wrote on Tue, Aug 7, 2018 at 7:12 PM:
> >>
> >> did you mount cephfs on the same machines that run ceph-osd?
> >>
> 
> I didn't know about this. I run this setup in production. :P 
> 
> Regards,
> 
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
> IRC NICK - WebertRLZ
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Reed Dier
These SSDs are definitely up to the task (3-5 DWPD over 5 years); however, I 
mostly use an abundance of caution and try to minimize unnecessary data movement 
so as not to exacerbate things.

I definitely could, I just err on the side of conserving wear.

Reed

> On Aug 6, 2018, at 11:19 AM, Richard Hesketh  
> wrote:
> 
> I would have thought that with the write endurance on modern SSDs,
> additional write wear from the occasional rebalance would honestly be
> negligible? If you're hitting them hard enough that you're actually
> worried about your write endurance, a rebalance or two is peanuts
> compared to your normal I/O. If you're not, then there's more than
> enough write endurance in an SSD to handle daily rebalances for years.
> 
> On 06/08/18 17:05, Reed Dier wrote:
>> This has been my modus operandi when replacing drives.
>> 
>> Only having ~50 OSD’s for each drive type/pool, rebalancing can be a lengthy 
>> process, and in the case of SSD’s, shuffling data adds unnecessary write 
>> wear to the disks.
>> 
>> When migrating from filestore to bluestore, I would actually forklift an 
>> entire failure domain using the below script, and the noout, norebalance, 
>> norecover flags.
>> 
>> This would keep crush from pushing data around until I had all of the drives 
>> replaced, and would then keep crush from trying to recover until I was ready.
>> 
>>> # $1 use $ID or osd.id
>>> # $2 use $DATA or /dev/sdx
>>> # $3 use $NVME or /dev/nvmeXnXpX
>>> 
>>> sudo systemctl stop ceph-osd@$1.service
>>> sudo ceph-osd -i $1 --flush-journal
>>> sudo umount /var/lib/ceph/osd/ceph-$1
>>> sudo ceph-volume lvm zap /dev/$2
>>> ceph osd crush remove osd.$1
>>> ceph auth del osd.$1
>>> ceph osd rm osd.$1
>>> sudo ceph-volume lvm create --bluestore --data /dev/$2 --block.db /dev/$3
>> 
>> For a single drive, this would stop it, remove it from crush, make a new one 
>> (and let it retake the old/existing osd.id), and then after I unset the 
>> norebalance/norecover flags, then it backfills from the other copies to the 
>> replaced drive, and doesn’t move data around.
>> That script is specific for filestore to bluestore somewhat, as the 
>> flush-journal command is no longer used in bluestore.
>> 
>> Hope thats helpful.
>> 
>> Reed
>> 
>>> On Aug 6, 2018, at 9:30 AM, Richard Hesketh  
>>> wrote:
>>> 
>>> Waiting for rebalancing is considered the safest way, since it ensures
>>> you retain your normal full number of replicas at all times. If you take
>>> the disk out before rebalancing is complete, you will be causing some
>>> PGs to lose a replica. That is a risk to your data redundancy, but it
>>> might be an acceptable one if you prefer to just get the disk replaced
>>> quickly.
>>> 
>>> Personally, if running at 3+ replicas, briefly losing one isn't the end
>>> of the world; you'd still need two more simultaneous disk failures to
>>> actually lose data, though one failure would cause inactive PGs (because
>>> you are running with min_size >= 2, right?). If running pools with only
>>> two replicas at size = 2 I absolutely would not remove a disk without
>>> waiting for rebalancing unless that disk was actively failing so badly
>>> that it was making rebalancing impossible.
>>> 
>>> Rich
>>> 
>>> On 06/08/18 15:20, Josef Zelenka wrote:
>>>> Hi, our procedure is usually(assured that the cluster was ok the
>>>> failure, with 2 replicas as crush rule)
>>>> 
>>>> 1.Stop the OSD process(to keep it from coming up and down and putting
>>>> load on the cluster)
>>>> 
>>>> 2. Wait for the "Reweight" to come to 0(happens after 5 min i think -
>>>> can be set manually but i let it happen by itself)
>>>> 
>>>> 3. remove the osd from cluster(ceph auth del, ceph osd crush remove,
>>>> ceph osd rm)
>>>> 
>>>> 4. note down the journal partitions if needed
>>>> 
>>>> 5. umount drive, replace the disk with new one
>>>> 
>>>> 6. ensure permissions are set to ceph:ceph in /dev
>>>> 
>>>> 7. mklabel gpt on the new drive
>>>> 
>>>> 8. create the new osd with ceph-disk prepare(automatically adds it to
>>>> the crushmap)
>>>> 
>>>> 
>>>> your procedure sounds reasonable to me, as far as i'm concerned you
>>>> shouldn't have to wait for rebalancing

Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Reed Dier
This has been my modus operandi when replacing drives.

Only having ~50 OSD’s for each drive type/pool, rebalancing can be a lengthy 
process, and in the case of SSD’s, shuffling data adds unnecessary write wear 
to the disks.

When migrating from filestore to bluestore, I would actually forklift an entire 
failure domain using the below script, and the noout, norebalance, norecover 
flags.

This would keep crush from pushing data around until I had all of the drives 
replaced, and would then keep crush from trying to recover until I was ready.

> # $1 use $ID or osd.id
> # $2 use $DATA or /dev/sdx
> # $3 use $NVME or /dev/nvmeXnXpX
> 
> sudo systemctl stop ceph-osd@$1.service
> sudo ceph-osd -i $1 --flush-journal
> sudo umount /var/lib/ceph/osd/ceph-$1
> sudo ceph-volume lvm zap /dev/$2
> ceph osd crush remove osd.$1
> ceph auth del osd.$1
> ceph osd rm osd.$1
> sudo ceph-volume lvm create --bluestore --data /dev/$2 --block.db /dev/$3

For a single drive, this stops it, removes it from crush, and creates a new OSD 
(letting it retake the old/existing osd.id); then, once I unset the 
norebalance/norecover flags, it backfills to the replaced drive from the other 
copies rather than moving data around elsewhere.
That script is somewhat specific to the filestore-to-bluestore migration, as the 
flush-journal command is no longer used with bluestore.
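
For a pure bluestore replacement these days, I would expect the equivalent to look 
roughly like the following, keeping the osd id so crush doesn't shuffle data 
(untested here; the id and device paths are placeholders):

ID=12                    # osd id being replaced
DATA=/dev/sdx            # new data device
NVME=/dev/nvme0n1p1      # block.db partition

sudo systemctl stop ceph-osd@$ID.service
ceph osd destroy $ID --yes-i-really-mean-it     # keeps the osd id in the map
sudo ceph-volume lvm zap $DATA
sudo ceph-volume lvm create --bluestore --osd-id $ID --data $DATA --block.db $NVME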

Hope that's helpful.

Reed

> On Aug 6, 2018, at 9:30 AM, Richard Hesketh  
> wrote:
> 
> Waiting for rebalancing is considered the safest way, since it ensures
> you retain your normal full number of replicas at all times. If you take
> the disk out before rebalancing is complete, you will be causing some
> PGs to lose a replica. That is a risk to your data redundancy, but it
> might be an acceptable one if you prefer to just get the disk replaced
> quickly.
> 
> Personally, if running at 3+ replicas, briefly losing one isn't the end
> of the world; you'd still need two more simultaneous disk failures to
> actually lose data, though one failure would cause inactive PGs (because
> you are running with min_size >= 2, right?). If running pools with only
> two replicas at size = 2 I absolutely would not remove a disk without
> waiting for rebalancing unless that disk was actively failing so badly
> that it was making rebalancing impossible.
> 
> Rich
> 
> On 06/08/18 15:20, Josef Zelenka wrote:
>> Hi, our procedure is usually(assured that the cluster was ok the
>> failure, with 2 replicas as crush rule)
>> 
>> 1.Stop the OSD process(to keep it from coming up and down and putting
>> load on the cluster)
>> 
>> 2. Wait for the "Reweight" to come to 0(happens after 5 min i think -
>> can be set manually but i let it happen by itself)
>> 
>> 3. remove the osd from cluster(ceph auth del, ceph osd crush remove,
>> ceph osd rm)
>> 
>> 4. note down the journal partitions if needed
>> 
>> 5. umount drive, replace the disk with new one
>> 
>> 6. ensure permissions are set to ceph:ceph in /dev
>> 
>> 7. mklabel gpt on the new drive
>> 
>> 8. create the new osd with ceph-disk prepare(automatically adds it to
>> the crushmap)
>> 
>> 
>> your procedure sounds reasonable to me, as far as i'm concerned you
>> shouldn't have to wait for rebalancing after you remove the osd. all
>> this might not be 100% per ceph books but it works for us :)
>> 
>> Josef
>> 
>> 
>> On 06/08/18 16:15, Iztok Gregori wrote:
>>> Hi Everyone,
>>> 
>>> Which is the best way to replace a failing (SMART Health Status:
>>> HARDWARE IMPENDING FAILURE) OSD hard disk?
>>> 
>>> Normally I will:
>>> 
>>> 1. set the OSD as out
>>> 2. wait for rebalancing
>>> 3. stop the OSD on the osd-server (unmount if needed)
>>> 4. purge the OSD from CEPH
>>> 5. physically replace the disk with the new one
>>> 6. with ceph-deploy:
>>> 6a   zap the new disk (just in case)
>>> 6b   create the new OSD
>>> 7. add the new osd to the crush map.
>>> 8. wait for rebalancing.
>>> 
>>> My questions are:
>>> 
>>> - Is my procedure reasonable?
>>> - What if I skip the #2 and instead to wait for rebalancing I directly
>>> purge the OSD?
>>> - Is better to reweight the OSD before take it out?
>>> 
>>> I'm running a Luminous (12.2.2) cluster with 332 OSDs, failure domain
>>> is host.
>>> 
>>> Thanks,
>>> Iztok
>>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Balancer per Pool/Crush Unit

2018-08-03 Thread Reed Dier
I suppose I may have found the solution I was unaware existed.

> balancer optimize  { [...]} :  Run optimizer to create a 
> new plan

So apparently you can create a plan specific to a pool(s).
So just to double check this, I created two plans, plan1 with the hdd pool (and 
not the ssd pool); plan2 with no arguments.

I then ran ceph balancer show planN and also ceph osd crush weight-set dump.
Then compared the values in the weight-set dump against the values in the two 
plans, and concluded that plan1 did not adjust the values for ssd osd’s, which 
is exactly what I was looking for:

> ID  CLASS WEIGHTTYPE NAMESTATUS REWEIGHT PRI-AFF
> -1417.61093 host ceph00
>  24   ssd   1.76109 osd.24   up  1.0 1.0
>  25   ssd   1.76109 osd.25   up  1.0 1.0
>  26   ssd   1.76109 osd.26   up  1.0 1.0
>  27   ssd   1.76109 osd.27   up  1.0 1.0
>  28   ssd   1.76109 osd.28   up  1.0 1.0
>  29   ssd   1.76109 osd.29   up  1.0 1.0
>  30   ssd   1.76109 osd.30   up  1.0 1.0
>  31   ssd   1.76109 osd.31   up  1.0 1.0
>  32   ssd   1.76109 osd.32   up  1.0 1.0
>  33   ssd   1.76109 osd.33   up  1.0 1.0


ceph osd crush weight-set dump
>{
> "bucket_id": -14,
> "weight_set": [
> [
> 1.756317,
> 1.613647,
> 1.733200,
> 1.735825,
> 1.961304,
> 1.583069,
> 1.963791,
> 1.773041,
> 1.890228,
> 1.793457
> ]
> ]
> },


plan1 (no change)
> ceph osd crush weight-set reweight-compat 24 1.756317
> ceph osd crush weight-set reweight-compat 25 1.613647
> ceph osd crush weight-set reweight-compat 26 1.733200
> ceph osd crush weight-set reweight-compat 27 1.735825
> ceph osd crush weight-set reweight-compat 28 1.961304
> ceph osd crush weight-set reweight-compat 29 1.583069
> ceph osd crush weight-set reweight-compat 30 1.963791
> ceph osd crush weight-set reweight-compat 31 1.773041
> ceph osd crush weight-set reweight-compat 32 1.890228
> ceph osd crush weight-set reweight-compat 33 1.793457


plan2 (change)
> ceph osd crush weight-set reweight-compat 24 1.742185
> ceph osd crush weight-set reweight-compat 25 1.608330
> ceph osd crush weight-set reweight-compat 26 1.753393
> ceph osd crush weight-set reweight-compat 27 1.713531
> ceph osd crush weight-set reweight-compat 28 1.964446
> ceph osd crush weight-set reweight-compat 29 1.629001
> ceph osd crush weight-set reweight-compat 30 1.961968
> ceph osd crush weight-set reweight-compat 31 1.738253
> ceph osd crush weight-set reweight-compat 32 1.884098
> ceph osd crush weight-set reweight-compat 33 1.779180


Hopefully this will be helpful for someone else who overlooks this in the -h 
output.
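
For completeness, the per-pool workflow ends up looking roughly like this, using 
the hdd pool name from the eval output above:

ceph balancer mode crush-compat       # compat weight-sets, as shown above
ceph balancer optimize plan-hdd hdd   # extra pool args limit the plan to those pools
ceph balancer show plan-hdd           # review the reweight-compat commands
ceph balancer eval plan-hdd           # projected score if executed
ceph balancer execute plan-hdd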

Reed

> On Aug 1, 2018, at 6:05 PM, Reed Dier  wrote:
> 
> Hi Cephers,
> 
> I’m starting to play with the Ceph Balancer plugin after moving to straw2 and 
> running into something I’m surprised I haven’t seen posted here.
> 
> My cluster has two crush roots, one for HDD, one for SSD.
> 
> Right now, HDD’s are a single pool to themselves, SSD’s are a single pool to 
> themselves.
> 
> Using Ceph Balancer Eval, I can see the eval score for the hdd’s (worse), and 
> the ssd’s (better), and the blended score of the cluster overall.
> pool “hdd" score 0.012529 (lower is better)
> pool “ssd" score 0.004654 (lower is better)
> current cluster score 0.008484 (lower is better)
> 
> My problem is that I need to get my hdd’s better, and stop touching my ssd's, 
> because shuffling data wear’s the ssd's unnecessarily, and it has actually 
> gotten the distribution worse over time. https://imgur.com/RVh0jfH 
> <https://imgur.com/RVh0jfH>
> You can see that between 06:00 and 09:00 on the second day in the graph that 
> the spread was very tight, and then it expanded back.
> 
> So my question is, how can I run the balancer on just my hdd’s without 
> touching my ssd’s?
> 
> I removed about 15% of the PG’s living on the HDD’s because they were empty.
> I also have two tiers of HDD’s 8TB’s and 2TB’s, but they are roughly equally 
> weighted in crush at the chassis level where my failure domains are 
> configured.
> Hopefully this abbreviated ceph osd tree d

[ceph-users] Ceph Balancer per Pool/Crush Unit

2018-08-01 Thread Reed Dier
Hi Cephers,

I’m starting to play with the Ceph Balancer plugin after moving to straw2 and 
running into something I’m surprised I haven’t seen posted here.

My cluster has two crush roots, one for HDD, one for SSD.

Right now, HDD’s are a single pool to themselves, SSD’s are a single pool to 
themselves.

Using Ceph Balancer Eval, I can see the eval score for the hdd’s (worse), and 
the ssd’s (better), and the blended score of the cluster overall.
pool “hdd" score 0.012529 (lower is better)
pool “ssd" score 0.004654 (lower is better)
current cluster score 0.008484 (lower is better)

My problem is that I need to get my hdd distribution better and stop touching my 
ssd's, because shuffling data wears the SSDs unnecessarily, and it has actually 
made the distribution worse over time. https://imgur.com/RVh0jfH 

You can see that between 06:00 and 09:00 on the second day in the graph that 
the spread was very tight, and then it expanded back.

So my question is, how can I run the balancer on just my hdd’s without touching 
my ssd’s?

I removed about 15% of the PG’s living on the HDD’s because they were empty.
I also have two tiers of HDD’s 8TB’s and 2TB’s, but they are roughly equally 
weighted in crush at the chassis level where my failure domains are configured.
Hopefully this abbreviated ceph osd tree displays the hierarchy. Multipliers 
for that bucket on right.
> ID  CLASS WEIGHTTYPE NAME
>  -1   218.49353 root default.hdd
> -10   218.49353 rack default.rack-hdd
> -7043.66553 chassis hdd-2tb-chassis1  *1
> -6743.66553 host hdd-2tb-24-1   *1
>  74   hdd   1.81940 osd.74  *24
> -5543.70700 chassis hdd-8tb-chassis1*4
>  -221.85350 host hdd-8tb-3-1
>   0   hdd   7.28450 osd.0   *3
>  -321.85350 host hdd-8tb-3-1
>   1   hdd   7.28450 osd.1   *3


I assume this doesn't complicate things too much, but figured I would mention it, 
as I assume it is harder to distribute evenly across OSDs with a 4:1 size delta.


If I create a plan plan1 with ceph balancer optimize plan1,
then do a show plan1, I see an entry:
ceph osd crush weight-set reweight-compat $OSD $ArbitraryNumberNearOsdSize

Could I then copy this output, remove entries for SSD OSD’s and then run the 
ceph osd crush weight-set reweight-compat commands in a script?
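
Something along these lines is what I have in mind, assuming device classes are 
set so the SSD OSDs can be listed (sketch only):

ceph balancer optimize plan1
ceph balancer show plan1 > plan1.sh
for osd in $(ceph osd crush class ls-osd ssd); do
    sed -i "/reweight-compat $osd /d" plan1.sh   # drop the lines touching SSD OSDs
done
bash plan1.sh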

I and my SSD’s appreciate any insight.

Thanks,

Reed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] separate monitoring node

2018-06-22 Thread Reed Dier

> On Jun 22, 2018, at 2:14 AM, Stefan Kooman  wrote:
> 
> Just checking here: Are you using the telegraf ceph plugin on the nodes?
> In that case you _are_ duplicating data. But the good news is that you
> don't need to. There is a Ceph mgr telegraf plugin now (mimic) which
> also works on luminous: http://docs.ceph.com/docs/master/mgr/telegraf/

Hi Stefan,

I'm just curious what advantage you see in the Telegraf plugin feeding into 
InfluxDB, rather than the InfluxDB plugin that already exists in ceph-mgr, unless 
you are outputting from Telegraf to a different TSDB.

Still have my OSD’s reporting their own stats in collectd daemons on all of my 
OSD nodes, as a supplement to the direct ceph-mgr -> influxdb statistics.
Almost moved everything to telegraf after Luminous broke some collectd data 
collection, but it all got sorted out.

> 
> You configure a listener ([[inputs.socket_listener]) on the nodes where
> you have ceph mgr running (probably mons) and have the mgr plugin send
> data to the socket. The telegraf daemon will pick it up and send it to
> influx (or whatever target you configured). As there is only one active
> mgr, you don't have the issue of duplicating data, and the solution is
> still HA.
> Gr. Stefan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


And +1 for icinga2 alerting.

Thanks,

Reed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread Reed Dier
Appreciate the input.

Wasn’t sure if ceph-volume was the one setting these bits of metadata or 
something else.

Appreciate the help guys.

Thanks,

Reed

> The fix is in core Ceph (the OSD/BlueStore code), not ceph-volume. :) 
> journal_rotational is still a thing in BlueStore; it represents the combined 
> WAL+DB devices.
> -Greg 
> On Jun 4, 2018, at 11:53 AM, Alfredo Deza  wrote:
> 
> ceph-volume doesn't do anything here with the device metadata, and is
> something that bluestore has as an internal mechanism. Unsure if there
> is anything
> one can do to change this on the OSD itself (vs. injecting args)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread Reed Dier
Hi Caspar,

David is correct: the issue I was having was that SSD OSDs with an NVMe bluefs_db 
were reporting as HDD, which created the artificial throttle David mentioned (a 
recovery sleep meant to keep spinning rust from thrashing). Not sure if the 
journal_rotational bit should be 1, but either way it shouldn't affect you, since 
yours really are HDD OSDs. Curious how these OSDs were deployed, per the below 
part of the message.
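
To answer the direct question of what was injected: the hdd recovery sleep on the 
mis-detected OSDs, roughly as below (0 is the default for the ssd variant; the OSD 
selection assumes device classes are set):

for osd in $(ceph osd crush class ls-osd ssd); do
    ceph tell osd.$osd injectargs '--osd_recovery_sleep_hdd 0'
done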

Copying Alfredo, as I'm not sure whether something changed in ceph-volume between 
12.2.2 (when this originally happened) and 12.2.5 (I'm sure plenty did). I 
recently had an NVMe drive fail on me unexpectedly (curse you, Micron) and had to 
nuke and redo some SSD OSDs, which was my first time deploying with ceph-deploy 
after the ceph-disk deprecation. The new OSDs appear to report the rotational 
status correctly, where the older ones did not; so that path seems to be working, 
and I just wanted to provide some positive feedback there. I'm not sure if there's 
an easy way to change those metadata tags on the existing OSDs, so that I don't 
have to inject the args every time I need to reweight. It also feels like 
journal_rotational shouldn't be a thing in bluestore?

> ceph osd metadata |grep ‘id\|model\|type\|rotational’
> "id": 63,
> "bluefs_db_model": "MTFDHAX1T2MCF-1AN1ZABYY",
> "bluefs_db_rotational": "0",
> "bluefs_db_type": "nvme",
> "bluefs_slow_model": "",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_model": "",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_type": "ssd",
> "journal_rotational": "1",
> "rotational": "0"
> "id": 64,
> "bluefs_db_model": "INTEL SSDPED1D960GAY",
> "bluefs_db_rotational": "0",
> "bluefs_db_type": "nvme",
> "bluefs_slow_model": "",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_model": "",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_type": "ssd",
> "journal_rotational": "0",
> "rotational": "0"


osd.63 being one deployed using ceph-volume lvm in 12.2.2 and osd.64 being 
redeployed using ceph-deploy in 12.2.5 using ceph-volume  backend.

Reed

> On Jun 4, 2018, at 8:16 AM, David Turner  wrote:
> 
> I don't believe this really applies to you. The problem here was with an SSD 
> osd that was incorrectly labeled as an HDD osd by ceph. The fix was to inject 
> a sleep seeing if 0 for those osds to speed up recovery. The sleep is needed 
> to not kill hdds to avoid thrashing, but the bug was SSDs were being 
> incorrectly identified as HDD and SSDs don't have a problem with thrashing.
> 
> You can try increasing osd_max_backfills. Watch your disk utilization as you 
> do this so that you don't accidentally kill your client io by setting that 
> too high, assuming that still needs priority.
> 
> On Mon, Jun 4, 2018, 3:55 AM Caspar Smit  <mailto:caspars...@supernas.eu>> wrote:
> Hi Reed,
> 
> "Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
> bluestore opened the floodgates."
> 
> What exactly did you change/inject here?
> 
> We have a cluster with 10TB SATA HDD's which each have a 100GB SSD based 
> block.db
> 
> Looking at ceph osd metadata for each of those:
> 
> "bluefs_db_model": "SAMSUNG MZ7KM960",
> "bluefs_db_rotational": "0",
> "bluefs_db_type": "ssd",
> "bluefs_slow_model": "ST1NM0086-2A",
> "bluefs_slow_rotational": "1",
> "bluefs_slow_type": "hdd",
> "bluestore_bdev_rotational": "1",
> "bluestore_bdev_type": "hdd",
> "default_device_class": "hdd",
> "journal_rotational": "1",
> "osd_objectstore": "bluestore",
> "rotational": "1"
> 
> Looks to me if i'm hitting the same issue, isn't it?
> 
> ps. An upgrade of Ceph is planned in the near future but for now i would like 
> to use the workaround if applicable to me.
> 
> Thank y

Re: [ceph-users] Luminous cluster - how to find out which clients are still jewel?

2018-05-29 Thread Reed Dier
Possibly helpful,

If you are able to hit your ceph-mgr dashboard in a web browser, I find it 
possible to see a table of currently connected cephfs clients, hostnames, 
state, type (userspace/kernel), and ceph version.

Assuming that the link is persistent, for me the url is ceph-mgr:7000/clients/1/

Or if that fails, the path for me from ceph-mgr:7000 is the folder icon in the 
lefthand menu, then the specific filesystem; in the top left it will say 
"Filesystem ..." and, below that, "Clients: ### Detail…".
The "Detail…" link should lead you to the aforementioned URL, with a table of all 
the cephfs clients.
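
If the dashboard isn't reachable, roughly the same information is available from 
the MDS admin socket; a sketch, assuming the MDS daemon is named after the short 
hostname and jq is installed:

sudo ceph daemon mds.$(hostname -s) session ls | jq '.[].client_metadata'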

Hopefully that's helpful.

Reed

> On May 29, 2018, at 4:42 AM, Paul Emmerich  wrote:
> 
> https://github.com/ceph/ceph/pull/17535 
>  
> It's not in Luminous, though.
> 
> Paul
> 
> 2018-05-29 9:41 GMT+02:00 Linh Vu  >:
> Ah I remember that one, I still have it on my watch list on tracker.ceph.com 
> 
> 
> Thanks 
> 
> Alternatively, is there a way to check on a client node what ceph features 
> (jewel, luminous etc.) it has? In our case, it's all CephFS clients, and it's 
> a mix between ceph-fuse (which is Luminous 12.2.5) and kernel client 
> (4.15.x). I suspect the latter is only supporting jewel features but I'd like 
> to confirm. 
> From: Massimo Sgaravatto  >
> Sent: Tuesday, 29 May 2018 4:51:56 PM
> To: Linh Vu
> Cc: ceph-users
> Subject: Re: [ceph-users] Luminous cluster - how to find out which clients 
> are still jewel?
>  
> As far as I know the status wrt this issue is still the one reported in this 
> thread:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020585.html
>  
> 
> 
> See also:
> 
> http://tracker.ceph.com/issues/21315 
> 
> Cheers, Massimo
> 
> On Tue, May 29, 2018 at 8:39 AM, Linh Vu  > wrote:
> Hi all,
> 
> I have a Luminous 12.2.4 cluster. This is what `ceph features` tells me:
> 
> ...
> "client": {
> "group": {
> "features": "0x7010fb86aa42ada",
> "release": "jewel",
> "num": 257
> },
> "group": {
> "features": "0x1ffddff8eea4fffb",
> "release": "luminous",
> "num": 820
> }
> }
> ...
> 
> How do I find out which clients (IP/hostname/IDs) are actually on jewel 
> feature set?
> 
> Regards,
> Linh
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> 
> 
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io 
> 
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io 
> Tel: +49 89 1896585 90
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Show and Tell: Grafana cluster dashboard

2018-05-07 Thread Reed Dier
I’ll +1 on InfluxDB rather than Prometheus, though I think having a version for 
each infrastructure path would be best.
I’m sure plenty here have existing InfluxDB infrastructure as their TSDB of 
choice, and moving to Prometheus would be less advantageous.

Conversely, I’m sure all of the Prometheus folks would be less inclined to move 
to InfluxDB for TSDB, so I think supporting both paths would be the best choice.

Reed

> On May 7, 2018, at 3:06 AM, Marc Roos  wrote:
> 
> 
> Looks nice 
> 
> - I rather have some dashboards with collectd/influxdb.
> - Take into account bigger tv/screens eg 65" uhd. I am putting more 
> stats on them than viewing them locally in a webbrowser.
> - What is to be considered most important to have on your ceph 
> dashboard? As a newbie I find it difficult to determine what is 
> important to monitor.
> - Maybe also some docs on what metrics you have taken and argumentation 
> on how you used them (could be usefull if one wants to modify the 
> dashboard for some other backend)
> 
> Ceph performance counters description.
> https://access.redhat.com/documentation/en/red-hat-ceph-storage/1.3/paged/administration-guide/chapter-9-performance-counters
> 
> 
> -Original Message-
> From: Jan Fajerski [mailto:jfajer...@suse.com] 
> Sent: maandag 7 mei 2018 12:32
> To: ceph-devel
> Cc: ceph-users
> Subject: [ceph-users] Show and Tell: Grafana cluster dashboard
> 
> Hi all,
> I'd like to request comments and feedback about a Grafana dashboard for 
> Ceph cluster monitoring.
> 
> https://youtu.be/HJquM127wMY
> 
> https://github.com/ceph/ceph/pull/21850
> 
> The goal is to eventually have a set of default dashboards in the Ceph 
> repository that offer decent monitoring for clusters of various (maybe 
> even all) sizes and applications, or at least serve as a starting point 
> for customizations.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mgr balancer getting started

2018-04-12 Thread Reed Dier
Hi ceph-users,

I am trying to figure out how to go about making ceph balancer do its magic, as 
I have some pretty unbalanced distribution across osd’s currently, both SSD and 
HDD.

Cluster is 12.2.4 on Ubuntu 16.04.
All OSD’s have been migrated to bluestore.

Specifically, my HDD’s are the main driver of trying to run the balancer, as I 
have a near full HDD.

> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR  PGS
>  4   hdd 7.28450  1.0 7459G 4543G  2916G 60.91 0.91 126
> 21   hdd 7.28450  1.0 7459G 4626G  2833G 62.02 0.92 130
>  0   hdd 7.28450  1.0 7459G 4869G  2589G 65.28 0.97 133
>  5   hdd 7.28450  1.0 7459G 4866G  2592G 65.24 0.97 136
> 14   hdd 7.28450  1.0 7459G 4829G  2629G 64.75 0.96 138
>  8   hdd 7.28450  1.0 7459G 4829G  2629G 64.75 0.96 139
>  7   hdd 7.28450  1.0 7459G 4959G  2499G 66.49 0.99 141
> 23   hdd 7.28450  1.0 7459G 5159G  2299G 69.17 1.03 142
>  2   hdd 7.28450  1.0 7459G 5042G  2416G 67.60 1.01 144
>  1   hdd 7.28450  1.0 7459G 5292G  2167G 70.95 1.06 145
> 10   hdd 7.28450  1.0 7459G 5441G  2018G 72.94 1.09 146
> 19   hdd 7.28450  1.0 7459G 5125G  2333G 68.72 1.02 146
>  9   hdd 7.28450  1.0 7459G 5123G  2335G 68.69 1.02 146
> 18   hdd 7.28450  1.0 7459G 5187G  2271G 69.54 1.04 149
> 22   hdd 7.28450  1.0 7459G 5369G  2089G 71.98 1.07 150
> 12   hdd 7.28450  1.0 7459G 5375G  2083G 72.07 1.07 152
> 17   hdd 7.28450  1.0 7459G 5498G  1961G 73.71 1.10 152
> 11   hdd 7.28450  1.0 7459G 5621G  1838G 75.36 1.12 154
> 15   hdd 7.28450  1.0 7459G 5576G  1882G 74.76 1.11 154
> 20   hdd 7.28450  1.0 7459G 5797G  1661G 77.72 1.16 158
>  6   hdd 7.28450  1.0 7459G 5951G  1508G 79.78 1.19 164
>  3   hdd 7.28450  1.0 7459G 5960G  1499G 79.90 1.19 166
> 16   hdd 7.28450  1.0 7459G 6161G  1297G 82.60 1.23 169
> 13   hdd 7.28450  1.0 7459G 6678G   780G 89.54 1.33 184

I sorted this by PGs, and you can see that PG count tracks actual disk usage 
pretty closely; since the balancer attempts to distribute PGs more evenly, I 
should end up with a more even distribution of usage.
Hopefully that passes the sanity check.

> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR  PGS
> 49   ssd 1.76109  1.0 1803G  882G   920G 48.96 0.73 205
> 72   ssd 1.76109  1.0 1803G  926G   876G 51.38 0.77 217
> 30   ssd 1.76109  1.0 1803G  950G   852G 52.73 0.79 222
> 48   ssd 1.76109  1.0 1803G  961G   842G 53.29 0.79 225
> 54   ssd 1.76109  1.0 1803G  980G   823G 54.36 0.81 230
> 63   ssd 1.76109  1.0 1803G  985G   818G 54.62 0.81 230
> 35   ssd 1.76109  1.0 1803G  997G   806G 55.30 0.82 233
> 45   ssd 1.76109  1.0 1803G 1002G   801G 55.58 0.83 234
> 67   ssd 1.76109  1.0 1803G 1004G   799G 55.69 0.83 234
> 42   ssd 1.76109  1.0 1803G 1006G   796G 55.84 0.83 235
> 52   ssd 1.76109  1.0 1803G 1009G   793G 56.00 0.83 238
> 61   ssd 1.76109  1.0 1803G 1014G   789G 56.24 0.84 238
> 68   ssd 1.76109  1.0 1803G 1021G   782G 56.62 0.84 238
> 32   ssd 1.76109  1.0 1803G 1021G   781G 56.67 0.84 240
> 65   ssd 1.76109  1.0 1803G 1024G   778G 56.83 0.85 240
> 26   ssd 1.76109  1.0 1803G 1022G   780G 56.72 0.84 241
> 59   ssd 1.76109  1.0 1803G 1031G   771G 57.20 0.85 241
> 47   ssd 1.76109  1.0 1803G 1035G   767G 57.42 0.86 242
> 37   ssd 1.76109  1.0 1803G 1036G   767G 57.46 0.86 243
> 28   ssd 1.76109  1.0 1803G 1043G   760G 57.85 0.86 245
> 40   ssd 1.76109  1.0 1803G 1047G   755G 58.10 0.87 245
> 41   ssd 1.76109  1.0 1803G 1046G   756G 58.06 0.86 245
> 62   ssd 1.76109  1.0 1803G 1050G   752G 58.25 0.87 245
> 39   ssd 1.76109  1.0 1803G 1051G   751G 58.30 0.87 246
> 56   ssd 1.76109  1.0 1803G 1050G   752G 58.27 0.87 246
> 70   ssd 1.76109  1.0 1803G 1041G   761G 57.75 0.86 246
> 73   ssd 1.76109  1.0 1803G 1057G   746G 58.63 0.87 247
> 44   ssd 1.76109  1.0 1803G 1056G   746G 58.58 0.87 248
> 38   ssd 1.76109  1.0 1803G 1059G   743G 58.75 0.87 249
> 51   ssd 1.76109  1.0 1803G 1063G   739G 58.99 0.88 249
> 33   ssd 1.76109  1.0 1803G 1067G   736G 59.18 0.88 250
> 36   ssd 1.76109  1.0 1803G 1071G   731G 59.41 0.88 251
> 55   ssd 1.76109  1.0 1803G 1066G   737G 59.11 0.88 251
> 27   ssd 1.76109  1.0 1803G 1078G   724G 59.81 0.89 252
> 31   ssd 1.76109  1.0 1803G 1079G   724G 59.84 0.89 252
> 69   ssd 1.76109  1.0 1803G 1075G   727G 59.63 0.89 252
> 46   ssd 1.76109  1.0 1803G 1082G   721G 60.00 0.89 253
> 58   ssd 1.76109  1.0 1803G 1081G   721G 59.98 0.89 253
> 66   ssd 1.76109  1.0 1803G 1081G   722G 59.96 0.89 253
> 34   ssd 1.76109  1.0 1803G 1091G   712G 60.52 0.90 255
> 43   ssd 1.76109  1.0 1803G 1089G   713G 60.42 0.90 256
> 64   ssd 1.76109  1.0 1803G 1097G   705G 60.87 0.91 257
> 24   ssd 1.76109  1.0 1803G 1113G   690G 61.72 0.92 260
> 25   ssd 1.76109  1.0 1803G 1146G   656G 63.58 0.95 269
> 29   ssd 1.76109  1.0 1803G 1146G   

Re: [ceph-users] Disk write cache - safe?

2018-03-14 Thread Reed Dier
Tim, 

I can corroborate David’s sentiments as it pertains to being a disaster.

In the early days of my Ceph cluster, I had 8TB SAS drives behind an LSI RAID 
controller as RAID0 volumes (no IT mode), with on-drive write-caching enabled 
(pdcache=default). I subsequently had the data center where this was 
colocated struck by lightning and grid power interrupted with the generators 
failing to start, so when the UPS for the DC went, so did my cluster. Most of 
my issues were related to xfs file system errors.

Luckily, I was bitten before I had important data on Ceph, mostly CephFS, but 
everything was lost.
It was a painful, but extremely helpful learning experience.

I was able to recreate the osd failure with power pulls to nodes, narrowing my 
issues to the pdcache.
I was then able to add BBUs to the RAID cards and enable write-back to 
improve write performance, making my disks fault tolerant while still keeping 
the write performance gain. And if the BBU fails, I have the controller 
configured to revert to write-through, which I have confirmed is also tolerant 
of power loss.

I have later upgraded these drives to bluestore, and did a power pull on a 
single node to verify integrity, which I was able to do.

Worth mentioning that my 8TB SAS spinners were journaled/are block.db’d by an 
Intel P3700 NVMe disk which advertises "Enhanced power-loss data protection” 
which appears to come in the form of capacitors on the NVMe card, so in-flight 
writes can still be flushed to flash during a power loss.

tl;dr steer clear of on-disk write caching where possible unless you can 
guarantee never losing power.
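
For anyone wanting to check their own drives, a quick sketch (assuming plain 
SATA/SAS devices visible to the OS; drives hidden behind RAID0 virtual disks 
need the controller tooling instead):

> # show the current on-drive write cache setting
> hdparm -W /dev/sdX
> # disable the volatile write cache
> hdparm -W 0 /dev/sdX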

Reed

> On Mar 14, 2018, at 3:08 PM, David Byte  wrote:
> 
> Tim,
> 
> Enabling the drive write cache is a recipe for disaster.  In the event of a 
> power interruption, you have in-flight data that is stored in the cache and 
> uncommitted to the disk media itself.  Being that the power is interrupted 
> and the drive cache does not have a battery or supercap to keep it powered, 
> you end up losing the data in the cache.  Now, if this is just a single node 
> and you have size=3 or a decent EC scheme in place, Ceph should be able to 
> recover and keep going.  However, if it is more than 1 node that loses power, 
> you start running the risk of corrupting multiple or dare I say *all* copies 
> of the data that was supposed to be written, with the result being data loss. 
>  This is why is it the standard practice to disable drive caches, not just 
> with Ceph, but with any enterprise storage offering.
> 
> In testing that I've done, using a battery backed cache on the RAID 
> controller with each drive as it's own RAID-0 has positive performance 
> results.  This is something to try and see if you can regain some of the 
> performance, but as always in storage, YMMV.
> 
> David Byte
> Sr. Technology Strategist
> SCE Enterprise Linux 
> SCE Enterprise Storage
> Alliances and SUSE Embedded
> db...@suse.com
> 918.528.4422
> On 3/14/18, 2:43 PM, "ceph-users on behalf of Tim Bishop" 
>  wrote:
> 
>I'm using Ceph on Ubuntu 16.04 on Dell R730xd servers. A recent [1]
>update to the PERC firmware disabled the disk write cache by default
>which made a noticable difference to the latency on my disks (spinning
>disks, not SSD) - by as much as a factor of 10.
> 
>For reference their change list says:
> 
>"Changes default value of drive cache for 6 Gbps SATA drive to disabled.
>This is to align with the industry for SATA drives. This may result in a
>performance degradation especially in non-Raid mode. You must perform an
>AC reboot to see existing configurations change."
> 
>It's fairly straightforward to re-enable the cache either in the PERC
>BIOS, or by using hdparm, and doing so returns the latency back to what
>it was before.
> 
>Checking the Ceph documentation I can see that older versions [2]
>recommended disabling the write cache for older kernels. But given I'm
>using a newer kernel, and there's no mention of this in the Luminous
>docs, is it safe to assume it's ok to enable the disk write cache now?
> 
>If it makes a difference, I'm using a mixture of filestore and bluestore
>OSDs - migration is still ongoing.
> 
>Thanks,
> 
>Tim.
> 
>[1] - 
> https://www.dell.com/support/home/uk/en/ukdhs1/Drivers/DriversDetails?driverId=8WK8N
>[2] - 
> http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/
> 
>-- 
>Tim Bishop
>http://www.bishnet.net/tim/
>PGP Key: 0x6C226B37FDF38D55
> 
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mds suicide on upgrade

2018-03-12 Thread Reed Dier
Good eye,

Thanks Dietmar,

Glad to know this isn’t a standard issue, hopefully anything in the future will 
get caught and/or make it into release notes.

Thanks,

Reed

> On Mar 12, 2018, at 12:55 PM, Dietmar Rieder <dietmar.rie...@i-med.ac.at> 
> wrote:
> 
> Hi,
> 
> See: 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/025092.html 
> <http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/025092.html>
> 
> Might be of interest.
> 
> Dietmar
> 
> Am 12. März 2018 18:19:51 MEZ schrieb Reed Dier <reed.d...@focusvq.com>:
> Figured I would see if anyone has seen this or can see something I am doing 
> wrong.
> 
> Upgrading all of my daemons from 12.2.2. to 12.2.4.
> 
> Followed the documentation, upgraded mons, mgrs, osds, then mds’s in that 
> order.
> 
> All was fine, until the MDSs.
> 
> I have two MDS’s in Active:Standby config. I decided it made sense to upgrade 
> the Standby MDS, so I could gracefully step down the current active, after 
> the standby was upgraded.
> 
> However, when I upgraded the standby, it caused the working active to 
> suicide, and the then standby to immediately rejoin as active when it 
> restarted, which didn’t leave me feeling warm and fuzzy about upgrading MDS’s 
> in the future.
> 
> Attaching log entries that would appear to be the culprit.
> 
>  2018-03-12 13:07:38.981339 7ff0cdc40700  0 mds.0 handle_mds_map mdsmap 
> compatset compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file 
> layout v2} not writeable with daemon features 
> compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline 
> data,8=file layout v2}, killing myself
>  2018-03-12 13:07:38.981353 7ff0cdc40700  1 mds.0 suicide.  wanted state 
> up:active
>  2018-03-12 13:07:40.000753 7ff0cdc40700  1 mds.0.119543 shutdown: shutting 
> down rank 0
>  2018-03-12 13:08:27.325667 7f32cc992200  0 set uid:gid to 64045:64045 
> (ceph:ceph)
>  2018-03-12 13:08:27.325687 7f32cc992200  0 ceph version 12.2.4 
> (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process 
> (unknown), pid 66854
>  2018-03-12 13:08:27.326795 7f32cc992200  0 pidfile_write: ignore empty 
> --pid-file
>  2018-03-12 13:08:32.350266 7f32c6440700  1 mds.0 handle_mds_map standby
> 
> Hopefully there may be some config issue with my mds_map or something like 
> that which may be an easy fix to prevent something like this in the future.
> 
> Thanks,
> 
> Reed
> 
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> -- 
> ___
> D i e t m a r R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Phone: +43 512 9003 71402
> Fax: +43 512 9003 73100
> Email: dietmar.rie...@i-med.ac.at
> Web: http://www.icbi.at <http://www.icbi.at/>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mds suicide on upgrade

2018-03-12 Thread Reed Dier
Figured I would see if anyone has seen this or can see something I am doing 
wrong.

Upgrading all of my daemons from 12.2.2. to 12.2.4.

Followed the documentation, upgraded mons, mgrs, osds, then mds’s in that order.

All was fine, until the MDSs.

I have two MDS’s in Active:Standby config. I decided it made sense to upgrade 
the Standby MDS, so I could gracefully step down the current active, after the 
standby was upgraded.

However, when I upgraded the standby, it caused the working active to suicide, 
and the then standby to immediately rejoin as active when it restarted, which 
didn’t leave me feeling warm and fuzzy about upgrading MDS’s in the future.

Attaching log entries that would appear to be the culprit.

> 2018-03-12 13:07:38.981339 7ff0cdc40700  0 mds.0 handle_mds_map mdsmap 
> compatset compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file 
> layout v2} not writeable with daemon features 
> compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline 
> data,8=file layout v2}, killing myself
> 2018-03-12 13:07:38.981353 7ff0cdc40700  1 mds.0 suicide.  wanted state 
> up:active
> 2018-03-12 13:07:40.000753 7ff0cdc40700  1 mds.0.119543 shutdown: shutting 
> down rank 0
> 2018-03-12 13:08:27.325667 7f32cc992200  0 set uid:gid to 64045:64045 
> (ceph:ceph)
> 2018-03-12 13:08:27.325687 7f32cc992200  0 ceph version 12.2.4 
> (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process 
> (unknown), pid 66854
> 2018-03-12 13:08:27.326795 7f32cc992200  0 pidfile_write: ignore empty 
> --pid-file
> 2018-03-12 13:08:32.350266 7f32c6440700  1 mds.0 handle_mds_map standby

Hopefully there may be some config issue with my mds_map or something like that 
which may be an easy fix to prevent something like this in the future.
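
For the next upgrade, a few checks I will probably run beforehand (only a 
sketch; ‘cephfs’ is my filesystem name):

> ceph versions                # confirm which daemons are on which release
> ceph mds compat show         # compat/incompat flags recorded in the mdsmap
> ceph fs get cephfs           # current active/standby layout and fs flags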

Thanks,

Reed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
Quick turn around,

Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
bluestore opened the floodgates.

> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr

Graph of performance jump. Extremely marked.
https://imgur.com/a/LZR9R <https://imgur.com/a/LZR9R>

So at least we now have the gun to go with the smoke.

Thanks for the help and appreciate you pointing me in some directions that I 
was able to use to figure out the issue.

Adding to ceph.conf for future OSD conversions.
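
For the archives, roughly what that looks like (the zero values are what worked 
here; tune to taste):

> # runtime, on the already-running OSDs
> ceph tell osd.* injectargs '--osd_recovery_sleep_hdd 0 --osd_recovery_sleep_hybrid 0'
>
> # persisted in ceph.conf for the OSDs still to be converted
> [osd]
> osd_recovery_sleep_hdd = 0
> osd_recovery_sleep_hybrid = 0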

Thanks,

Reed


> On Feb 26, 2018, at 4:12 PM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> For the record, I am not seeing a demonstrative fix by injecting the value of 
> 0 into the OSDs running.
>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require 
>> restart)
> 
> If it does indeed need to be restarted, I will need to wait for the current 
> backfills to finish their process as restarting an OSD would bring me under 
> min_size.
> 
> However, doing config show on the osd daemon appears to have taken the value 
> of 0.
> 
>> ceph daemon osd.24 config show | grep recovery_sleep
>> "osd_recovery_sleep": "0.00",
>> "osd_recovery_sleep_hdd": "0.10",
>> "osd_recovery_sleep_hybrid": "0.00",
>> "osd_recovery_sleep_ssd": "0.00",
> 
> 
> I may take the restart as an opportunity to also move to 12.2.3 at the same 
> time, since it is not expected that that should affect this issue.
> 
> I could also attempt to change osd_recovery_sleep_hdd as well, since these 
> are ssd osd’s, it shouldn’t make a difference, but its a free move.
> 
> Thanks,
> 
> Reed
> 
>> On Feb 26, 2018, at 3:42 PM, Gregory Farnum <gfar...@redhat.com 
>> <mailto:gfar...@redhat.com>> wrote:
>> 
>> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier <reed.d...@focusvq.com 
>> <mailto:reed.d...@focusvq.com>> wrote:
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
>> solution to getting the metadata configured correctly.
>> 
>> Yes, that's a good workaround as long as you don't have any actual hybrid 
>> OSDs (or aren't worried about them sleeping...I'm not sure if that setting 
>> came from experience or not).
>>  
>> 
>> For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
>> with NVMe block.db.
>> 
>>> {
>>> "id": 24,
>>> "arch": "x86_64",
>>> "back_addr": "",
>>> "back_iface": "bond0",
>>> "bluefs": "1",
>>> "bluefs_db_access_mode": "blk",
>>> "bluefs_db_block_size": "4096",
>>> "bluefs_db_dev": "259:0",
>>> "bluefs_db_dev_node": "nvme0n1",
>>> "bluefs_db_driver": "KernelDevice",
>>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>>> "bluefs_db_rotational": "0",
>>> "bluefs_db_serial": " ",
>>> "bluefs_db_size": "16000221184",
>>> "bluefs_db_type": "nvme",
>>> "bluefs_single_shared_device": "0",
>>> "bluefs_slow_access_mode": "blk",
>>> "bluefs_slow_block_size": "4096",
>>> "bluefs_slow_dev": "253:8",
>>> "bluefs_slow_dev_node": "dm-8",
>>> "bluefs_slow_driver": "KernelDevice",
>>> "bluefs_slow_model": "",
>>> "bluefs_slow_partition_path": "/dev/dm-8",
>>> "bluefs_slow_rotational": "0",
>>> "bluefs_slow_size": "1920378863616",
>>> "bluefs_slow_type": "ssd",
>>> "bluestore_bdev_access_mode": "blk",
>>> "bluestore_bdev_block_size": "4096",
>>> "bluestore_bdev_dev": "253:8",
>>> "

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
solution to getting the metadata configured correctly.

For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
with NVMe block.db.

> {
> "id": 24,
> "arch": "x86_64",
> "back_addr": "",
> "back_iface": "bond0",
> "bluefs": "1",
> "bluefs_db_access_mode": "blk",
> "bluefs_db_block_size": "4096",
> "bluefs_db_dev": "259:0",
> "bluefs_db_dev_node": "nvme0n1",
> "bluefs_db_driver": "KernelDevice",
> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> "bluefs_db_partition_path": "/dev/nvme0n1p4",
> "bluefs_db_rotational": "0",
> "bluefs_db_serial": " ",
> "bluefs_db_size": "16000221184",
> "bluefs_db_type": "nvme",
> "bluefs_single_shared_device": "0",
> "bluefs_slow_access_mode": "blk",
> "bluefs_slow_block_size": "4096",
> "bluefs_slow_dev": "253:8",
> "bluefs_slow_dev_node": "dm-8",
> "bluefs_slow_driver": "KernelDevice",
> "bluefs_slow_model": "",
> "bluefs_slow_partition_path": "/dev/dm-8",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_size": "1920378863616",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_access_mode": "blk",
> "bluestore_bdev_block_size": "4096",
> "bluestore_bdev_dev": "253:8",
> "bluestore_bdev_dev_node": "dm-8",
> "bluestore_bdev_driver": "KernelDevice",
> "bluestore_bdev_model": "",
> "bluestore_bdev_partition_path": "/dev/dm-8",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_size": "1920378863616",
> "bluestore_bdev_type": "ssd",
> "ceph_version": "ceph version 12.2.2 
> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
> "default_device_class": "ssd",
> "distro": "ubuntu",
> "distro_description": "Ubuntu 16.04.3 LTS",
> "distro_version": "16.04",
> "front_addr": "",
> "front_iface": "bond0",
> "hb_back_addr": "",
> "hb_front_addr": "",
> "hostname": “host00",
> "journal_rotational": "1",
> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 
> 2018",
> "kernel_version": "4.13.0-26-generic",
> "mem_swap_kb": "124999672",
> "mem_total_kb": "131914008",
> "os": "Linux",
> "osd_data": "/var/lib/ceph/osd/ceph-24",
> "osd_objectstore": "bluestore",
> "rotational": "0"
> }


So it looks like it guessed(?) the bluestore_bdev_type/default_device_class 
correctly (though it may have been an inherited value?), and bluefs_db_type was 
likewise correctly set to nvme.

So I’m not sure why journal_rotational is still showing 1.
Maybe something in the ceph-volume lvm piece that isn’t correctly setting that 
flag on OSD creation?
Also seems like the journal_rotational field should have been deprecated in 
bluestore as bluefs_db_rotational should cover that, and if there were a WAL 
partition as well, I assume there would be something to the tune of 
bluefs_wal_rotational or something like that, and journal would never be used 
for bluestore?

Appreciate the help.

Thanks,
Reed

> On Feb 26, 2018, at 1:28 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> 
> On Mon, Feb 26, 2018 at 11:21 AM Reed Dier <reed.d...@focusvq.com 
> <mailto:reed.d...@focusvq.com>> wrote:
> The ‘good perf’ that I reported 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
The ‘good perf’ that I reported below was the result of beginning 5 new 
bluestore conversions which results in a leading edge of ‘good’ performance, 
before trickling off.

This performance lasted about 20 minutes, where it backfilled a small set of 
PGs off of non-bluestore OSDs.

Current performance is now hovering around:
> pool objects-ssd id 20
>   recovery io 14285 kB/s, 202 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr

> What are you referencing when you talk about recovery ops per second?

These are recovery ops as reported by ceph -s or via stats exported via influx 
plugin in mgr, and via local collectd collection.

> Also, what are the values for osd_recovery_sleep_hdd and 
> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
> your BlueStore SSD OSDs are correctly reporting both themselves and their 
> journals as non-rotational?

This yields more interesting results.
Pasting results for 3 sets of OSDs in this order
 {0}hdd+nvme block.db
{24}ssd+nvme block.db
{59}ssd+nvme journal

> ceph osd metadata | grep 'id\|rotational'
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "1",
> "bluestore_bdev_rotational": "1",
> "journal_rotational": "1",
> "rotational": “1"
> "id": 24,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": “0"
> "id": 59,
> "journal_rotational": "0",
> "rotational": “0"

I wonder if it matters/is correct to see "journal_rotational": “1” for the 
bluestore OSD’s {0,24} with nvme block.db.

Hope this may be helpful in determining the root cause.

If it helps, all of the OSD’s were originally deployed with ceph-deploy, but 
are now being redone with ceph-volume locally on each host.

Thanks,

Reed

> On Feb 26, 2018, at 1:00 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> 
> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier <reed.d...@focusvq.com 
> <mailto:reed.d...@focusvq.com>> wrote:
> After my last round of backfills completed, I started 5 more bluestore 
> conversions, which helped me recognize a very specific pattern of performance.
> 
>> pool objects-ssd id 20
>>   recovery io 757 MB/s, 10845 objects/s
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
> 
> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
> When backfilling from bluestore SSD OSD’s, they appear to be throttled at the 
> aforementioned <20 ops per OSD.
> 
> Wait, is that the current state? What are you referencing when you talk about 
> recovery ops per second?
> 
> Also, what are the values for osd_recovery_sleep_hdd and 
> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
> your BlueStore SSD OSDs are correctly reporting both themselves and their 
> journals as non-rotational?
> -Greg
>  
> 
> This would corroborate why the first batch of SSD’s I migrated to bluestore 
> were all at “full” speed, as all of the OSD’s they were backfilling from were 
> filestore based, compared to increasingly bluestore backfill targets, leading 
> to increasingly long backfill times as I move from one host to the next.
> 
> Looking at the recovery settings, the recovery_sleep and recovery_sleep_ssd 
> values across bluestore or filestore OSDs are showing as 0 values, which 
> means no sleep/throttle if I am reading everything correctly.
> 
>> sudo ceph daemon osd.73 config show | grep recovery
>> "osd_allow_recovery_below_min_size": "true",
>> "osd_debug_skip_full_check_in_recovery": "false",
>> "osd_force_recovery_pg_log_entries_factor": "1.30",
>> "osd_min_recovery_priority": "0",
>> "osd_recovery_cost": "20971520",
>> "osd_recovery_delay_start": "0.00",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_recovery_max_active": "35",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_max_omap_entries_per_chunk": "64

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
After my last round of backfills completed, I started 5 more bluestore 
conversions, which helped me recognize a very specific pattern of performance.

> pool objects-ssd id 20
>   recovery io 757 MB/s, 10845 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr

The “non-throttled” backfills are only coming from filestore SSD OSD’s.
When backfilling from bluestore SSD OSD’s, they appear to be throttled at the 
aforementioned <20 ops per OSD.

This would corroborate why the first batch of SSD’s I migrated to bluestore 
were all at “full” speed, as all of the OSD’s they were backfilling from were 
filestore based, compared to increasingly bluestore backfill targets, leading 
to increasingly long backfill times as I move from one host to the next.

Looking at the recovery settings, the recovery_sleep and recovery_sleep_ssd 
values across bluestore or filestore OSDs are showing as 0 values, which means 
no sleep/throttle if I am reading everything correctly.

> sudo ceph daemon osd.73 config show | grep recovery
> "osd_allow_recovery_below_min_size": "true",
> "osd_debug_skip_full_check_in_recovery": "false",
> "osd_force_recovery_pg_log_entries_factor": "1.30",
> "osd_min_recovery_priority": "0",
> "osd_recovery_cost": "20971520",
> "osd_recovery_delay_start": "0.00",
> "osd_recovery_forget_lost_objects": "false",
> "osd_recovery_max_active": "35",
> "osd_recovery_max_chunk": "8388608",
> "osd_recovery_max_omap_entries_per_chunk": "64000",
> "osd_recovery_max_single_start": "1",
> "osd_recovery_op_priority": "3",
> "osd_recovery_op_warn_multiple": "16",
> "osd_recovery_priority": "5",
> "osd_recovery_retry_interval": "30.00",
> "osd_recovery_sleep": "0.00",
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.025000",
> "osd_recovery_sleep_ssd": "0.00",
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_thread_timeout": "30",
> "osd_scrub_during_recovery": "false",


As far as I know, the device class is configured correctly; it all shows as 
ssd/hdd as expected in ceph osd tree.

So hopefully this may be enough of a smoking gun to help narrow down where this 
may be stemming from.

Thanks,

Reed

> On Feb 23, 2018, at 10:04 AM, David Turner <drakonst...@gmail.com> wrote:
> 
> Here is a [1] link to a ML thread tracking some slow backfilling on 
> bluestore.  It came down to the backfill sleep setting for them.  Maybe it 
> will help.
> 
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html 
> <https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html>
> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier <reed.d...@focusvq.com 
> <mailto:reed.d...@focusvq.com>> wrote:
> Probably unrelated, but I do keep seeing this odd negative objects degraded 
> message on the fs-metadata pool:
> 
>> pool fs-metadata-ssd id 16
>>   -34/3 objects degraded (-1133.333%)
>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
> 
> Don’t mean to clutter the ML/thread, however it did seem odd, maybe its a 
> culprit? Maybe its some weird sampling interval issue thats been solved in 
> 12.2.3?
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 23, 2018, at 8:26 AM, Reed Dier <reed.d...@focusvq.com 
>> <mailto:reed.d...@focusvq.com>> wrote:
>> 
>> Below is ceph -s
>> 
>>>   cluster:
>>> id: {id}
>>> health: HEALTH_WARN
>>> noout flag(s) set
>>> 260610/1068004947 objects misplaced (0.024%)
>>> Degraded data redundancy: 23157232/1068004947 objects degraded 
>>> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>> 
>>>   services:
>>> mon: 3 daemons, quorum mon02,mon01,mon03
>>> mgr: mon03(active), standbys: mon02
>>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>>  flags noout
>>> 
>>>   data:
>>> pools:   5 pools, 5316 pgs
>>&

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread Reed Dier
Probably unrelated, but I do keep seeing this odd negative objects degraded 
message on the fs-metadata pool:

> pool fs-metadata-ssd id 16
>   -34/3 objects degraded (-1133.333%)
>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr

Don’t mean to clutter the ML/thread, however it did seem odd; maybe it’s a 
culprit? Maybe it’s some weird sampling interval issue that’s been solved in 
12.2.3?

Thanks,

Reed


> On Feb 23, 2018, at 8:26 AM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> Below is ceph -s
> 
>>   cluster:
>> id: {id}
>> health: HEALTH_WARN
>> noout flag(s) set
>> 260610/1068004947 objects misplaced (0.024%)
>> Degraded data redundancy: 23157232/1068004947 objects degraded 
>> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>> 
>>   services:
>> mon: 3 daemons, quorum mon02,mon01,mon03
>> mgr: mon03(active), standbys: mon02
>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>  flags noout
>> 
>>   data:
>> pools:   5 pools, 5316 pgs
>> objects: 339M objects, 46627 GB
>> usage:   154 TB used, 108 TB / 262 TB avail
>> pgs: 23157232/1068004947 objects degraded (2.168%)
>>  260610/1068004947 objects misplaced (0.024%)
>>  4984 active+clean
>>  183  active+undersized+degraded+remapped+backfilling
>>  145  active+undersized+degraded+remapped+backfill_wait
>>  3active+remapped+backfill_wait
>>  1active+remapped+backfilling
>> 
>>   io:
>> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
> 
> Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
> fs-metadata pool at 32 PG.
> 
>> Are you sure the recovery is actually going slower, or are the individual 
>> ops larger or more expensive?
> 
> The objects should not vary wildly in size.
> Even if they were differing in size, the SSDs are roughly idle in their 
> current state of backfilling when examining wait in iotop, or atop, or 
> sysstat/iostat.
> 
> This compares to when I was fully saturating the SATA backplane with over 
> 1000MB/s of writes to multiple disks when the backfills were going “full 
> speed.”
> 
> Here is a breakdown of recovery io by pool:
> 
>> pool objects-ssd id 20
>>   recovery io 6779 kB/s, 92 objects/s
>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>> 
>> pool cephfs-hdd id 17
>>   recovery io 40542 kB/s, 158 objects/s
>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
> 
> So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client 
> traffic at the moment, which seems conspicuous to me.
> 
> Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops, with 
> one OSD occasionally spiking up to 300-500 for a few minutes. Stats being 
> pulled by both local CollectD instances on each node, as well as the Influx 
> plugin in MGR as we evaluate that against collectd.
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <gfar...@redhat.com 
>> <mailto:gfar...@redhat.com>> wrote:
>> 
>> What's the output of "ceph -s" while this is happening?
>> 
>> Is there some identifiable difference between these two states, like you get 
>> a lot of throughput on the data pools but then metadata recovery is slower?
>> 
>> Are you sure the recovery is actually going slower, or are the individual 
>> ops larger or more expensive?
>> 
>> My WAG is that recovering the metadata pool, composed mostly of directories 
>> stored in omap objects, is going much slower for some reason. You can adjust 
>> the cost of those individual ops some by changing 
>> osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure 
>> which way you want to go or indeed if this has anything to do with the 
>> problem you're seeing. (eg, it could be that reading out the omaps is 
>> expensive, so you can get higher recovery op numbers by turning down the 
>> number of entries per request, but not actually see faster backfilling 
>> because you have to issue more requests.)
>> -Greg
>> 
>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <reed.d...@focusvq.com 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread Reed Dier
Below is ceph -s

>   cluster:
> id: {id}
> health: HEALTH_WARN
> noout flag(s) set
> 260610/1068004947 objects misplaced (0.024%)
> Degraded data redundancy: 23157232/1068004947 objects degraded 
> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
> 
>   services:
> mon: 3 daemons, quorum mon02,mon01,mon03
> mgr: mon03(active), standbys: mon02
> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>  flags noout
> 
>   data:
> pools:   5 pools, 5316 pgs
> objects: 339M objects, 46627 GB
> usage:   154 TB used, 108 TB / 262 TB avail
> pgs: 23157232/1068004947 objects degraded (2.168%)
>  260610/1068004947 objects misplaced (0.024%)
>  4984 active+clean
>  183  active+undersized+degraded+remapped+backfilling
>  145  active+undersized+degraded+remapped+backfill_wait
>  3active+remapped+backfill_wait
>  1active+remapped+backfilling
> 
>   io:
> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
> recovery: 37057 kB/s, 50 keys/s, 217 objects/s

Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
fs-metadata pool at 32 PG.

> Are you sure the recovery is actually going slower, or are the individual ops 
> larger or more expensive?

The objects should not vary wildly in size.
Even if they were differing in size, the SSDs are roughly idle in their current 
state of backfilling when examining wait in iotop, or atop, or sysstat/iostat.

This compares to when I was fully saturating the SATA backplane with over 
1000MB/s of writes to multiple disks when the backfills were going “full speed.”

Here is a breakdown of recovery io by pool:

> pool objects-ssd id 20
>   recovery io 6779 kB/s, 92 objects/s
>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
> 
> pool cephfs-hdd id 17
>   recovery io 40542 kB/s, 158 objects/s
>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr

So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client traffic 
at the moment, which seems conspicuous to me.

Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops, with 
one OSD occasionally spiking up to 300-500 for a few minutes. Stats being 
pulled by both local CollectD instances on each node, as well as the Influx 
plugin in MGR as we evaluate that against collectd.

Thanks,

Reed


> On Feb 22, 2018, at 6:21 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> 
> What's the output of "ceph -s" while this is happening?
> 
> Is there some identifiable difference between these two states, like you get 
> a lot of throughput on the data pools but then metadata recovery is slower?
> 
> Are you sure the recovery is actually going slower, or are the individual ops 
> larger or more expensive?
> 
> My WAG is that recovering the metadata pool, composed mostly of directories 
> stored in omap objects, is going much slower for some reason. You can adjust 
> the cost of those individual ops some by changing 
> osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure 
> which way you want to go or indeed if this has anything to do with the 
> problem you're seeing. (eg, it could be that reading out the omaps is 
> expensive, so you can get higher recovery op numbers by turning down the 
> number of entries per request, but not actually see faster backfilling 
> because you have to issue more requests.)
> -Greg
> 
> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier <reed.d...@focusvq.com 
> <mailto:reed.d...@focusvq.com>> wrote:
> Hi all,
> 
> I am running into an odd situation that I cannot easily explain.
> I am currently in the midst of destroy and rebuild of OSDs from filestore to 
> bluestore.
> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
> 
> My path to replacing the OSDs is to set the noout, norecover, norebalance 
> flag, destroy the OSD, create the OSD back, (iterate n times, all within a 
> single failure domain), unset the flags, and let it go. It finishes, rinse, 
> repeat.
> 
> For the SSD OSDs, they are SATA SSDs (Samsung SM863a) , 10 to a node, with 2 
> NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions for 
> block.db (previously filestore journals).
> 2x10GbE networking between the nodes. SATA backplane caps out at around 10 
> Gb/s as its 2x 6 Gb/s controllers. Luminous 12.2.2.
> 
> When the f

[ceph-users] SSD Bluestore Backfills Slow

2018-02-21 Thread Reed Dier
Hi all,

I am running into an odd situation that I cannot easily explain.
I am currently in the midst of destroy and rebuild of OSDs from filestore to 
bluestore.
With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
unexpected behavior. The HDDs and SSDs are set in crush accordingly.

My path to replacing the OSDs is to set the noout, norecover, norebalance flag, 
destroy the OSD, create the OSD back, (iterate n times, all within a single 
failure domain), unset the flags, and let it go. It finishes, rinse, repeat.

For the SSD OSDs, they are SATA SSDs (Samsung SM863a) , 10 to a node, with 2 
NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions for 
block.db (previously filestore journals).
2x10GbE networking between the nodes. SATA backplane caps out at around 10 Gb/s 
as its 2x 6 Gb/s controllers. Luminous 12.2.2.

When the flags are unset, recovery starts and I see a very large rush of 
traffic, however, after the first machine completed, the performance tapered 
off at a rapid pace and trickles. Comparatively, I’m getting 100-200 recovery 
ops on 3 HDDs, backfilling from 21 other HDDs, where as I’m getting 150-250 
recovery ops on 5 SSDs, backfilling from 40 other SSDs. Every once in a while I 
will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few 
hundred recovery ops from one OSD, and 8-15 ops from the others that are 
backfilling.

This is a far cry from the more than 15-30k recovery ops that it started off 
recovering with 1-3k recovery ops from a single OSD to the backfilling OSD(s). 
And an even farther cry from the >15k recovery ops I was sustaining for over an 
hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little 
under an hour, and I could do about 5 at a time and still keep it at roughly an 
hour to backfill all of them, but then I hit a roadblock after the first 
machine, when I tried to do 10 at a time (single machine). I am now still 
experiencing the same thing on the third node, while doing 5 OSDs at a time. 

The pools associated with these SSDs are cephfs-metadata, as well as a pure 
rados object pool we use for our own internal applications. Both are size=3, 
min_size=2.

It appears I am not the first to run into this, but it looks like there was no 
resolution: https://www.spinics.net/lists/ceph-users/msg41493.html 


Recovery parameters for the OSDs match what was in the previous thread, sans 
the osd conf block listed. And current osd_max_backfills = 30 and 
osd_recovery_max_active = 35. Very little activity on the OSDs during this 
period, so should not be any contention for iops on the SSDs.

The only oddity that I can attribute to things is that we had a few periods of 
time where the disk load on one of the mons was high enough to cause the mon to 
drop out of quorum for a brief amount of time, a few times. But I wouldn’t 
think backfills would just get throttled due to mons flapping.

Hopefully someone has some experience or can steer me in a path to improve the 
performance of the backfills so that I’m not stuck in backfill purgatory longer 
than I need to be.

Linking an imgur album with some screen grabs of the recovery ops over time for 
the first machine, versus the second and third machines to demonstrate the 
delta between them.
https://imgur.com/a/OJw4b 

Also including a ceph osd df of the SSDs, highlighted in red are the OSDs 
currently backfilling. Could this possibly be PG overdose? I don’t ever run 
into ‘stuck activating’ PGs, it’s just painfully slow backfills, like they are 
being throttled by ceph, that are causing me to worry. Drives aren’t worn, <30 
P/E cycles on the drives, so plenty of life left in them.

Thanks,
Reed

> $ ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
> 24   ssd 1.76109  1.0 1803G 1094G  708G 60.69 1.08 260
> 25   ssd 1.76109  1.0 1803G 1136G  667G 63.01 1.12 271
> 26   ssd 1.76109  1.0 1803G 1018G  785G 56.46 1.01 243
> 27   ssd 1.76109  1.0 1803G 1065G  737G 59.10 1.05 253
> 28   ssd 1.76109  1.0 1803G 1026G  776G 56.94 1.02 245
> 29   ssd 1.76109  1.0 1803G 1132G  671G 62.79 1.12 270
> 30   ssd 1.76109  1.0 1803G  944G  859G 52.35 0.93 224
> 31   ssd 1.76109  1.0 1803G 1061G  742G 58.85 1.05 252
> 32   ssd 1.76109  1.0 1803G 1003G  799G 55.67 0.99 239
> 33   ssd 1.76109  1.0 1803G 1049G  753G 58.20 1.04 250
> 34   ssd 1.76109  1.0 1803G 1086G  717G 60.23 1.07 257
> 35   ssd 1.76109  1.0 1803G  978G  824G 54.26 0.97 232
> 36   ssd 1.76109  1.0 1803G 1057G  745G 58.64 1.05 252
> 37   ssd 1.76109  1.0 1803G 1025G  777G 56.88 1.01 244
> 38   ssd 1.76109  1.0 1803G 1047G  756G 58.06 1.04 250
> 39   ssd 1.76109  1.0 1803G 1031G  771G 57.20 1.02 246
> 40   ssd 1.76109  1.0 1803G 1029G  774G 57.07 1.02 245
> 41   ssd 1.76109  1.0 1803G 1033G  770G 57.28 1.02 245
> 42   ssd 

Re: [ceph-users] Is there a "set pool readonly" command?

2018-02-12 Thread Reed Dier
I do know that there is a pause flag in Ceph.

What I do not know is if that also pauses recovery traffic, in addition to 
client traffic.

Also worth mentioning, this is a cluster-wide flag, not a pool level flag.
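
For reference, both knobs look roughly like this (sketch only; substitute your 
pool name):

> # cluster-wide: pause all client reads and writes
> ceph osd set pause
> ceph osd unset pause
>
> # per-pool alternative along the lines David suggested: raise min_size so I/O
> # blocks on PGs that do not yet have enough copies online
> ceph osd pool set <poolname> min_size 3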

Reed

> On Feb 11, 2018, at 11:45 AM, David Turner  wrote:
> 
> If you set min_size to 2 or more, it will disable reads and writes to the 
> pool by blocking requests. Min_size is the minimum copies of a PG that need 
> to be online to allow it to the data. If you only have 1 copy, then it will 
> prevent io. It's not a flag you can set on the pool, but it should work out. 
> If you have size=3, then min_size=3 should block most io until the pool is 
> almost fully backfilled.
> 
> 
> On Sun, Feb 11, 2018, 9:46 AM Nico Schottelius  > wrote:
> 
> Hello,
> 
> we have one pool, in which about 10 disks failed last week (fortunately
> mostly sequentially), which now has now some pgs that are only left on
> one disk.
> 
> Is there a command to set one pool into "read-only" mode or even
> "recovery io-only" mode so that the only thing same is doing is
> recovering and no client i/o will disturb that process?
> 
> Best,
> 
> Nico
> 
> 
> 
> --
> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread Reed Dier
Bit late for this to be helpful, but instead of zapping the lvm labels, you 
could alternatively destroy the lvm volume by hand.

> lvremove -f <vg name>/<lv name>
> vgremove <vg name>
> pvremove /dev/ceph-device (should wipe labels)


Then you should be able to run ‘ceph-volume lvm zap /dev/sdX’ and retry the 
'ceph-volume lvm create’ command (sans --osd-id flag) and it should run as well.
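
A worked example with made-up names (pull the real VG/LV names from lvs/vgs 
output on the OSD host; the ceph-generated names look like the ones in your 
output below):

> lvs -o lv_name,vg_name
> lvremove -f ceph-<vg uuid>/osd-block-<osd fsid>
> vgremove ceph-<vg uuid>
> pvremove /dev/sdc
> ceph-volume lvm zap /dev/sdc
> ceph-volume lvm create --bluestore --data /dev/sdc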

This info will hopefully be useful for those not as well versed with lvm as I 
am/was at the time I needed this info.

Reed

> On Jan 26, 2018, at 11:32 AM, David Majchrzak <da...@visions.se> wrote:
> 
> Thanks that helped!
> 
> Since I had already "halfway" created a lvm volume I wanted to start from the 
> beginning and zap it.
> 
> Tried to zap the raw device but failed since --destroy doesn't seem to be in 
> 12.2.2
> 
> http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/ 
> <http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/>
> 
> root@int1:~# ceph-volume lvm zap /dev/sdc --destroy
> usage: ceph-volume lvm zap [-h] [DEVICE]
> ceph-volume lvm zap: error: unrecognized arguments: --destroy
> 
> So i zapped it with the vg/lvm instead.
> ceph-volume lvm zap 
> /dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
> 
> However I run create on it since the LVM was already there.
> So I zapped it with sgdisk and ran dmsetup remove. After that I was able to 
> create it again.
> 
> However - each "ceph-volume lvm create" that I ran that failed, successfully 
> added an osd to crush map ;)
> 
> So I've got this now:
> 
> root@int1:~# ceph osd df tree
> ID CLASS WEIGHT  REWEIGHT SIZE  USEAVAIL  %USE  VAR  PGS TYPE NAME
> -1   2.60959- 2672G  1101G  1570G 41.24 1.00   - root default
> -2   0.87320-  894G   369G   524G 41.36 1.00   - host int1
>  3   ssd 0.43660  1.0  447G   358G 90295M 80.27 1.95 301 osd.3
>  8   ssd 0.43660  1.0  447G 11273M   436G  2.46 0.06  19 osd.8
> -3   0.86819-  888G   366G   522G 41.26 1.00   - host int2
>  1   ssd 0.43159  1.0  441G   167G   274G 37.95 0.92 147 osd.1
>  4   ssd 0.43660  1.0  447G   199G   247G 44.54 1.08 173 osd.4
> -4   0.86819-  888G   365G   523G 41.09 1.00   - host int3
>  2   ssd 0.43159  1.0  441G   193G   248G 43.71 1.06 174 osd.2
>  5   ssd 0.43660  1.0  447G   172G   274G 38.51 0.93 146 osd.5
>  0 00 0  0  0 00   0 osd.0
>  6 00 0  0  0 00   0 osd.6
>  7 00 0  0  0 00   0 osd.7
> 
> I guess I can just remove them from crush,auth and rm them?
> 
> Kind Regards,
> 
> David Majchrzak
> 
>> 26 jan. 2018 kl. 18:09 skrev Reed Dier <reed.d...@focusvq.com 
>> <mailto:reed.d...@focusvq.com>>:
>> 
>> This is the exact issue that I ran into when starting my bluestore 
>> conversion journey.
>> 
>> See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html 
>> <https://www.spinics.net/lists/ceph-users/msg41802.html>
>> 
>> Specifying --osd-id causes it to fail.
>> 
>> Below are my steps for OSD replace/migrate from filestore to bluestore.
>> 
>> BIG caveat here in that I am doing destructive replacement, in that I am not 
>> allowing my objects to be migrated off of the OSD I’m replacing before 
>> nuking it.
>> With 8TB drives it just takes way too long, and I trust my failure domains 
>> and other hardware to get me through the backfills.
>> So instead of 1) reading data off, writing data elsewhere 2) remove/re-add 
>> 3) reading data elsewhere, writing back on, I am taking step one out, and 
>> trusting my two other copies of the objects. Just wanted to clarify my steps.
>> 
>> I also set norecover and norebalance flags immediately prior to running 
>> these commands so that it doesn’t try to start moving data unnecessarily. 
>> Then when done, remove those flags, and let it backfill.
>> 
>>> systemctl stop ceph-osd@$ID.service <mailto:ceph-osd@$id.service>
>>> ceph-osd -i $ID --flush-journal
>>> umount /var/lib/ceph/osd/ceph-$ID
>>> ceph-volume lvm zap /dev/$ID
>>> ceph osd crush remove osd.$ID
>>> ceph auth del osd.$ID
>>> ceph osd rm osd.$ID
>>> ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME
>> 
>> So essentially I fully remove the OSD from crush and the osdmap, and when I 
>> add the OSD back, like I would a new OSD, it fills in the numeric gap with 
>> the $ID it had before.
>> 
>>

Re: [ceph-users] Migrating filestore to bluestore using ceph-volume

2018-01-26 Thread Reed Dier
This is the exact issue that I ran into when starting my bluestore conversion 
journey.

See my thread here: https://www.spinics.net/lists/ceph-users/msg41802.html 


Specifying --osd-id causes it to fail.

Below are my steps for OSD replace/migrate from filestore to bluestore.

BIG caveat here in that I am doing destructive replacement, in that I am not 
allowing my objects to be migrated off of the OSD I’m replacing before nuking 
it.
With 8TB drives it just takes way too long, and I trust my failure domains and 
other hardware to get me through the backfills.
So instead of 1) reading data off, writing data elsewhere 2) remove/re-add 3) 
reading data elsewhere, writing back on, I am taking step one out, and trusting 
my two other copies of the objects. Just wanted to clarify my steps.

I also set norecover and norebalance flags immediately prior to running these 
commands so that it doesn’t try to start moving data unnecessarily. Then when 
done, remove those flags, and let it backfill.

> systemctl stop ceph-osd@$ID.service
> ceph-osd -i $ID --flush-journal
> umount /var/lib/ceph/osd/ceph-$ID
> ceph-volume lvm zap /dev/$ID
> ceph osd crush remove osd.$ID
> ceph auth del osd.$ID
> ceph osd rm osd.$ID
> ceph-volume lvm create --bluestore --data /dev/$DATA --block.db /dev/$NVME

So essentially I fully remove the OSD from crush and the osdmap, and when I add 
the OSD back, like I would a new OSD, it fills in the numeric gap with the $ID 
it had before.
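
For what it’s worth, a quick before/after check that the gap gets reused 
(nothing fancy):

> ceph osd ls     # the removed id should be missing here
> ceph osd tree   # and should reappear after the ceph-volume lvm create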

Hope this is helpful.
Been working well for me so far, doing 3 OSDs at a time (half of a failure 
domain).

Reed

> On Jan 26, 2018, at 10:01 AM, David  wrote:
> 
> 
> Hi!
> 
> On luminous 12.2.2
> 
> I'm migrating some OSDs from filestore to bluestore using the "simple" method 
> as described in docs: 
> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/#convert-existing-osds
>  
> 
> Mark out and Replace.
> 
> However, at 9.: ceph-volume create --bluestore --data $DEVICE --osd-id $ID
> it seems to create the bluestore but it fails to authenticate with the old 
> osd-id auth.
> (the command above is also missing lvm or simple)
> 
> I think it's related to this:
> http://tracker.ceph.com/issues/22642 
> 
> # ceph-volume lvm create --bluestore --data /dev/sdc --osd-id 0
> Running command: sudo vgcreate --force --yes 
> ceph-efad7df8-721d-43d8-8d02-449406e70b90 /dev/sdc
>  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
> enabling it!
>  stdout: Physical volume "/dev/sdc" successfully created
>  stdout: Volume group "ceph-efad7df8-721d-43d8-8d02-449406e70b90" 
> successfully created
> Running command: sudo lvcreate --yes -l 100%FREE -n 
> osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9 
> ceph-efad7df8-721d-43d8-8d02-449406e70b90
>  stderr: WARNING: lvmetad is running but disabled. Restart lvmetad before 
> enabling it!
>  stdout: Logical volume "osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9" 
> created.
> Running command: sudo mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
> Running command: chown -R ceph:ceph /dev/dm-4
> Running command: sudo ln -s 
> /dev/ceph-efad7df8-721d-43d8-8d02-449406e70b90/osd-block-138ce507-f28a-45bf-814c-7fa124a9d9b9
>  /var/lib/ceph/osd/ceph-0/block
> Running command: sudo ceph --cluster ceph --name client.bootstrap-osd 
> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o 
> /var/lib/ceph/osd/ceph-0/activate.monmap
>  stderr: got monmap epoch 2
> Running command: ceph-authtool /var/lib/ceph/osd/ceph-0/keyring 
> --create-keyring --name osd.0 --add-key 
>  stdout: creating /var/lib/ceph/osd/ceph-0/keyring
>  stdout: added entity osd.0 auth auth(auid = 18446744073709551615 key= 
>  with 0 caps)
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/keyring
> Running command: chown -R ceph:ceph /var/lib/ceph/osd/ceph-0/
> Running command: sudo ceph-osd --cluster ceph --osd-objectstore bluestore 
> --mkfs -i 0 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --key 
>  --osd-data /var/lib/ceph/osd/ceph-0/ 
> --osd-uuid 138ce507-f28a-45bf-814c-7fa124a9d9b9 --setuser ceph --setgroup ceph
>  stderr: 2018-01-26 14:59:10.039549 7fd7ef951cc0 -1 
> bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
> label at offset 102: buffer::malformed_input: void 
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
> of struct encoding
>  stderr: 2018-01-26 14:59:10.039744 7fd7ef951cc0 -1 
> bluestore(/var/lib/ceph/osd/ceph-0//block) _read_bdev_label unable to decode 
> label at offset 102: buffer::malformed_input: void 
> bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
> of struct encoding
>  stderr: 2018-01-26 14:59:10.039925 7fd7ef951cc0 -1 

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Reed Dier
Thank you for documenting your progress and peril on the ML.

Luckily I only have 24x 8TB HDD and 50x 1.92TB SSDs to migrate over to 
bluestore.

8 nodes, 4 chassis (failure domain), 3 drives per node for the HDDs, so I’m 
able to do about 3 at a time (1 node) for rip/replace.

Definitely taking it slow and steady, and the SSDs will move quickly for 
backfills as well.
Seeing about 1TB/6hr on backfills, without much performance hit on everything 
else. With about 5TB average utilization on each 8TB disk, that works out to 
roughly 30 hours per host; at 8 hosts that is about 10 days, so a couple weeks 
is a safe amount of headway.
This write performance certainly seems better on bluestore than filestore, so 
that likely helps as well.

Expect I can probably refill an SSD osd in about an hour or two, and will 
likely stagger those out.
But with such a small number of osd’s currently, I’m taking the by-hand 
approach rather than scripting it so as to avoid similar pitfalls.

Reed 

> On Jan 11, 2018, at 12:38 PM, Brady Deetz <bde...@gmail.com> wrote:
> 
> I hear you on time. I have 350 x 6TB drives to convert. I recently posted 
> about a disaster I created automating my migration. Good luck
> 
> On Jan 11, 2018 12:22 PM, "Reed Dier" <reed.d...@focusvq.com 
> <mailto:reed.d...@focusvq.com>> wrote:
> I am in the process of migrating my OSDs to bluestore finally and thought I 
> would give you some input on how I am approaching it.
> Some of saga you can find in another ML thread here: 
> https://www.spinics.net/lists/ceph-users/msg41802.html 
> <https://www.spinics.net/lists/ceph-users/msg41802.html>
> 
> My first OSD I was cautious, and I outed the OSD without downing it, allowing 
> it to move data off.
> Some background on my cluster, for this OSD, it is an 8TB spinner, with an 
> NVMe partition previously used for journaling in filestore, intending to be 
> used for block.db in bluestore.
> 
> Then I downed it, flushed the journal, destroyed it, zapped with ceph-volume, 
> set norecover and norebalance flags, did ceph osd crush remove osd.$ID, ceph 
> auth del osd.$ID, and ceph osd rm osd.$ID and used ceph-volume locally to 
> create the new LVM target. Then unset the norecover and norebalance flags and 
> it backfilled like normal.
> 
> I initially ran into issues with specifying --osd.id <http://osd.id/> causing 
> my osd’s to fail to start, but removing that I was able to get it to fill in 
> the gap of the OSD I just removed.
> 
> I’m now doing quicker, more destructive migrations in an attempt to reduce 
> data movement.
> This way I don’t read from OSD I’m replacing, write to other OSD temporarily, 
> read back from temp OSD, write back to ‘new’ OSD.
> I’m just reading from replica and writing to ‘new’ OSD.
> 
> So I’m setting the norecover and norebalance flags, down the OSD (but not 
> out, it stays in, also have the noout flag set), destroy/zap, recreate using 
> ceph-volume, unset the flags, and it starts backfilling.
> For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time 
> to offload it and then backfill back from them. I trust my disks enough to 
> backfill from the other disks, and its going well. Also seeing very good 
> write performance backfilling compared to previous drive replacements in 
> filestore, so thats very promising.
> 
> Reed
> 
>> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen <jmozd...@nde.ag 
>> <mailto:jmozd...@nde.ag>> wrote:
>> 
>> Hi Alfredo,
>> 
>> thank you for your comments:
>> 
>> Zitat von Alfredo Deza <ad...@redhat.com <mailto:ad...@redhat.com>>:
>>> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen <jmozd...@nde.ag 
>>> <mailto:jmozd...@nde.ag>> wrote:
>>>> Dear *,
>>>> 
>>>> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
>>>> keeping the OSD number? There have been a number of messages on the list,
>>>> reporting problems, and my experience is the same. (Removing the existing
>>>> OSD and creating a new one does work for me.)
>>>> 
>>>> I'm working on an Ceph 12.2.2 cluster and tried following
>>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>>>>  
>>>> <http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd>
>>>> - this basically says
>>>> 
>>>> 1. destroy old OSD
>>>> 2. zap the disk
>>>> 3. prepare the new OSD
>>>> 4. activate the new OSD
>>>> 
>>>> I never got step 4 to complete. The closest I got was by doing the 
>>>> following
>>>> ste

Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Reed Dier
I am in the process of migrating my OSDs to bluestore finally and thought I 
would give you some input on how I am approaching it.
Some of the saga you can find in another ML thread here: 
https://www.spinics.net/lists/ceph-users/msg41802.html 


My first OSD I was cautious, and I outed the OSD without downing it, allowing 
it to move data off.
Some background on my cluster, for this OSD, it is an 8TB spinner, with an NVMe 
partition previously used for journaling in filestore, intending to be used for 
block.db in bluestore.

Then I downed it, flushed the journal, destroyed it, zapped with ceph-volume, 
set norecover and norebalance flags, did ceph osd crush remove osd.$ID, ceph 
auth del osd.$ID, and ceph osd rm osd.$ID and used ceph-volume locally to 
create the new LVM target. Then unset the norecover and norebalance flags and 
it backfilled like normal.

I initially ran into issues with specifying --osd-id causing my OSDs to fail 
to start, but removing that I was able to get it to fill in the gap of the OSD 
I just removed.

I’m now doing quicker, more destructive migrations in an attempt to reduce data 
movement.
This way I don’t read from OSD I’m replacing, write to other OSD temporarily, 
read back from temp OSD, write back to ‘new’ OSD.
I’m just reading from replica and writing to ‘new’ OSD.

So I’m setting the norecover and norebalance flags, down the OSD (but not out, 
it stays in, also have the noout flag set), destroy/zap, recreate using 
ceph-volume, unset the flags, and it starts backfilling.
For 8TB disks, and with 23 other 8TB disks in the pool, it takes a long time to 
offload it and then backfill back from them. I trust my disks enough to 
backfill from the other disks, and it's going well. Also seeing very good write 
performance backfilling compared to previous drive replacements in filestore, 
so that's very promising.
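
For the archives, the full sequence I am running per OSD boils down to roughly the 
below. Treat it as a sketch of my environment only (osd.0 on /dev/sda with block.db 
on /dev/nvme0n1p5 are just my values), and sanity check it against the bluestore 
migration docs before trusting it with your own data.

ID=0 ; DATA=/dev/sda ; DB=/dev/nvme0n1p5

systemctl stop ceph-osd@${ID}.service      # down the OSD
ceph-osd -i ${ID} --flush-journal          # flush the old filestore journal
ceph-volume lvm zap ${DATA}                # wipe the old filestore data disk

ceph osd set norecover ; ceph osd set norebalance

ceph osd crush remove osd.${ID}
ceph auth del osd.${ID}
ceph osd rm osd.${ID}

# recreate without --osd-id; the new OSD fills the gap left by the old one
ceph-volume lvm create --bluestore --data ${DATA} --block.db ${DB}

ceph osd unset norebalance ; ceph osd unset norecover
# backfill then repopulates the new OSD from the surviving replicas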

Reed

> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen  wrote:
> 
> Hi Alfredo,
> 
> thank you for your comments:
> 
> Zitat von Alfredo Deza >:
>> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen  wrote:
>>> Dear *,
>>> 
>>> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
>>> keeping the OSD number? There have been a number of messages on the list,
>>> reporting problems, and my experience is the same. (Removing the existing
>>> OSD and creating a new one does work for me.)
>>> 
>>> I'm working on an Ceph 12.2.2 cluster and tried following
>>> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
>>> - this basically says
>>> 
>>> 1. destroy old OSD
>>> 2. zap the disk
>>> 3. prepare the new OSD
>>> 4. activate the new OSD
>>> 
>>> I never got step 4 to complete. The closest I got was by doing the following
>>> steps (assuming OSD ID "999" on /dev/sdzz):
>>> 
>>> 1. Stop the old OSD via systemd (osd-node # systemctl stop
>>> ceph-osd@999.service)
>>> 
>>> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>>> 
>>> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
>>> volume group
>>> 
>>> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>>> 
>>> 4. destroy the old OSD (osd-node # ceph osd destroy 999
>>> --yes-i-really-mean-it)
>>> 
>>> 5. create a new OSD entry (osd-node # ceph osd new $(cat
>>> /var/lib/ceph/osd/ceph-999/fsid) 999)
>> 
>> Step 5 and 6 are problematic if you are going to be trying ceph-volume
>> later on, which takes care of doing this for you.
>> 
>>> 
>>> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
>>> osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
>>> /var/lib/ceph/osd/ceph-999/keyring)
> 
> I at first tried to follow the documented steps (without my steps 5 and 6), 
> which did not work for me. The documented approach failed with "init 
> authentication >> failed: (1) Operation not permitted", because actually 
> ceph-volume did not add the auth entry for me.
> 
> But even after manually adding the authentication, the "ceph-volume" approach 
> failed, as the OSD was still marked "destroyed" in the osdmap epoch as used 
> by ceph-osd (see the commented messages from ceph-osd.999.log below).
> 
>>> 
>>> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
>>> --osd-id 999 --data /dev/sdzz)
>> 
>> You are going to hit a bug in ceph-volume that is preventing you from
>> specifying the osd id directly if the ID has been destroyed.
>> 
>> See http://tracker.ceph.com/issues/22642 
>> 
> 
> If I read that bug description correctly, you're confirming why I needed step 
> #6 above (manually adding the OSD auth entry). But even if ceph-volume had 
> added it, the ceph-osd.log entries suggest that starting the OSD would still 
> have failed, because of accessing the wrong osdmap epoch.
> 
> To me it seems like I'm 

Re: [ceph-users] Ceph MGR Influx plugin 12.2.2

2018-01-11 Thread Reed Dier
This morning I went through and enabled the influx plugin in ceph-mgr on 12.2.2, 
so far so good.

Only non-obvious step was installing the python-influxdb package that it 
depends on. Probably needs to be baked into the documentation somewhere.
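
For anyone else flipping it on, the whole thing boiled down to something like the 
following. The config-key names are from memory and may differ between releases, so 
treat them as an example and double check them against the module docs:

apt-get install python-influxdb              # dependency on every mgr node
ceph mgr module enable influx                # enable the plugin

# point it at the InfluxDB instance (example values)
ceph config-key set mgr/influx/hostname influx.example.com
ceph config-key set mgr/influx/username ceph
ceph config-key set mgr/influx/password secret
ceph config-key set mgr/influx/database ceph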

Other than that, 90% of the stats I use are in this, and a few breakdowns of my 
existing statistics are available now.

If I had to make a wishlist of stats I wish were part of this:
PG state stats - number of PGs active, clean, scrubbing, scrubbing-deep, 
backfilling, recovering, etc
Pool ops - we have pool level rd/wr_bytes, would love to see pool level 
rd/wr_ops as well.
Cluster level object state stats - Total Objects, Degraded, Misplaced, Unfound, 
etc
daemon (osd/mon/mds/mgr) state stats - total, up, in, active, degraded/failed, 
quorum, etc
osd recovery_bytes - recovery bytes to complement ops (like ceph -s provides)

Otherwise, this seems to be a much better approach than CollectD for data 
collection and shipping as it eliminates the middleman and puts the mgr daemons 
to work.

Love to see the ceph-mgr daemons grow in capability like this, take load off 
the mons, and provide more useful functionality.

Thanks,
Reed

> On Jan 11, 2018, at 10:02 AM, Benjeman Meekhof <bmeek...@umich.edu> wrote:
> 
> Hi Reed,
> 
> Someone in our group originally wrote the plugin and put in PR.  Since
> our commit the plugin was 'forward-ported' to master and made
> incompatible with Luminous so we've been using our own version of the
> plugin while waiting for the necessary pieces to be back-ported to
> Luminous to use the modified upstream version.  Now we are in the
> process of trying out the back-ported version that is in 12.2.2 as
> well as adding some additional code from our version that collects pg
> summary information (count of active, etc) and supports sending to
> multiple influx destinations.  We'll attempt to PR any changes we
> make.
> 
> So to answer your question:  Yes, we use it but not exactly the
> version from upstream in production yet.  However in our testing the
> module included with 12.2.2 appears to work as expected and we're
> planning to move over to it and do any future work based from the
> version in the upstream Ceph tree.
> 
> There is one issue/bug that may still exist exist:  because of how the
> data point timestamps are written inside a loop through OSD stats the
> spread is sometimes wide enough that Grafana doesn't group properly
> and you get the appearance of extreme spikes in derivative calculation
> of rates.  We ended up modifying our code to calculate timestamps just
> outside the loops that create data points and apply it to every point
> created in loops through stats.  Of course we'll feed that back
> upstream when we get to it and assuming it is still an issue in the
> current code.
> 
> thanks,
> Ben
> 
> On Thu, Jan 11, 2018 at 2:04 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>> Hi all,
>> 
>> Does anyone have any idea if the influx plugin for ceph-mgr is stable in
>> 12.2.2?
>> 
>> Would love to ditch collectd and report directly from ceph if that is the
>> case.
>> 
>> Documentation says that it is added in Mimic/13.x, however it looks like
>> from an earlier ML post that it would be coming to Luminous.
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021302.html
>> 
>> I also see it as a disabled module currently:
>> 
>> $ ceph mgr module ls
>> {
>>"enabled_modules": [
>>"dashboard",
>>"restful",
>>"status"
>>],
>>"disabled_modules": [
>>"balancer",
>>"influx",
>>"localpool",
>>"prometheus",
>>"selftest",
>>"zabbix"
>>]
>> }
>> 
>> 
>> Curious if anyone has been using it in place of CollectD/Telegraf for
>> feeding InfluxDB with statistics.
>> 
>> Thanks,
>> 
>> Reed
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph MGR Influx plugin 12.2.2

2018-01-10 Thread Reed Dier
Hi all,

Does anyone have any idea if the influx plugin for ceph-mgr is stable in 12.2.2?

Would love to ditch collectd and report directly from ceph if that is the case.

Documentation says that it is added in Mimic/13.x, however it looks like from 
an earlier ML post that it would be coming to Luminous.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021302.html 


I also see it as a disabled module currently:
> $ ceph mgr module ls
> {
> "enabled_modules": [
> "dashboard",
> "restful",
> "status"
> ],
> "disabled_modules": [
> "balancer",
> "influx",
> "localpool",
> "prometheus",
> "selftest",
> "zabbix"
> ]
> }


Curious if anyone has been using it in place of CollectD/Telegraf for feeding 
InfluxDB with statistics.

Thanks,

Reed___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Bluestore Migration Issues

2018-01-09 Thread Reed Dier
After removing the --osd-id flag, everything came up normally.

>  -2        21.82448 host node24
>   0   hdd   7.28450     osd.0     up  1.0  1.0
>   8   hdd   7.26999     osd.8     up  1.0  1.0
>  16   hdd   7.26999     osd.16    up  1.0  1.0


Given the vanilla-ness of this ceph-volume command, is this something 
ceph-deploy-able?

I’m seeing ceph-deploy 1.5.39 as the latest stable release.

> ceph-deploy --username root disk zap $NODE:$HDD
> ceph-deploy --username root osd create $NODE:$HDD:$SSD

In that example $HDD is the main OSD device, and $SSD is the NVMe partition I 
want to use for block.db (and block.wal). Or is the syntax different from the 
filestore days?
And I am assuming that no --bluestore would be necessary given that I am 
reading that bluestore is the default and filestore requires intervention.
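
In the meantime, my understanding (a sketch only, I have not verified it through 
ceph-deploy) is that the per-node ceph-volume equivalent of that old two-step is 
simply:

ceph-volume lvm zap $HDD                                        # wipe the old disk
ceph-volume lvm create --bluestore --data $HDD --block.db $SSD  # bluestore is the default anyway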

Thanks,

Reed

> On Jan 9, 2018, at 2:10 PM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
>> -2        21.81000 host node24
>>  0   hdd   7.26999     osd.0  destroyed        0  1.0
>>  8   hdd   7.26999     osd.8         up  1.0  1.0
>> 16   hdd   7.26999     osd.16        up  1.0  1.0
> 
> Should I do these prior to running without the osd-id specified?
>> # ceph osd crush remove osd.$ID
>> # ceph auth del osd.$ID
>> # ceph osd rm osd.$ID
> 
> 
> And then it fill in the missing osd.0.
> Will set norebalance flag first to prevent data reshuffle upon the osd being 
> removed from the crush map.
> 
> Thanks,
> 
> Reed
> 
>> On Jan 9, 2018, at 2:05 PM, Alfredo Deza <ad...@redhat.com 
>> <mailto:ad...@redhat.com>> wrote:
>> 
>> On Tue, Jan 9, 2018 at 2:19 PM, Reed Dier <reed.d...@focusvq.com 
>> <mailto:reed.d...@focusvq.com>> wrote:
>>> Hi ceph-users,
>>> 
>>> Hoping that this is something small that I am overlooking, but could use the
>>> group mind to help.
>>> 
>>> Ceph 12.2.2, Ubuntu 16.04 environment.
>>> OSD (0) is an 8TB spinner (/dev/sda) and I am moving from a filestore
>>> journal to a block.db and WAL device on an NVMe partition (/dev/nvme0n1p5).
>>> 
>>> I have an OSD that I am trying to convert to bluestore and running into some
>>> trouble.
>>> 
>>> Started here until the ceph-volume create statement, which doesn’t work.
>>> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/ 
>>> <http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/>
>>> Worth mentioning I also flushed the journal on the nvme partition before
>>> nuking the OSD.
>>> 
>>> $ sudo ceph-osd -i 0 --flush-journal
>>> 
>>> 
>>> So I first started with this command:
>>> 
>>> $ sudo ceph-volume lvm create --bluestore --data /dev/sda --block.db
>>> /dev/nvme0n1p5 --osd-id 0
>>> 
>>> 
>>> Pastebin to the ceph-volume log: https://pastebin.com/epkM3aP6
>>> 
>>> However the OSD doesn’t start.
>> 
>> I was just able to replicate this by using an ID that doesn't exist in
>> the cluster. On a cluster with just one OSD (with an ID of 0) I
>> created
>> an OSD with --osd-id 3, and had the exact same results.
>> 
>>> 
>>> Pastebin to ceph-osd log: https://pastebin.com/9qEsAJzA 
>>> <https://pastebin.com/9qEsAJzA>
>>> 
>>> I tried restarting the process, by deleting the LVM structures, zapping the
>>> disk using ceph-volume.
>>> This time using prepare and activate instead of create.
>>> 
>>> $ sudo ceph-volume lvm prepare --bluestore --data /dev/sda --block.db
>>> /dev/nvme0n1p5 --osd-id 0
>>> 
>>> $ sudo ceph-volume lvm activate --bluestore 0
>>> 227e1721-cd2e-4d7e-bb48-bc2bb715a038
>>> 
>>> 
>>> Also ran the enable on the ceph-volume systemd unit per
>>> http://docs.ceph.com/docs/master/install/manual-deployment/ 
>>> <http://docs.ceph.com/docs/master/install/manual-deployment/>
>>> 
>>> $ sudo systemctl enable
>>> ceph-volume@lvm-0-227e1721-cd2e-4d7e-bb48-bc2bb715a038
>>> 
>>> 
>>> Same results.
>>> 
>>> Any help is greatly appreciated.
>> 
>> Could you try without passing --osd-id ?
>>> 
>>> Thanks,
>>> 
>>> Reed
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Bluestore Migration Issues

2018-01-09 Thread Reed Dier
> -2        21.81000 host node24
>  0   hdd   7.26999     osd.0  destroyed        0  1.0
>  8   hdd   7.26999     osd.8         up  1.0  1.0
> 16   hdd   7.26999     osd.16        up  1.0  1.0

Should I do these prior to running without the osd-id specified?
> # ceph osd crush remove osd.$ID
> # ceph auth del osd.$ID
> # ceph osd rm osd.$ID


And then it fill in the missing osd.0.
Will set norebalance flag first to prevent data reshuffle upon the osd being 
removed from the crush map.

Thanks,

Reed

> On Jan 9, 2018, at 2:05 PM, Alfredo Deza <ad...@redhat.com> wrote:
> 
> On Tue, Jan 9, 2018 at 2:19 PM, Reed Dier <reed.d...@focusvq.com 
> <mailto:reed.d...@focusvq.com>> wrote:
>> Hi ceph-users,
>> 
>> Hoping that this is something small that I am overlooking, but could use the
>> group mind to help.
>> 
>> Ceph 12.2.2, Ubuntu 16.04 environment.
>> OSD (0) is an 8TB spinner (/dev/sda) and I am moving from a filestore
>> journal to a block.db and WAL device on an NVMe partition (/dev/nvme0n1p5).
>> 
>> I have an OSD that I am trying to convert to bluestore and running into some
>> trouble.
>> 
>> Started here until the ceph-volume create statement, which doesn't work.
>> http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/
>> Worth mentioning I also flushed the journal on the nvme partition before
>> nuking the OSD.
>> 
>> $ sudo ceph-osd -i 0 --flush-journal
>> 
>> 
>> So I first started with this command:
>> 
>> $ sudo ceph-volume lvm create --bluestore --data /dev/sda --block.db
>> /dev/nvme0n1p5 --osd-id 0
>> 
>> 
>> Pastebin to the ceph-volume log: https://pastebin.com/epkM3aP6
>> 
>> However the OSD doesn’t start.
> 
> I was just able to replicate this by using an ID that doesn't exist in
> the cluster. On a cluster with just one OSD (with an ID of 0) I
> created
> an OSD with --osd-id 3, and had the exact same results.
> 
>> 
>> Pastebin to ceph-osd log: https://pastebin.com/9qEsAJzA
>> 
>> I tried restarting the process, by deleting the LVM structures, zapping the
>> disk using ceph-volume.
>> This time using prepare and activate instead of create.
>> 
>> $ sudo ceph-volume lvm prepare --bluestore --data /dev/sda --block.db
>> /dev/nvme0n1p5 --osd-id 0
>> 
>> $ sudo ceph-volume lvm activate --bluestore 0
>> 227e1721-cd2e-4d7e-bb48-bc2bb715a038
>> 
>> 
>> Also ran the enable on the ceph-volume systemd unit per
>> http://docs.ceph.com/docs/master/install/manual-deployment/
>> 
>> $ sudo systemctl enable
>> ceph-volume@lvm-0-227e1721-cd2e-4d7e-bb48-bc2bb715a038
>> 
>> 
>> Same results.
>> 
>> Any help is greatly appreciated.
> 
> Could you try without passing --osd-id ?
>> 
>> Thanks,
>> 
>> Reed
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD Bluestore Migration Issues

2018-01-09 Thread Reed Dier
Hi ceph-users,

Hoping that this is something small that I am overlooking, but could use the 
group mind to help.

Ceph 12.2.2, Ubuntu 16.04 environment.
OSD (0) is an 8TB spinner (/dev/sda) and I am moving from a filestore journal 
to a block.db and WAL device on an NVMe partition (/dev/nvme0n1p5).

I have an OSD that I am trying to convert to bluestore and running into some 
trouble.

Started here until the ceph-volume create statement, which doesn’t work. 
http://docs.ceph.com/docs/master/rados/operations/bluestore-migration/ 

Worth mentioning I also flushed the journal on the nvme partition before nuking 
the OSD.
> $ sudo ceph-osd -i 0 --flush-journal


So I first started with this command:
> $ sudo ceph-volume lvm create --bluestore --data /dev/sda --block.db 
> /dev/nvme0n1p5 --osd-id 0


Pastebin to the ceph-volume log: https://pastebin.com/epkM3aP6 


However the OSD doesn’t start.

Pastebin to ceph-osd log: https://pastebin.com/9qEsAJzA 


I tried restarting the process, by deleting the LVM structures, zapping the 
disk using ceph-volume.
This time using prepare and activate instead of create.
> $ sudo ceph-volume lvm prepare --bluestore --data /dev/sda --block.db 
> /dev/nvme0n1p5 --osd-id 0
> $ sudo ceph-volume lvm activate --bluestore 0 
> 227e1721-cd2e-4d7e-bb48-bc2bb715a038

Also ran the enable on the ceph-volume systemd unit per 
http://docs.ceph.com/docs/master/install/manual-deployment/ 

> $ sudo systemctl enable ceph-volume@lvm-0-227e1721-cd2e-4d7e-bb48-bc2bb715a038

Same results.

Any help is greatly appreciated.

Thanks,

Reed___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume lvm deactivate/destroy/zap

2018-01-09 Thread Reed Dier
I would just like to mirror what Dan van der Ster’s sentiments are.

As someone attempting to move an OSD to bluestore, with limited/no LVM 
experience, it is a completely different beast and complexity level compared to 
the ceph-disk/filestore days.

ceph-deploy was a very simple tool that did exactly what I was looking to do, 
but now we have deprecated ceph-disk halfway into a release, ceph-deploy 
doesn’t appear to fully support ceph-volume, which is now the official way to 
manage OSDs moving forward.

My ceph-volume create statement ‘succeeded’ but the OSD doesn’t start, so now I 
am trying to zap the disk to try to recreate the OSD, and the zap is failing as 
Dan’s did.

And yes, I was able to get it zapped using the lvremove, vgremove, pvremove 
commands, but that is not obvious to someone who hasn’t used LVM extensively 
for storage management before.
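
For reference, the by-hand teardown of a whole-disk OSD ended up being roughly the 
following; the VG/LV names are placeholders, so check lvs/vgs for the actual 
ceph-generated names before removing anything:

lvs ; vgs                                # find the ceph-* VG/LV backing the OSD
lvremove -f <ceph-vg>/<osd-block-lv>     # drop the logical volume
vgremove <ceph-vg>                       # drop the volume group
pvremove /dev/sdX                        # clear the PV label on the raw disk
ceph-volume lvm zap /dev/sdX             # wipe any remaining signatures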

I also want to mirror Dan’s sentiments about the unnecessary complexity imposed 
on what I expect is the default use case of an entire disk being used. I can’t 
see anything other than the ‘entire disk’ method being the largest use case for 
users of Ceph, especially smaller clusters trying to maximize hardware spend.

Just wanted to piggy back this thread to echo Dan’s frustration.

Thanks,

Reed

> On Jan 8, 2018, at 10:41 AM, Alfredo Deza  wrote:
> 
> On Mon, Jan 8, 2018 at 10:53 AM, Dan van der Ster  > wrote:
>> On Mon, Jan 8, 2018 at 4:37 PM, Alfredo Deza  wrote:
>>> On Thu, Dec 21, 2017 at 11:35 AM, Stefan Kooman  wrote:
 Quoting Dan van der Ster (d...@vanderster.com):
> Thanks Stefan. But isn't there also some vgremove or lvremove magic
> that needs to bring down these /dev/dm-... devices I have?
 
 Ah, you want to clean up properly before that. Sure:
 
 lvremove -f <vg>/<lv>
 vgremove <vg>
 pvremove /dev/ceph-device (should wipe labels)
 
 So ideally there should be a ceph-volume lvm destroy / zap option that
 takes care of this:
 
 1) Properly remove LV/VG/PV as shown above
 2) wipefs to get rid of LVM signatures
 3) dd zeroes to get rid of signatures that might still be there
>>> 
>>> ceph-volume does have a 'zap' subcommand, but it does not remove
>>> logical volumes or groups. It is intended to leave those in place for
>>> re-use. It uses wipefs, but
>>> not in a way that would end up removing LVM signatures.
>>> 
>>> Docs for zap are at: http://docs.ceph.com/docs/master/ceph-volume/lvm/zap/
>>> 
>>> The reason for not attempting removal is that an LV might not be a
>>> 1-to-1 device to volume group. It is being suggested here to "vgremove <vg>"
>>> but what if the group has several other LVs that should not get
>>> removed? Similarly, what if the logical volume is not a single PV but
>>> many?
>>> 
>>> We believe that these operations should be up to the administrator
>>> with better context as to what goes where and what (if anything)
>>> really needs to be removed
>>> from LVM.
>> 
>> Maybe I'm missing something, but aren't most (almost all?) use-cases just
>> 
>>   ceph-volume lvm create /dev/
> 
> No
>> 
>> ? Or do you expect most deployments to do something more complicated with 
>> lvm?
>> 
> 
> Yes, we do. For example dmcache, which to ceph-volume looks like a
> plain logical volume, but it can vary in how it is implemented
> behind the scenes
> 
>> In that above whole-disk case, I think it would be useful to have a
>> very simple cmd to tear down whatever ceph-volume created, so that
>> ceph admins don't need to reverse engineer what ceph-volume is doing
>> with lvm.
> 
> Right, that would work if that was the only supported way of dealing
> with lvm. We aren't imposing this, we added it as a convenience if a
> user did not want
> to deal with lvm at all. LVM has a plethora of ways to create an LV,
> and we don't want to either restrict users to our view of LVM or
> attempt to understand all the many different
> ways that may be and assume some behavior is desired (like removing a VG)
> 
>> 
>> Otherwise, perhaps it would be useful to document the expected normal
>> lifecycle of an lvm osd: create, failure / replacement handling,
>> decommissioning.
>> 
>> Cheers, Dan
>> 
>> 
>> 
>>> 
 
 Gr. Stefan
 
 --
 | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
 | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS log jam prevention

2017-12-07 Thread Reed Dier
> You can try doubling (several times if necessary) the MDS configs
> `mds_log_max_segments` and `mds_log_max_expiring` to make it more
> aggressively trim its journal. (That may not help since your OSD
> requests are slow.)


This may be obvious, but where is this mds_log located, and what are the 
bottlenecks for it to get behind?

Is this something that is located on an OSD? Is this located in the metadata 
pool (which I had previously moved to live on SSDs rather than colocated on the 
HDDs that the filesystem pool lives on)?
Just curious what would be the bottleneck in the MDS trying to trim this log.

I see the MDS process on the active MDS with decent CPU util, but not any real 
disk traffic from the MDS side.

So just looking to see what is bounding me with the trimming to keep it so far 
behind (I increased both max segments and max expiring by 4x each).
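
For completeness, the bump itself was just runtime injection along these lines, with 
the values below being examples over the 30/20 defaults (and the same settings go in 
ceph.conf under [mds] to persist across restarts):

ceph tell mds.mdsdb injectargs '--mds_log_max_segments 120 --mds_log_max_expiring 80'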

Thanks,

Reed

> On Dec 5, 2017, at 4:02 PM, Patrick Donnelly <pdonn...@redhat.com> wrote:
> 
> On Tue, Dec 5, 2017 at 8:07 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>> Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD
>> backed CephFS pool.
>> 
>> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running
>> mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and
>> clients.
> 
> You should try a newer kernel client if possible since the MDS is
> having trouble trimming its cache.
> 
>> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing
>> to respond to cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub
>> flag(s) set; application not enabled on 1 pool(s); 242 slow requests are
>> blocked > 32 sec; 769378 stuck requests are blocked > 4096 sec
>> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
>>     mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by
>> clients, 1 stray files
>> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
>> pressure
>>     mdsdb(mds.0): Many clients (37) failing to respond to cache pressure
>> client_count: 37
>> MDS_TRIM 1 MDSs behind on trimming
>>     mdsdb(mds.0): Behind on trimming (36252/30) max_segments: 30,
>> num_segments: 36252
> 
> See also: http://tracker.ceph.com/issues/21975
> 
> You can try doubling (several times if necessary) the MDS configs
> `mds_log_max_segments` and `mds_log_max_expiring` to make it more
> aggressively trim its journal. (That may not help since your OSD
> requests are slow.)
> 
> -- 
> Patrick Donnelly

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS log jam prevention

2017-12-05 Thread Reed Dier
Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD 
backed CephFS pool.

Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running mix 
of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and clients.

> $ ceph versions
> {
> "mon": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 3
> },
> "mgr": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 3
> },
> "osd": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 74
> },
> "mds": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 2
> },
> "overall": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 82
> }
> }

> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing to
> respond to cache pressure; 1 MDSs behind on trimming; noout,nodeep-scrub
> flag(s) set; application not enabled on 1 pool(s); 242 slow requests are
> blocked > 32 sec; 769378 stuck requests are blocked > 4096 sec
> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
>     mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by
> clients, 1 stray files
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
> pressure
>     mdsdb(mds.0): Many clients (37) failing to respond to cache pressure
> client_count: 37
> MDS_TRIM 1 MDSs behind on trimming
>     mdsdb(mds.0): Behind on trimming (36252/30) max_segments: 30,
> num_segments: 36252
> OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
> REQUEST_SLOW 242 slow requests are blocked > 32 sec
> 236 ops are blocked > 2097.15 sec
> 3 ops are blocked > 1048.58 sec
> 2 ops are blocked > 524.288 sec
> 1 ops are blocked > 32.768 sec
> REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
> 91 ops are blocked > 67108.9 sec
> 121258 ops are blocked > 33554.4 sec
> 308189 ops are blocked > 16777.2 sec
> 251586 ops are blocked > 8388.61 sec
> 88254 ops are blocked > 4194.3 sec
> osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
> osds 4,7,9,10,11,14,18,20 have stuck requests > 33554.4 sec
> osd.13 has stuck requests > 67108.9 sec

This is across 8 nodes, holding 3x 8TB HDD’s each, all backed by Intel P3600 
NVMe drives for journaling.
Removed SSD OSD’s for brevity.

> $ ceph osd tree
> ID  CLASS WEIGHT    TYPE NAME                   STATUS REWEIGHT PRI-AFF
> -13        87.28799 root ssd
>  -1       174.51500 root default
> -10       174.51500     rack default.rack2
> -55        43.62000         chassis node2425
>  -2        21.81000             host node24
>   0   hdd   7.26999                 osd.0           up  1.0     1.0
>   8   hdd   7.26999                 osd.8           up  1.0     1.0
>  16   hdd   7.26999                 osd.16          up  1.0     1.0
>  -3        21.81000             host node25
>   1   hdd   7.26999                 osd.1           up  1.0     1.0
>   9   hdd   7.26999                 osd.9           up  1.0     1.0
>  17   hdd   7.26999                 osd.17          up  1.0     1.0
> -56        43.63499         chassis node2627
>  -4        21.81999             host node26
>   2   hdd   7.27499                 osd.2           up  1.0     1.0
>  10   hdd   7.26999                 osd.10          up  1.0     1.0
>  18   hdd   7.27499                 osd.18          up  1.0     1.0
>  -5        21.81499             host node27
>   3   hdd   7.26999                 osd.3           up  1.0     1.0
>  11   hdd   7.26999                 osd.11          up  1.0     1.0
>  19   hdd   7.27499                 osd.19          up  1.0     1.0
> -57        43.62999         chassis node2829
>  -6        21.81499             host node28
>   4   hdd   7.26999                 osd.4           up  1.0     1.0
>  12   hdd   7.26999                 osd.12          up  1.0     1.0
>  20   hdd   7.27499                 osd.20          up  1.0     1.0
>  -7        21.81499             host node29
>   5   hdd   7.26999                 osd.5           up  1.0     1.0
>  13   hdd   7.26999                 osd.13          up  1.0     1.0
>  21   hdd   7.27499                 osd.21          up  1.0     1.0
> -58        43.62999         chassis node3031
>  -8        21.81499             host node30
>   6   hdd   7.26999                 osd.6           up  1.0     1.0
>  14   hdd   7.26999                 osd.14          up  1.0     1.0
>  22   hdd   7.27499                 osd.22

Re: [ceph-users] CephFS metadata pool to SSDs

2017-10-13 Thread Reed Dier
As always, appreciate the help and knowledge of the collective ML mind.

> If you aren't using DC SSDs and this is prod, then I wouldn't recommend 
> moving towards this model. 

These are Samsung SM863a’s and Micron 5100 MAXs, all roughly 6-12 months old, 
with the most worn drive showing 23 P/E cycles so far.

Thanks again,

Reed

> On Oct 12, 2017, at 4:18 PM, John Spray <jsp...@redhat.com> wrote:
> 
> On Thu, Oct 12, 2017 at 9:34 PM, Reed Dier <reed.d...@focusvq.com> wrote:
>> I found an older ML entry from 2015 and not much else, mostly detailing
>> performance testing done to dispel poor performance numbers presented by
>> the OP.
>> 
>> Currently have the metadata pool on my slow 24 HDDs, and am curious if I
>> should see any increased performance with CephFS by moving the metadata pool
>> onto SSD medium.
> 
> It depends a lot on the workload.
> 
> The primary advantage of moving metadata to dedicated drives
> (especially SSDs) is that it makes the system more deterministic under
> load.  The most benefit will be seen on systems which had previously
> had shared HDD OSDs that were fully saturated with data IO, and were
> consequently suffering from very slow metadata writes.
> 
> The impact will also depend on whether the metadata workload fit in
> the mds_cache_size or not: if the MDS is frequently missing its cache
> then the metadata pool latency will be more important.
> 
> On systems with plenty of spare IOPs, with non-latency-sensitive
> workloads, one might see little or no difference in performance when
> using SSDs, as those systems would typically bottleneck on the number
> of operations per second MDS daemon (CPU bound).  Systems like that
> would benefit more from multiple MDS daemons.
> 
> Then again, systems with plenty of spare IOPs can quickly become
> congested during recovery/backfill scenarios, so having SSDs for
> metadata is a nice risk mitigation to keep the system more responsive
> during bad times.
> 
>> My thought is that the SSDs are lower latency, and it removes those iops
>> from the slower spinning disks.
>> 
>> My next concern would be write amplification on the SSDs. Would this thrash
>> the SSD lifespan with tons of little writes or should it not be too heavy of
>> a workload to matter too much?
> 
> The MDS is comparatively efficient in how it writes out metadata:
> journal writes get batched up into larger IOs, and if something is
> frequently modified then it doesn't get written back every time (just
> when it falls off the end of the journal, or periodically).
> 
> If you've got SSDs that you're confident enough to use for data or
> general workloads, I wouldn't be too worried about using them for
> CephFS metadata.
> 
>> My last question from the operations standpoint, if I use:
>> # ceph osd pool set fs-metadata crush_ruleset <ruleset>
>> Will this just start to backfill the metadata pool over to the SSDs until it
>> satisfies the crush requirements for size and failure domains and not skip a
>> beat?
> 
> On a healthy cluster, yes, this should just work.  The level of impact
> you see will depend on how much else you're trying to do with the
> system.  The prioritization of client IO vs. backfill IO has been
> improved in luminous, so you should use luminous if you can.
> 
> Because the overall size of the metadata pool is small, the smart
> thing is probably to find a time that is quiet for your system, and do
> the crush rule change at that time to get it over with quickly, rather
> than trying to do it during normal operations.
> 
> Cheers,
> John
> 
>> 
>> Obviously things like enabling dirfrags, and multiple MDS ranks will be more
>> likely to improve performance with CephFS, but the metadata pool uses very
>> little space, and I have the SSDs already, so I figured I would explore it
>> as an option.
>> 
>> Thanks,
>> 
>> Reed
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS metadata pool to SSDs

2017-10-12 Thread Reed Dier
I found an older ML entry from 2015 and not much else, mostly detailing 
performance testing done to dispel poor performance numbers presented by the OP.

Currently have the metadata pool on my slow 24 HDDs, and am curious if I should 
see any increased performance with CephFS by moving the metadata pool onto SSD 
medium.
My thought is that the SSDs are lower latency, and it removes those iops from 
the slower spinning disks.

My next concern would be write amplification on the SSDs. Would this thrash the 
SSD lifespan with tons of little writes or should it not be too heavy of a 
workload to matter too much?

My last question from the operations standpoint, if I use:
# ceph osd pool set fs-metadata crush_ruleset <ruleset>
Will this just start to backfill the metadata pool over to the SSDs until it 
satisfies the crush requirements for size and failure domains and not skip a 
beat?
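
For context, on Luminous the pool property is crush_rule rather than crush_ruleset, 
and with device classes the whole move is roughly the two commands below; the rule 
name is just an example:

ceph osd crush rule create-replicated ssd-rule default host ssd  # replicated rule limited to the ssd class
ceph osd pool set fs-metadata crush_rule ssd-rule                # metadata PGs then backfill onto the SSDs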

Obviously things like enabling dirfrags, and multiple MDS ranks will be more 
likely to improve performance with CephFS, but the metadata pool uses very 
little space, and I have the SSDs already, so I figured I would explore it as 
an option.

Thanks,

Reed___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] min_size & hybrid OSD latency

2017-10-11 Thread Reed Dier
Just for the sake of putting this in the public forum,

In theory, by placing the primary copy of the object on an SSD medium, and 
placing replica copies on HDD medium, it should still yield some improvement in 
writes, compared to an all HDD scenario.

My logic here is rooted in the idea that the first copy requires a write, ACK, 
and then a read to send a copy to the replicas.
So instead of a slow write, and a slow read on your first hop, you have a fast 
write and fast read on the first hop, before pushing out to the slower second 
hop of 2x slow writes and ACKs.
Doubly so, if you have active io on the cluster, the SSD is taking all of the 
read io away from the slow HDDs, freeing up iops on the HDDs, which in turn 
should clear write ops quicker.

Please poke holes in this if you can.

Hopefully this will be useful for someone searching the ML.
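
For anyone wanting to try it, the usual way to express this is a two-take CRUSH rule 
roughly like the sketch below. It assumes Luminous device classes, a single root 
named default, and host failure domains; it is untested here, so run it through 
crushtool --test before loading it with ceph osd setcrushmap, and then assign it 
with ceph osd pool set <pool> crush_rule hybrid_ssd_primary.

rule hybrid_ssd_primary {
        id 5                                 # any unused rule id
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 1 type host   # first (primary) copy on an SSD host
        step emit
        step take default class hdd
        step chooseleaf firstn -1 type host  # remaining replicas on HDD hosts
        step emit
}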

Thanks,

Reed


> On Oct 10, 2017, at 6:50 PM, Christian Balzer  wrote:
> 
> All writes have to be ACKed, the only time where hybrid stuff helps is to
> accelerate reads.
> Which is something that people like me at least have very little interest
> in as the writes need to be fast. 
> 
> Christian
> 
>> (the same setup could involve some high latency OSDs, in the case of
>> country-level cluster)
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 
> 
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph monitoring

2017-10-02 Thread Reed Dier
As someone currently running collectd/influxdb/grafana stack for monitoring, I 
am curious if anyone has seen issues moving Jewel -> Luminous.

I thought I remembered reading that collectd wasn’t working perfectly in 
Luminous, likely not helped by the MGR daemon.

Also thought about trying telegraf, however permissions issues in Jewel caused 
me to punt. (see: https://github.com/influxdata/telegraf/issues/1657 
)
Looked like Telegraf could supply pool level performance ops that I wasn’t 
seeing in collectd.

Was planning on starting the Luminous upgrades this week, so this thread seemed 
a good time to ask.

> ii  collectd 5.6.2.37.gfd01cdd-1~xenial

Looks like I’m running the 5.6 branch of collectd, and I don’t see any ceph changes 
in the 5.7 branch, so I won’t immediately rock the boat by upgrading to 5.7.x.

Just curious what the early Luminous users are seeing.

Thanks,

Reed

> On Oct 2, 2017, at 2:26 PM, Erik McCormick  wrote:
> 
> On Mon, Oct 2, 2017 at 11:55 AM, Matthew Vernon  > wrote:
>> On 02/10/17 12:34, Osama Hasebou wrote:
>>> Hi Everyone,
>>> 
>>> Is there a guide/tutorial about how to setup Ceph monitoring system
>>> using collectd / grafana / graphite ? Other suggestions are welcome as
>>> well !
>> 
>> We just installed the collectd plugin for ceph, and pointed it at our
>> grahphite server; that did most of what we wanted (we also needed a
>> script to monitor wear on our SSD devices).
>> 
>> Making a dashboard is rather a matter of personal preference - we plot
>> client and s3 i/o, network, server load & CPU use, and have indicator
>> plots for numbers of osds up, and monitor quorum.
>> 
>> [I could share our dashboard JSON, but it's obviously specific to our
>> data sources]
>> 
>> Regards,
>> 
>> Matthew
>> 
>> 
> 
> I for one would love to see your dashboard. host and data source names
> can be easily replaced :)
> 
> -Erik
> 
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Watch for fstrim running on your Ubuntu systems

2017-07-06 Thread Reed Dier
I could easily see that being the case, especially with Micron as a common 
thread, but it appears that I am on the latest FW for both the SATA and the 
NVMe:

> $ sudo ./msecli -L | egrep 'Device|FW'
> Device Name  : /dev/sda
> FW-Rev   : D0MU027
> Device Name  : /dev/sdb
> FW-Rev   : D0MU027
> Device Name  : /dev/sdc
> FW-Rev   : D0MU027
> Device Name  : /dev/sdd
> FW-Rev   : D0MU027
> Device Name  : /dev/sde
> FW-Rev   : D0MU027
> Device Name  : /dev/sdf
> FW-Rev   : D0MU027
> Device Name  : /dev/sdg
> FW-Rev   : D0MU027
> Device Name  : /dev/sdh
> FW-Rev   : D0MU027
> Device Name  : /dev/sdi
> FW-Rev   : D0MU027
> Device Name  : /dev/sdj
> FW-Rev   : D0MU027
> Device Name  : /dev/nvme0
> FW-Rev   : 0091634

D0MU027 and 1634 are the latest reported FW from Micron, current as of 
04/12/2017 and 12/07/2016, respectively.

Could be the current FW doesn’t play nice, so that’s on the table. But for now, it’s 
a thread that can’t be pulled any further.

Appreciate the feedback,

Reed

> On Jul 6, 2017, at 1:18 PM, Peter Maloney 
> <peter.malo...@brockmann-consult.de> wrote:
> 
> Hey,
> 
> I have some SAS Micron S630DC-400 which came with firmware M013 which did the 
> same or worse (takes very long... 100% blocked for about 5min for 16GB 
> trimmed), and works just fine with firmware M017 (4s for 32GB trimmed). So 
> maybe you just need an update.
> 
> Peter
> 
> 
> 
> On 07/06/17 18:39, Reed Dier wrote:
>> Hi Wido,
>> 
>> I came across this ancient ML entry with no responses and wanted to follow 
>> up with you to see if you recalled any solution to this.
>> Copying the ceph-users list to preserve any replies that may result for 
>> archival.
>> 
>> I have a couple of boxes with 10x Micron 5100 SATA SSD’s, journaled on 
>> Micron 9100 NVMe SSD’s; ceph 10.2.7; Ubuntu 16.04 4.8 kernel.
>> 
>> I have noticed now twice that I’ve had SSD’s flapping due to the fstrim 
>> eating up the io 100%.
>> It eventually righted itself after a little less than 8 hours.
>> Noout flag was set, so it didn’t create any unnecessary rebalance or whatnot.
>> 
>> Timeline showing that only 1 OSD ever went down at a time, but they seemed 
>> to go down in a rolling fashion during the fstrim session.
>> You can actually see in the OSD graph all 10 OSD’s on this node go down 1 by 
>> 1 over time.
>> 
>> 
>> And the OSD’s were going down because of:
>> 
>>> 2017-07-02 13:47:32.618752 7ff612721700  1 heartbeat_map is_healthy 
>>> 'OSD::osd_op_tp thread 0x7ff5ecd0c700' had timed out after 15
>>> 2017-07-02 13:47:32.618757 7ff612721700  1 heartbeat_map is_healthy 
>>> 'FileStore::op_tp thread 0x7ff608d9e700' had timed out after 60
>>> 2017-07-02 13:47:32.618760 7ff612721700  1 heartbeat_map is_healthy 
>>> 'FileStore::op_tp thread 0x7ff608d9e700' had suicide timed out after 180
>>> 2017-07-02 13:47:32.624567 7ff612721700 -1 common/HeartbeatMap.cc: In function 
>>> 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, 
>>> time_t)' thread 7ff612721700 time 2017-07-02 13:47:32.618784
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>> 
>> 
>> I am curious if you were able to nice it or something similar to mitigate 
>> this issue?
>> Oddly, I have similar machines with Samsung SM863a’s with Intel P3700 
>> journals that do not appear to be affected by the fstrim load issue despite 
>> identical weekly cron jobs enabled. Only the Micron drives (newer) have had 
>> these issues.
>> 
>> Appreciate any pointers,
>> 
>> Reed
>> 
>>> Wido den Hollander wido at 42on.com  
>>> Tue Dec 9 01:21:16 PST 2014
>>> Hi,
>>> 
>>> Last sunday I got a call early in the morning that a Ceph cluster was
>>> having some issues. Slow requests and OSDs marking each other down.
>>> 
>>> Since this is a 100% SSD cluster I was a bit confused and started
>>> investigating.
>>> 
>>> It took me about 15 minutes to see that fstrim was running and was
>>> utilizing the SSDs 100%.
>>> 
>>>

Re: [ceph-users] Watch for fstrim running on your Ubuntu systems

2017-07-06 Thread Reed Dier
Hi Wido,

I came across this ancient ML entry with no responses and wanted to follow up 
with you to see if you recalled any solution to this.
Copying the ceph-users list to preserve any replies that may result for 
archival.

I have a couple of boxes with 10x Micron 5100 SATA SSD’s, journaled on Micron 
9100 NVMe SSD’s; ceph 10.2.7; Ubuntu 16.04 4.8 kernel.

I have noticed now twice that I’ve had SSD’s flapping due to the fstrim eating 
up the io 100%.
It eventually righted itself after a little less than 8 hours.
Noout flag was set, so it didn’t create any unnecessary rebalance or whatnot.

Timeline showing that only 1 OSD ever went down at a time, but they seemed to 
go down in a rolling fashion during the fstrim session.
You can actually see in the OSD graph all 10 OSD’s on this node go down 1 by 1 
over time.


And the OSD’s were going down because of:

> 2017-07-02 13:47:32.618752 7ff612721700  1 heartbeat_map is_healthy 
> 'OSD::osd_op_tp thread 0x7ff5ecd0c700' had timed out after 15
> 2017-07-02 13:47:32.618757 7ff612721700  1 heartbeat_map is_healthy 
> 'FileStore::op_tp thread 0x7ff608d9e700' had timed out after 60
> 2017-07-02 13:47:32.618760 7ff612721700  1 heartbeat_map is_healthy 
> 'FileStore::op_tp thread 0x7ff608d9e700' had suicide timed out after 180
> 2017-07-02 13:47:32.624567 7ff612721700 -1 common/HeartbeatMap.cc: In 
> function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, 
> const char*, time_t)' thread 7ff612721700 time 2017-07-02 13:47:32.618784
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")


I am curious if you were able to nice it or something similar to mitigate this 
issue?
Oddly, I have similar machines with Samsung SM863a’s with Intel P3700 journals 
that do not appear to be affected by the fstrim load issue despite identical 
weekly cron jobs enabled. Only the Micron drives (newer) have had these issues.
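
The plan on this end is to wrap the weekly job in idle priorities, something like the 
below; the exact cron script path and contents vary by Ubuntu release, so treat it as 
a sketch:

#!/bin/sh
# /etc/cron.weekly/fstrim (sketch): run the weekly trim at idle CPU/IO priority
# so it cannot monopolize the SSDs out from under the OSDs
exec nice -n 19 ionice -c 3 fstrim --all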

Appreciate any pointers,

Reed

> Wido den Hollander wido at 42on.com  
> 
> Tue Dec 9 01:21:16 PST 2014
> Hi,
> 
> Last sunday I got a call early in the morning that a Ceph cluster was
> having some issues. Slow requests and OSDs marking each other down.
> 
> Since this is a 100% SSD cluster I was a bit confused and started
> investigating.
> 
> It took me about 15 minutes to see that fstrim was running and was
> utilizing the SSDs 100%.
> 
> On Ubuntu 14.04 there is a weekly CRON which executes fstrim-all. It
> detects all mountpoints which can be trimmed and starts to trim those.
> 
> On the Intel SSDs used here it caused them to become 100% busy for a
> couple of minutes. That was enough for them to no longer respond on
> heartbeats, thus timing out and being marked down.
> 
> Luckily we had the "out interval" set to 1800 seconds on that cluster,
> so no OSD was marked as "out".
> 
> fstrim-all does not execute fstrim with a ionice priority. From what I
> understand, but haven't tested yet, is that running fstrim with ionice
> -c Idle should solve this.
> 
> It's weird that this issue didn't come up earlier on that cluster, but
> after killing fstrim all problems we resolved and the cluster ran
> happily again.
> 
> So watch out for fstrim on early Sunday mornings on Ubuntu!
> 
> -- 
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ideas on the UI/UX improvement of ceph-mgr: Cluster Status Dashboard

2017-06-29 Thread Reed Dier
I’d like to see per pool iops/usage, et al.

Being able to see rados vs rbd vs whatever else performance, or pools with 
different backing mediums and see which workloads result in what performance.

Most of this I pretty well cobble together with collectd, but it would still be 
nice to have out of the box going forward.

Reed

> On Jun 25, 2017, at 11:49 PM, saumay agrawal  wrote:
> 
> Hi everyone!
> 
> I am working on the improvement of the web-based dashboard for Ceph.
> My intention is to add some UI elements to visualise some performance
> counters of a Ceph cluster. This gives a better overview to the users
> of the dashboard about how the Ceph cluster is performing and, if
> necessary, where they can make necessary optimisations to get even
> better performance from the cluster.
> 
> Here is my suggestion on the two perf counters, commit latency and
> apply latency. They are visualised using line graphs. I have prepared
> UI mockups for the same.
> 1. OSD apply latency
> [https://drive.google.com/open?id=0ByXy5gIBzlhYNS1MbTJJRDhtSG8]
> 2. OSD commit latency
> [https://drive.google.com/open?id=0ByXy5gIBzlhYNElyVU00TGtHeVU]
> 
> These mockups show the latency values (y-axis) against the instant of
> time (x-axis). The latency values for different OSDs are highlighted
> using different colours. The average latency value of all OSDs is
> shown specifically in red. This representation allows the dashboard
> user to compare the performances of an OSD with other OSDs, as well as
> with the average performance of the cluster.
> 
> The line width in these graphs is specially kept less, so as to give a
> crisp and clear representation for more number of OSDs. However, this
> approach may clutter the graph and make it incomprehensible for a
> cluster having significantly higher number of OSDs. For such
> situations, we can retain only the average latency indications from
> both the graphs to make things more simple for the dashboard user.
> 
> Also, higher latency values suggest bad performance. We can come up
> with some specific values for both the counters, above which we can
> say that the cluster is performing very bad. If the value of any of
> the OSDs exceeds this value, we can highlight entire graph in a light
> red shade to draw the attention of user towards it.
> 
> I am planning to use AJAX based templates and plugins (like
> Flotcharts) for these graphs. This would allow real-time update of the
> graphs without having any need to reload the entire dashboard page.
> 
> Another feature I propose to add is the representation of the version
> distribution of all the clients in a cluster. This can be categorised
> into distribution
> 1. on the basis of ceph version
> [https://drive.google.com/open?id=0ByXy5gIBzlhYYmw5cXF2bkdTWWM] and,
> 2. on the basis of kernel version
> [https://drive.google.com/open?id=0ByXy5gIBzlhYczFuRTBTRDcwcnc]
> 
> I have used doughnut charts instead of regular pie charts, as they
> have some whitespace at their centre. This whitespace makes the chart
> appear less cluttered, while properly indicating the appropriate
> fraction of the total value. Also, we can later add some data to
> display at this centre space when we hover over a particular slice of
> the chart.
> 
> The main purpose of this visualisation is to identify any number of
> clients left behind while updating the clients of the cluster. Suppose
> a cluster has 50 clients running ceph jewel. In the process of
> updating this cluster, 40 clients get updated to ceph luminous, while
> the other 10 clients remain behind on ceph jewel. This may occur due
> to some bug or any interruption in the update process. In such
> scenarios, the user can find which clients have not been updated and
> update them according to his needs.  It may also give a clear picture
> for troubleshooting, during any package dependency issues due to the
> kernel. The clients are represented in both, absolutes numbers as well
> as the percentage of the entire cluster, for a better overview.
> 
> An interesting approach could be highlighting the older version(s)
> specifically to grab the attention of the user. For example, a user
> running ceph jewel may not need to update as necessarily compared to
> the user running ceph hammer.
> 
> As of now, I am looking for plugins in AdminLTE to implement these two
> elements in the dashboard. I would like to have feedbacks and
> suggestions on these two from the ceph community, on how can I make
> them more informative about the cluster.
> 
> Also a request to the various ceph users and developers. It would be
> great if you could share the various metrics you are using as a
> performance indicator for your cluster, and how you are using them.
> Any metrics being used to identify the issues in a cluster can also be
> shared.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> 

Re: [ceph-users] Changing SSD Landscape

2017-06-08 Thread Reed Dier
I did stumble across Samsung PM1725/a in both AIC and 2.5” U.2 form factor.

AIC starts at 1.6T and goes up to 6.4T, while 2.5” goes from 800G up to 6.4T.

The thing that caught my eye with this model is the x8 lanes in AIC, and the 
5DWPD over 5 years.

No idea how available it is, or how it compares price-wise, but compared to 
the Micron 9100 you get 5 DWPD versus 3 DWPD, which, when talking in terms of 
journal devices, could be a big difference in lifespan.

And from what I read, the PM1725a isn’t as performant as, say, the P3700 or some 
other enterprise NVMe drives like the HGST SN100, but it’s still NVMe, with leaps 
and bounds lower latency and deeper queuing compared to SATA SSDs.

Reed

> On Jun 8, 2017, at 2:43 AM, Luis Periquito <periqu...@gmail.com> wrote:
> 
> Looking at that anandtech comparison it seems the Micron usually is
> worse than the P3700.
> 
> This week I asked for a few nodes with P3700 400G and got an answer as
> they're end of sale, and the supplier wouldn't be able to get it
> anywhere in the world. Has anyone got a good replacement for these?
> 
> The official replacement is the P4600, but those start at 2T and has
> the appropriate price rise (it's slightly cheaper per GB than the
> P3700), and it hasn't been officially released yet.
> 
> The P4800X (Optane) costs about the same as the P4600 and is small...
> 
> Not really sure about the Micron 9100, and couldn't find anything
> interesting/comparable in the Samsung range...
> 
> 
> On Wed, May 17, 2017 at 5:03 PM, Reed Dier <reed.d...@focusvq.com> wrote:
>> Agreed, the issue I have seen is that the P4800X (Optane) is demonstrably
>> more expensive than the P3700 for a roughly equivalent amount of storage
>> space (400G v 375G).
>> 
>> However, the P4800X is perfectly suited to a Ceph environment, with 30 DWPD,
>> or 12.3 PBW. And on top of that, it seems to generally outperform the P3700
>> in terms of latency, iops, and raw throughput, especially at greater queue
>> depths. The biggest thing I took away was performance consistency.
>> 
>> Anandtech did a good comparison against the P3700 and the Micron 9100 MAX,
>> ironically the 9100 MAX has been the model I have been looking at to replace
>> P3700’s in future OSD nodes.
>> 
>> http://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-deep-dive-into-3d-xpoint-enterprise-performance/
>> 
>> There are also the DC P4500 and P4600 models in the pipeline from Intel,
>> also utilizing 3D NAND, however I have been told that they will not be
>> shipping in volume until mid to late Q3.
>> And as was stated earlier, these are all starting at much larger storage
>> sizes, 1-4T in size, and with respective endurance ratings of 1.79 PBW and
>> 10.49 PBW for endurance on the 2TB versions of each of those. Which should
>> equal about .5 and ~3 DWPD for most workloads.
>> 
>> At least the Micron 5100 MAX are finally shipping in volume to offer a
>> replacement to Intel S3610, though no good replacement for the S3710 yet
>> that I’ve seen on the endurance part.
>> 
>> Reed
>> 
>> On May 17, 2017, at 5:44 AM, Luis Periquito <periqu...@gmail.com> wrote:
>> 
>> Anyway, in a couple months we'll start testing the Optane drives. They
>> are small and perhaps ideal journals, or?
>> 
>> The problem with optanes is price: from what I've seen they cost 2x or
>> 3x as much as the P3700...
>> But at least from what I've read they do look really great...
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD scrub during recovery

2017-05-30 Thread Reed Dier
Thanks,

This makes sense, but just wanted to sanity check my assumption against reality.

In my specific case, 24 of the OSDs are HDD and 30 are SSD in different roots/pools, 
so deep scrubs on the other 23 spinning disks could in theory eat IOPS on a disk 
that is currently backfilling to the rebuilt OSD.
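
For now the workaround here is just flipping the cluster-wide flags for the duration 
of the backfill:

ceph osd set noscrub ; ceph osd set nodeep-scrub      # while the backfill runs
ceph osd unset noscrub ; ceph osd unset nodeep-scrub  # once it finishes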

Either way, make sense, and thanks for the insight.

And don’t worry Wido, they aren’t SMR drives!

Thanks,

Reed


> On May 30, 2017, at 11:03 AM, Wido den Hollander <w...@42on.com> wrote:
> 
>> 
>> Op 30 mei 2017 om 17:37 schreef Reed Dier <reed.d...@focusvq.com>:
>> 
>> 
>> Lost an OSD and having to rebuild it.
>> 
>> 8TB drive, so it has to backfill a ton of data.
>> Been taking a while, so looked at ceph -s and noticed that deep/scrubs were 
>> running even though I’m running newest Jewel (10.2.7) and OSD’s have the 
>> osd_scrub_during_recovery set to false.
>> 
>>> $ cat /etc/ceph/ceph.conf | grep scrub | grep recovery
>>> osd_scrub_during_recovery = false
>> 
>>> $ sudo ceph daemon osd.0 config show | grep scrub | grep recovery
>>>"osd_scrub_during_recovery": "false”,
>> 
>>> $ ceph --version
>>> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>> 
>>>cluster edeb727e-c6d3-4347-bfbb-b9ce7f60514b
>>> health HEALTH_WARN
>>>133 pgs backfill_wait
>>>10 pgs backfilling
>>>143 pgs degraded
>>>143 pgs stuck degraded
>>>143 pgs stuck unclean
>>>143 pgs stuck undersized
>>>143 pgs undersized
>>>recovery 22081436/1672287847 objects degraded (1.320%)
>>>recovery 20054800/1672287847 objects misplaced (1.199%)
>>>noout flag(s) set
>>> monmap e1: 3 mons at 
>>> {core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0}
>>>election epoch 4234, quorum 0,1,2 core,dev,db
>>>  fsmap e5013: 1/1/1 up {0=core=up:active}, 1 up:standby
>>> osdmap e27892: 54 osds: 54 up, 54 in; 143 remapped pgs
>>>flags noout,nodeep-scrub,sortbitwise,require_jewel_osds
>>>  pgmap v13840713: 4292 pgs, 6 pools, 59004 GB data, 564 Mobjects
>>>159 TB used, 69000 GB / 226 TB avail
>>>22081436/1672287847 objects degraded (1.320%)
>>>20054800/1672287847 objects misplaced (1.199%)
>>>4143 active+clean
>>> 133 active+undersized+degraded+remapped+wait_backfill
>>>  10 active+undersized+degraded+remapped+backfilling
>>>   6 active+clean+scrubbing+deep
>>> recovery io 21855 kB/s, 346 objects/s
>>>  client io 30021 kB/s rd, 1275 kB/s wr, 291 op/s rd, 62 op/s wr
>> 
>> Looking at the ceph documentation for ‘master'
>> 
>>> osd scrub during recovery
>>> 
>>> Description:Allow scrub during recovery. Setting this to false will 
> disable scheduling new scrub (and deep-scrub) while there is active 
>>> recovery. Already running scrubs will be continued. This might be useful to 
>>> reduce load on busy clusters.
>>> Type:   Boolean
>>> Default:true
>> 
>> 
>> Are backfills not treated as recovery operations? Is it only preventing 
>> scrubs on the OSD’s that are actively recovering/backfilling?
>> 
>> Just curious as to why the feature did not seem to kick in as expected.
> 
> It is per OSD. So only on that OSD new (deep-)scrubs will not be started as 
> long as a recovery/backfill operation is active there.
> 
> So other OSDs which have nothing to do with it will still perform scrubs.
> 
> Wido
> 
>> 
>> Thanks,
>> 
>> Reed___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD scrub during recovery

2017-05-30 Thread Reed Dier
Lost an OSD and having to rebuild it.

8TB drive, so it has to backfill a ton of data.
Been taking a while, so looked at ceph -s and noticed that deep/scrubs were 
running even though I’m running newest Jewel (10.2.7) and OSD’s have the 
osd_scrub_during_recovery set to false.

> $ cat /etc/ceph/ceph.conf | grep scrub | grep recovery
> osd_scrub_during_recovery = false

> $ sudo ceph daemon osd.0 config show | grep scrub | grep recovery
> "osd_scrub_during_recovery": "false”,

> $ ceph --version
> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)

> cluster edeb727e-c6d3-4347-bfbb-b9ce7f60514b
>  health HEALTH_WARN
> 133 pgs backfill_wait
> 10 pgs backfilling
> 143 pgs degraded
> 143 pgs stuck degraded
> 143 pgs stuck unclean
> 143 pgs stuck undersized
> 143 pgs undersized
> recovery 22081436/1672287847 objects degraded (1.320%)
> recovery 20054800/1672287847 objects misplaced (1.199%)
> noout flag(s) set
>  monmap e1: 3 mons at 
> {core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0}
> election epoch 4234, quorum 0,1,2 core,dev,db
>   fsmap e5013: 1/1/1 up {0=core=up:active}, 1 up:standby
>  osdmap e27892: 54 osds: 54 up, 54 in; 143 remapped pgs
> flags noout,nodeep-scrub,sortbitwise,require_jewel_osds
>   pgmap v13840713: 4292 pgs, 6 pools, 59004 GB data, 564 Mobjects
> 159 TB used, 69000 GB / 226 TB avail
> 22081436/1672287847 objects degraded (1.320%)
> 20054800/1672287847 objects misplaced (1.199%)
> 4143 active+clean
>  133 active+undersized+degraded+remapped+wait_backfill
>   10 active+undersized+degraded+remapped+backfilling
>6 active+clean+scrubbing+deep
> recovery io 21855 kB/s, 346 objects/s
>   client io 30021 kB/s rd, 1275 kB/s wr, 291 op/s rd, 62 op/s wr

Looking at the ceph documentation for ‘master'

> osd scrub during recovery
> 
> Description:  Allow scrub during recovery. Setting this to false will disable 
> scheduling new scrub (and deep-scrub) while there is active recovery. Already 
> running scrubs will be continued. This might be useful to reduce load on busy 
> clusters.
> Type: Boolean
> Default:  true


Are backfills not treated as recovery operations? Is it only preventing scrubs 
on the OSD’s that are actively recovering/backfilling?

Just curious as to why the feature did not seem to kick in as expected.

Thanks,

Reed___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing SSD Landscape

2017-05-18 Thread Reed Dier
> BTW, you asked about Samsung parts earlier. We are running these
> SM863's in a block storage cluster:
> 
> Model Family: Samsung based SSDs
> Device Model: SAMSUNG MZ7KM240HAGR-0E005
> Firmware Version: GXM1003Q
> 
>  
> 177 Wear_Leveling_Count     0x0013   094   094   005    Pre-fail  Always       -       2195
> 
> The problem is that I don't know how to see how many writes have gone
> through these drives.
> 
> But maybe they're EOL anyway?
> 
> Cheers, Dan

I have SM863a 1.9T’s in an all SSD pool.

Model Family: Samsung based SSDs
Device Model: SAMSUNG MZ7KM1T9HMJP-5

The easiest way to read the number of ‘drive writes’ is the WLC/177 attribute, 
where ‘VALUE’ is the normalized wear figure counting down from 100, and 
‘RAW_VALUE’ is your actual average Program/Erase cycle count, aka your drive 
writes.

> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1758
> 177 Wear_Leveling_Count     0x0013   099   099   005    Pre-fail  Always       -       7


So for this particular drive, the NAND has on average been fully written 7 times.

The 1.9T SM863 is rated at 12.32 PBW, with a warranty period of 5 years, so 
~3.6 DWPD, or ~6,500 drive writes for the total life of the drive.

Now your drive shows 2,195 PE cycles, which would be about 33% of the total PE 
cycles it’s rated for. I’m guessing that some of the NAND may have higher PE 
cycles than others, and that the raw value reported may be the max rather than 
the average.

Intel reports the min/avg/max on their drives using isdct.

> $ sudo isdct show -smart ad -intelssd 0
> 
> - SMART Attributes PHMD_400AGN -
> - AD -
> AverageEraseCycles : 256
> Description : Wear Leveling Count
> ID : AD
> MaximumEraseCycles : 327
> MinimumEraseCycles : 188
> Normalized : 98
> Raw : 1099533058236

This is a P3700, one of the oldest in use. So this one has seen ~2% of its life 
expectancy usage, where some NAND has seen 75% more PE cycles than others.

Would be curious what the raw value for Samsung is reporting, but that’s an easy 
way to gauge drive writes.
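
If anyone wants to pull that quickly, something along these lines should print 
the raw PE-cycle count on most of these SATA models (attribute number can differ 
by vendor, so treat it as a starting point rather than gospel):

$ sudo smartctl -A /dev/sdX | awk '$1 == 177 {print $2, "raw PE cycles:", $NF}'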

Reed

> On May 18, 2017, at 3:30 AM, Dan van der Ster  wrote:
> 
> On Thu, May 18, 2017 at 3:11 AM, Christian Balzer  > wrote:
>> On Wed, 17 May 2017 18:02:06 -0700 Ben Hines wrote:
>> 
>>> Well, ceph journals are of course going away with the imminent bluestore.
>> Not really, in many senses.
>> 
> 
> But we should expect far fewer writes to pass through the RocksDB and
> its WAL, right? So perhaps lower endurance flash will be usable.
> 
> BTW, you asked about Samsung parts earlier. We are running these
> SM863's in a block storage cluster:
> 
> Model Family: Samsung based SSDs
> Device Model: SAMSUNG MZ7KM240HAGR-0E005
> Firmware Version: GXM1003Q
> 
>   9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       9971
> 177 Wear_Leveling_Count     0x0013   094   094   005    Pre-fail  Always       -       2195
> 241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       701300549904
> 242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       20421265
> 251 NAND_Writes             0x0032   100   100   000    Old_age   Always       -       1148921417736
> 
> The problem is that I don't know how to see how many writes have gone
> through these drives.
> Total_LBAs_Written appears to be bogus -- it's based on time. It
> matches exactly the 3.6DWPD spec'd for that model:
>  3.6*240GB*9971 hours = 358.95TB
>  701300549904 LBAs * 512Bytes/LBA = 359.06TB
> 
> If we trust Wear_Leveling_Count then we're only dropping 6% in a year
> -- these should be good.
> 
> But maybe they're EOL anyway?
> 
> Cheers, Dan
> 
>>> Are small SSDs still useful for something with Bluestore?
>>> 
>> Of course, the WAL and other bits for the rocksdb, read up on it.
>> 
>> On top of that is the potential to improve things further with things
>> like bcache.
>> 
>>> For speccing out a cluster today that is a many 6+ months away from being
>>> required, which I am going to be doing, i was thinking all-SSD would be the
>>> way to go. (or is all-spinner performant with Bluestore?) Too early to make
>>> that call?
>>> 
>> Your call and funeral with regards to all spinners (depending on your
>> needs).
>> Bluestore at the very best of circumstances could double your IOPS, but
>> there are other factors at play and most people who NEED SSD journals now
>> would want something with SSDs in Bluestore as well.
>> 
>> If you're planning to actually deploy a (entirely) Bluestore cluster in
>> production with mission critical data before next year, you're a lot
>> braver than me.
>> An early adoption scheme with Bluestore nodes being in their own failure
>> domain (rack) would 

Re: [ceph-users] Changing SSD Landscape

2017-05-17 Thread Reed Dier
Agreed, the issue I have seen is that the P4800X (Optane) is demonstrably more 
expensive than the P3700 for a roughly equivalent amount of storage space (400G 
v 375G).

However, the P4800X is perfectly suited to a Ceph environment, with 30 DWPD, or 
12.3 PBW. And on top of that, it seems to generally outperform the P3700 in 
terms of latency, iops, and raw throughput, especially at greater queue depths. 
The biggest thing I took away was performance consistency.

Anandtech did a good comparison against the P3700 and the Micron 9100 MAX, 
ironically the 9100 MAX has been the model I have been looking at to replace 
P3700’s in future OSD nodes.

http://www.anandtech.com/show/11209/intel-optane-ssd-dc-p4800x-review-a-deep-dive-into-3d-xpoint-enterprise-performance/
 


There are also the DC P4500 and P4600 models in the pipeline from Intel, also 
utilizing 3D NAND, however I have been told that they will not be shipping in 
volume until mid to late Q3.
And as was stated earlier, these are all starting at much larger storage sizes, 
1-4T in size, and with respective endurance ratings of 1.79 PBW and 10.49 PBW 
for endurance on the 2TB versions of each of those. Which should equal about .5 
and ~3 DWPD for most workloads.

At least the Micron 5100 MAX are finally shipping in volume to offer a 
replacement to Intel S3610, though no good replacement for the S3710 yet that 
I’ve seen on the endurance part.

Reed

> On May 17, 2017, at 5:44 AM, Luis Periquito  wrote:
> 
>>> Anyway, in a couple months we'll start testing the Optane drives. They
>>> are small and perhaps ideal journals, or?
>>> 
> The problem with optanes is price: from what I've seen they cost 2x or
> 3x as much as the P3700...
> But at least from what I've read they do look really great...
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-26 Thread Reed Dier
Hi Adam,

How did you settle on the P3608 vs, say, the P3600 or P3700 for journals? And why 
the 1.6T size? It seems overkill, unless it’s pulling double duty beyond OSD 
journals.

The only improvement over the P3600/P3700 is the move from x4 to x8 lanes on the 
PCIe bus, while those drives offer much more in terms of endurance, and at lower 
prices compared to the P3608.
How big are your journal sizes, or are you over provisioning to increase 
endurance on the card?

It would seem the new P4800X will be a perfect journaling device, with >30 DWPD 
and even lower latency; even though it is a “low” storage size, 375GB would still 
hold 15 25GB journals, which seems excessively large.
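
For anyone curious, when ceph-disk/ceph-deploy is carving the journal partition, 
the size it creates just comes from ceph.conf, e.g. something like:

[osd]
osd journal size = 10240    ; in MB, so ~10G journals, purely as an example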

Reed

> On Apr 26, 2017, at 10:20 AM, Chris Apsey  wrote:
> 
> Adam,
> 
> Before we deployed our cluster, we did extensive testing on all kinds of 
> SSDs, from consumer-grade TLC SATA all the way to Enterprise PCI-E NVME 
> Drives.  We ended up going with a ratio of 1x Intel P3608 PCI-E 1.6 TB to 12x 
> HGST 10TB SAS3 HDDs.  It provided the best price/performance/density balance 
> for us overall.  As a frame of reference, we have 384 OSDs spread across 16 
> nodes.
> 
> A few (anecdotal) notes:
> 
> 1. Consumer SSDs have unpredictable performance under load; write latency can 
> go from normal to unusable with almost no warning.  Enterprise drives 
> generally show much less load sensitivity.
> 2. Write endurance; while it may appear that having several consumer-grade 
> SSDs backing a smaller number of OSDs will yield better longevity than an 
> enterprise grade SSD backing a larger number of OSDs, the reality is that 
> enterprise drives that use SLC or eMLC are generally an order of magnitude 
> more reliable when all is said and done.
> 3. Power Loss protection (PLP).  Consumer drives generally don't do well when 
> power is suddenly lost.  Yes, we should all have UPS, etc., but things 
> happen.  Enterprise drives are much more tolerant of environmental failures.  
> Recovering from misplaced objects while also attempting to serve clients is 
> no fun.
> 
> 
> 
> 
> 
> ---
> v/r
> 
> Chris Apsey
> bitskr...@bitskrieg.net
> https://www.bitskrieg.net
> 
> On 2017-04-26 10:53, Adam Carheden wrote:
>> What I'm trying to get from the list is /why/ the "enterprise" drives
>> are important. Performance? Reliability? Something else?
>> The Intel was the only one I was seriously considering. The others were
>> just ones I had for other purposes, so I thought I'd see how they fared
>> in benchmarks.
>> The Intel was the clear winner, but my tests did show that throughput
>> tanked with more threads. Hypothetically, if I was throwing 16 OSDs at
>> it, all with osd op threads = 2, do the benchmarks below not show that
>> the Hynix would be a better choice (at least for performance)?
>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>> the single drive leaves more bays free for OSD disks, but is there any
>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>> mean:
>> a) fewer OSDs go down if the SSD fails
>> b) better throughput (I'm speculating that the S3610 isn't 4 times
>> faster than the S3520)
>> c) load spread across 4 SATA channels (I suppose this doesn't really
>> matter since the drives can't throttle the SATA bus).
>> --
>> Adam Carheden
>> On 04/26/2017 01:55 AM, Eneko Lacunza wrote:
>>> Adam,
>>> What David said before about SSD drives is very important. I will tell
>>> you another way: use enterprise grade SSD drives, not consumer grade.
>>> Also, pay attention to endurance.
>>> The only suitable drive for Ceph I see in your tests is SSDSC2BB150G7,
>>> and probably it isn't even the most suitable SATA SSD disk from Intel;
>>> better use S3610 o S3710 series.
>>> Cheers
>>> Eneko
>>> El 25/04/17 a las 21:02, Adam Carheden escribió:
 On 04/25/2017 11:57 AM, David wrote:
> On 19 Apr 2017 18:01, "Adam Carheden"  > wrote:
> Does anyone know if XFS uses a single thread to write to it's
> journal?
> You probably know this but just to avoid any confusion, the journal in
> this context isn't the metadata journaling in XFS, it's a separate
> journal written to by the OSD daemons
 Ha! I didn't know that.
> I think the number of threads per OSD is controlled by the 'osd op
> threads' setting which defaults to 2
 So the ideal (for performance) CEPH cluster would be one SSD per HDD
 with 'osd op threads' set to whatever value fio shows as the optimal
 number of threads for that drive then?
> I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps
> consider going up to a 37xx and putting more OSDs on it. Of course with
> the caveat that you'll lose more OSDs if it goes down.
 Why would you avoid the SanDisk and Hynix? Reliability (I think those
 two are both TLC)? Brand trust? If it's my benchmarks in my previous
 email, why 

Re: [ceph-users] Adding New OSD Problem

2017-04-25 Thread Reed Dier
Others will likely be able to provide some better responses, but I’ll take a 
shot to see if anything makes sense.

With 10.2.6 you should be able to set ‘osd scrub during recovery’ to false to 
prevent any new scrubs from starting during a recovery event. Current scrubs 
will complete, but future scrubs will not begin until recovery has completed.

Also, adding just one OSD on the new server, assuming all 6 are ready(?), will 
cause a good deal of unnecessary data reshuffling as you add the remaining OSD’s.
And on top of that, assuming the pool’s crush ruleset is ‘chooseleaf firstn 0 
type host’, it will create a bit of an unbalanced weighting. Any reason 
you aren’t bringing in all 6 OSD’s at once?
You should be able to set the noscrub, nodeep-scrub, norebalance, nobackfill, and 
norecover flags (you probably also want noout so a flapping OSD doesn’t trigger a 
rebalance), wait for in-flight scrubs to complete (especially deep scrubs), add 
your 6 OSD’s, then unset the recovery/rebalance/backfill flags; the data will then 
move only once, hopefully without the scrub load on top. After recovery completes, 
unset the scrub flags and you’re back to normal.
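
Roughly, that sequence as commands (flag names as of Jewel, worth double checking 
against your version):

$ ceph osd set noout
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
$ ceph osd set norebalance
$ ceph osd set nobackfill
$ ceph osd set norecover
  ... wait for in-flight scrubs to finish, then add all 6 OSDs ...
$ ceph osd unset norecover
$ ceph osd unset nobackfill
$ ceph osd unset norebalance
  ... wait for recovery/backfill to complete ...
$ ceph osd unset nodeep-scrub
$ ceph osd unset noscrub
$ ceph osd unset noout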

Caveat, no VM’s running on my cluster, but those seem like low hanging fruit 
for possible load lightening during a rebalance.

Reed

> On Apr 25, 2017, at 3:47 PM, Ramazan Terzi  wrote:
> 
> Hello,
> 
> I have a Ceph Cluster with specifications below:
> 3 x Monitor node
> 6 x Storage Node (6 disk per Storage Node, 6TB SATA Disks, all disks have SSD 
> journals)
> Distributed public and private networks. All NICs are 10Gbit/s
> osd pool default size = 3
> osd pool default min size = 2
> 
> Ceph version is Jewel 10.2.6.
> 
> Current health status:
>cluster 
> health HEALTH_OK
> monmap e9: 3 mons at 
> {ceph-mon01=xxx:6789/0,ceph-mon02=xxx:6789/0,ceph-mon03=xxx:6789/0}
>election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
> osdmap e1512: 36 osds: 36 up, 36 in
>flags sortbitwise,require_jewel_osds
>  pgmap v7698673: 1408 pgs, 5 pools, 37365 GB data, 9436 kobjects
>83871 GB used, 114 TB / 196 TB avail
>1408 active+clean
> 
> My cluster is active and a lot of virtual machines running on it (Linux and 
> Windows VM's, database clusters, web servers etc).
> 
> When I want to add a new storage node with 1 disk, I'm getting huge problems. 
> With new osd, crushmap updated and Ceph Cluster turns into recovery mode. 
> Everything is OK. But after a while, some runnings VM's became unmanageable. 
> Servers become unresponsive one by one. Recovery process would take an 
> average of 20 hours. For this reason, I removed the new osd. Recovery process 
> completed and everythink become normal.
> 
> When new osd added, health status:
>cluster 
> health HEALTH_WARN
>91 pgs backfill_wait
>1 pgs backfilling
>28 pgs degraded
>28 pgs recovery_wait
>28 pgs stuck degraded
>recovery 2195/18486602 objects degraded (0.012%)
>recovery 1279784/18486602 objects misplaced (6.923%)
> monmap e9: 3 mons at 
> {ceph-mon01=xxx:6789/0,ceph-mon02=xxx:6789/0,ceph-mon03=xxx:6789/0}
>election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
> osdmap e1512: 37 osds: 37 up, 37 in
>flags sortbitwise,require_jewel_osds
>  pgmap v7698673: 1408 pgs, 5 pools, 37365 GB data, 9436 kobjects
>83871 GB used, 114 TB / 201 TB avail
>2195/18486602 objects degraded (0.012%)
>1279784/18486602 objects misplaced (6.923%)
>1286 active+clean
>91 active+remapped+wait_backfill
>   28 active+recovery_wait+degraded
> 2 active+clean+scrubbing+deep
> 1 active+remapped+backfilling
> recovery io 430 MB/s, 119 objects/s
> client io 36174 B/s rd, 5567 kB/s wr, 5 op/s rd, 700 op/s wr
> 
> Some Ceph config parameters:
> osd_max_backfills = 1
> osd_backfill_full_ratio = 0.85
> osd_recovery_max_active = 3
> osd_recovery_threads = 1
> 
> How I can add new OSD's safely?
> 
> Best regards,
> Ramazan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Reed Dier
In this case the spinners have their journals on an NVMe drive, 3 OSD : 1 NVMe 
Journal.

Will be trying tomorrow to get some benchmarks and compare some hdd/ssd/hybrid 
workloads to see performance differences across the three backing layers.
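
Plan is probably nothing fancier than rados bench runs against a small test pool 
on each rule to start (pool name below is just a placeholder):

$ rados -p test-hybrid bench 60 write --no-cleanup
$ rados -p test-hybrid bench 60 seq
$ rados -p test-hybrid bench 60 rand
$ rados -p test-hybrid cleanup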

Most client traffic is read oriented to begin with, so keeping reads quick is 
likely the biggest goal here.

Appreciate everyone’s input and advice.

Reed

> On Apr 19, 2017, at 5:59 PM, Anthony D'Atri  wrote:
> 
> Re ratio, I think you’re right.
> 
> Write performance depends for sure on what the journal devices are.  If the 
> journals are colo’d on spinners, then for sure the affinity game isn’t going 
> to help writes massively.
> 
> My understanding of write latency is that min_size journals have to be 
> written before the op returns, so if journals aren’t on SSD’s that’s going to 
> be a big bottleneck.
> 
> 
> 
> 
>> Hi,
>> 
 Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>>> 1:4-5 is common but depends on your needs and the devices in question, ie. 
>>> assuming LFF drives and that you aren’t using crummy journals.
>> 
>> You might be speaking about different ratios here. I think that Anthony is 
>> speaking about journal/OSD and Reed speaking about capacity ratio between 
>> and HDD and SSD tier/root. 
>> 
>> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
>> HDD), like Richard says you’ll get much better random read performance with 
>> primary OSD on SSD but write performance won’t be amazing since you still 
>> have 2 HDD copies to write before ACK. 
>> 
>> I know the doc suggests using primary affinity but since it’s a OSD level 
>> setting it does not play well with other storage tiers so I searched for 
>> other options. From what I have tested, a rule that selects the 
>> first/primary OSD from the ssd-root then the rest of the copies from the 
>> hdd-root works. Though I am not sure it is *guaranteed* that the first OSD 
>> selected will be primary.
>> 
>> “rule hybrid {
>> ruleset 2
>> type replicated
>> min_size 1
>> max_size 10
>> step take ssd-root
>> step chooseleaf firstn 1 type host
>> step emit
>> step take hdd-root
>> step chooseleaf firstn -1 type host
>> step emit
>> }”
>> 
>> Cheers,
>> Maxime
>> 
>> 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Reed Dier
Hi Maxime,

This is a very interesting concept. Instead of using primary affinity to put the 
primary copy on SSD, you set the CRUSH rule to first choose an OSD in the 
‘ssd-root’, then fill the remaining copies from the ‘hdd-root’.

And with ‘step chooseleaf firstn {num}’:
> If {num} > 0 && < pool-num-replicas, choose that many buckets. 
So firstn 1 chooses a single bucket,
> If {num} < 0, it means pool-num-replicas - {num}
and firstn -1 means it will fill the remaining replicas from that root.
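
If I end up testing it, I assume injecting the rule is just the usual 
decompile/edit/recompile dance, plus pointing a pool at the new ruleset, roughly 
(the ruleset number needs to match whatever the rule declares):

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
  ... add the hybrid rule to crushmap.txt ...
$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new
$ ceph osd pool set <pool> crush_ruleset 2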

This is a very interesting concept, one I had not considered.
Really appreciate this feedback.

Thanks,

Reed

> On Apr 19, 2017, at 12:15 PM, Maxime Guyot  wrote:
> 
> Hi,
> 
>>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>> 1:4-5 is common but depends on your needs and the devices in question, ie. 
>> assuming LFF drives and that you aren’t using crummy journals.
> 
> You might be speaking about different ratios here. I think that Anthony is 
> speaking about journal/OSD and Reed speaking about capacity ratio between and 
> HDD and SSD tier/root. 
> 
> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
> HDD), like Richard says you’ll get much better random read performance with 
> primary OSD on SSD but write performance won’t be amazing since you still 
> have 2 HDD copies to write before ACK. 
> 
> I know the doc suggests using primary affinity but since it’s a OSD level 
> setting it does not play well with other storage tiers so I searched for 
> other options. From what I have tested, a rule that selects the first/primary 
> OSD from the ssd-root then the rest of the copies from the hdd-root works. 
> Though I am not sure it is *guaranteed* that the first OSD selected will be 
> primary.
> 
> “rule hybrid {
>  ruleset 2
>  type replicated
>  min_size 1
>  max_size 10
>  step take ssd-root
>  step chooseleaf firstn 1 type host
>  step emit
>  step take hdd-root
>  step chooseleaf firstn -1 type host
>  step emit
> }”
> 
> Cheers,
> Maxime
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] SSD Primary Affinity

2017-04-17 Thread Reed Dier
Hi all,

I am looking at a way to scale performance and usable space using something 
like Primary Affinity to effectively use 3x replication across 1 primary SSD 
OSD, and 2 replicated HDD OSD’s.

Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio, but 
looking to experiment with some of my non-production cluster.
Jewel 10.2.6; Ubuntu 16.04; 4.4 kernel, with some 4.8 kernel; 2x10G ethernet to 
each node.

First of all, is this even a valid architecture decision? Obviously SSD’s will 
wear quicker, but with proper endurance ratings (3-5 DWPD) and keeping an eye 
on them, it should boost performance levels considerably compared to spinning 
disks, while allowing me to provide that performance at roughly half the cost 
of all-flash to achieve the capacity levels I am looking to hit.

Currently in CRUSH, I have an all HDD root level, and an all SSD root level, 
for this ceph cluster.

I assume that I can not cross-pollinate these two root-level crush tiers to do 
quick performance benchmarks?

And I am assuming that if I wanted to actually make this work in production, I 
would set primary affinity for any SSD OSD that I want the acting copy of my 
PG’s to live on (SSD) to be 1, and set the primary affinity for my HDD OSD’s to 
be 0, and CRUSH would figure that out to get all of the data to size=3.
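
From what I can tell, the mechanics of that would just be enabling the option on 
the mons and then setting per-OSD values, something like the below (a sketch, I 
haven’t actually run this yet):

$ ceph tell mon.* injectargs '--mon_osd_allow_primary_affinity=true'
$ ceph osd primary-affinity osd.24 1.0    # SSD OSDs
$ ceph osd primary-affinity osd.0 0       # HDD OSDs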

Does anyone have this running in production?
Anyone have any comments/concerns/issues with this?
Any comparisons between this and cache-tiering?

Workload is pretty simple, mostly RADOS object store, with CephFS as well.
We have found that the 8TB HDDs were not very conducive to our workloads in 
testing, which got better with more scale, but was still very slow (even with 
NVMe journals).
And for the record, these are the Seagate Enterprise Capacity drives, so PMR, 
not SMR (ST8000NM0065).

So trying to find the easiest way that I can test/benchmark the feasibility of 
this hybrid/primary affinity architecture in the lab to get a better 
understanding moving forward.

Any insight is appreciated.

Thanks,

Reed


$ ceph osd tree
ID  WEIGHTTYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-13  52.37358 root ssd
-11  52.37358 rack ssd.rack2
-14  17.45700 host ceph00
 24   1.74599 osd.24 up  1.0  1.0
 25   1.74599 osd.25 up  1.0  1.0
 26   1.74599 osd.26 up  1.0  1.0
 27   1.74599 osd.27 up  1.0  1.0
 28   1.74599 osd.28 up  1.0  1.0
 29   1.74599 osd.29 up  1.0  1.0
 30   1.74599 osd.30 up  1.0  1.0
 31   1.74599 osd.31 up  1.0  1.0
 32   1.74599 osd.32 up  1.0  1.0
 33   1.74599 osd.33 up  1.0  1.0
-15  17.45700 host ceph01
 34   1.74599 osd.34 up  1.0  1.0
 35   1.74599 osd.35 up  1.0  1.0
 36   1.74599 osd.36 up  1.0  1.0
 37   1.74599 osd.37 up  1.0  1.0
 38   1.74599 osd.38 up  1.0  1.0
 39   1.74599 osd.39 up  1.0  1.0
 40   1.74599 osd.40 up  1.0  1.0
 41   1.74599 osd.41 up  1.0  1.0
 42   1.74599 osd.42 up  1.0  1.0
 43   1.74599 osd.43 up  1.0  1.0
-16  17.45958 host ceph02
 45   1.74599 osd.45 up  1.0  1.0
 46   1.74599 osd.46 up  1.0  1.0
 47   1.74599 osd.47 up  1.0  1.0
 48   1.74599 osd.48 up  1.0  1.0
 49   1.74599 osd.49 up  1.0  1.0
 50   1.74599 osd.50 up  1.0  1.0
 51   1.74599 osd.51 up  1.0  1.0
 52   1.74599 osd.52 up  1.0  1.0
 53   1.74599 osd.53 up  1.0  1.0
 44   1.74570 osd.44 up  1.0  1.0
-10 0 rack default.rack2
-12 0 chassis default.rack2.U16
 -1 174.51492 root default
 -2  21.81000 host node24
  0   7.26999 osd.0  up  1.0  1.0
  8   7.26999 osd.8  up  1.0  1.0
 16   7.26999 osd.16 up  

[ceph-users] Strange crush / ceph-deploy issue

2017-03-31 Thread Reed Dier
Trying to add a batch of OSD’s to my cluster (Jewel 10.2.6, Ubuntu 16.04).

2 new nodes (ceph01,ceph02), 10 OSD’s per node.

I am trying to steer the OSD’s into a different root pool with crush location 
set in ceph.conf with 
> [osd.34]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.35]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.36]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.37]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.38]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.39]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.40]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.41]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.42]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd"
> 
> [osd.43]
> crush_location = "host=ceph01 rack=ssd.rack2 root=ssd”
> 
> [osd.44]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.45]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.46]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.47]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.48]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.49]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.50]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.51]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.52]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd"
> 
> [osd.53]
> crush_location = "host=ceph02 rack=ssd.rack2 root=ssd”

Adding ceph01 and its OSDs went without a hitch.
However, ceph02 is getting completely lost, and its OSDs are getting 
zero-weighted into the bottom of the osd tree at the root level.
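
I can obviously shove them into place by hand with something like the below 
(weight mirroring the other SSD hosts), but I’d like to understand why ceph02 
isn’t being placed automatically:

$ ceph osd crush set osd.44 1.74570 root=ssd rack=ssd.rack2 host=ceph02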

> $ ceph osd tree
> ID  WEIGHTTYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -13  34.91394 root ssd
> -11  34.91394 rack ssd.rack2
> -14  17.45697 host ceph00
>  24   1.74570 osd.24 up  1.0  1.0
>  25   1.74570 osd.25 up  1.0  1.0
>  26   1.74570 osd.26 up  1.0  1.0
>  27   1.74570 osd.27 up  1.0  1.0
>  28   1.74570 osd.28 up  1.0  1.0
>  29   1.74570 osd.29 up  1.0  1.0
>  30   1.74570 osd.30 up  1.0  1.0
>  31   1.74570 osd.31 up  1.0  1.0
>  32   1.74570 osd.32 up  1.0  1.0
>  33   1.74570 osd.33 up  1.0  1.0
> -15  17.45697 host ceph01
>  34   1.74570 osd.34 up  1.0  1.0
>  35   1.74570 osd.35 up  1.0  1.0
>  36   1.74570 osd.36 up  1.0  1.0
>  37   1.74570 osd.37 up  1.0  1.0
>  38   1.74570 osd.38 up  1.0  1.0
>  39   1.74570 osd.39 up  1.0  1.0
>  40   1.74570 osd.40 up  1.0  1.0
>  41   1.74570 osd.41 up  1.0  1.0
>  42   1.74570 osd.42 up  1.0  1.0
>  43   1.74570 osd.43 up  1.0  1.0
> -16 0 host ceph02
> -10 0 rack default.rack2
> -12 0 chassis default.rack2.U16
>  -1 174.51584 root default
>  -2  21.81029 host node24
>   0   7.27010 osd.0  up  1.0  1.0
>   8   7.27010 osd.8  up  1.0  1.0
>  16   7.27010 osd.16 up  1.0  1.0
>  -3  21.81029 host node25
>   1   7.27010 osd.1  up  1.0  1.0
>   9   7.27010 osd.9  up  1.0  1.0
>  17   7.27010 osd.17 up  1.0  1.0
>  -4  21.81987 host node26
>  10   7.27010 osd.10 up  1.0  1.0
>  18   7.27489 osd.18 up  1.0  1.0
>   2   7.27489 osd.2  up  1.0  1.0
>  -5  21.81508 host node27
>   3   7.27010 osd.3  up  1.0  1.0
>  11   7.27010 osd.11 up  1.0  1.0
>  19   7.27489 osd.19 up  1.0  1.0
>  -6  21.81508 host node28
>   4   7.27010 osd.4  up  1.0  

Re: [ceph-users] Ceph PG repair

2017-03-08 Thread Reed Dier
This PG/object is still doing something rather odd.

I attempted to repair the object, and the repair supposedly ran, but now I appear 
to have less visibility.

> $ ceph health detail
> HEALTH_ERR 3 pgs inconsistent; 4 scrub errors; mds0: Many clients (20) 
> failing to respond to cache pressure; noout,sortbitwise,require_jewel_osds 
> flag(s) set
> pg 10.2d8 is active+clean+inconsistent, acting [18,17,22]
> pg 10.7bd is active+clean+inconsistent, acting [8,23,17]
> pg 17.ec is active+clean+inconsistent, acting [23,2,21]
> 4 scrub errors
> noout,sortbitwise,require_jewel_osds flag(s) set


osd.23 is the one scheduled for replacement, and it generated another read error.

However, 17.ec does not show up in the rados list-inconsistent-pg output:

> $ rados list-inconsistent-pg objects
> ["10.2d8","10.7bd”]

And examining 10.2d8 as before, I’m presented with this:

> $ rados list-inconsistent-obj 10.2d8 --format=json-pretty
> {
> "epoch": 21094,
> "inconsistents": []
> }

Yet in the logs, both the deep scrub and the repair show that the object was not 
repaired.

> $ zgrep 10.2d8 ceph-*
> ceph-osd.18.log.2.gz:2017-03-06 15:10:08.729827 7fc8dfeb8700  0 
> log_channel(cluster) log [INF] : 10.2d8 repair starts
> ceph-osd.18.log.2.gz:2017-03-06 15:13:49.793839 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 recorded data digest 0x7fa9879c != on 
> disk 0xa6798e03 on {object.name}:head
> ceph-osd.18.log.2.gz:2017-03-06 15:13:49.793941 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : repair 10.2d8 {object.name}:head on disk 
> size (15913) does not match object info size (10280) adjusted for ondisk to 
> (10280)
> ceph-osd.18.log.2.gz:2017-03-06 15:46:13.286268 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 repair 2 errors, 0 fixed
> ceph-osd.18.log.4.gz:2017-03-04 18:16:23.693057 7fc8dd6b3700  0 
> log_channel(cluster) log [INF] : 10.2d8 deep-scrub starts
> ceph-osd.18.log.4.gz:2017-03-04 18:19:25.471322 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 recorded data digest 0x7fa9879c != on 
> disk 0xa6798e03 on {object.name}:head
> ceph-osd.18.log.4.gz:2017-03-04 18:19:25.471403 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : deep-scrub 10.2d8 {object.name}:head on disk 
> size (15913) does not match object info size (10280) adjusted for ondisk to 
> (10280)
> ceph-osd.18.log.4.gz:2017-03-04 18:55:39.617841 7fc8dd6b3700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 deep-scrub 2 errors


File size and md5 still match.

> ls -la 
> /var/lib/ceph/osd/ceph-*/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> -rw-r--r-- 1 ceph ceph 15913 Mar  2 17:24 
> /var/lib/ceph/osd/ceph-17/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}

> -rw-r--r-- 1 ceph ceph 15913 Mar  2 17:24 
> /var/lib/ceph/osd/ceph-18/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> -rw-r--r-- 1 ceph ceph 15913 Mar  2 17:24 
> /var/lib/ceph/osd/ceph-22/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}

> md5sum 
> /var/lib/ceph/osd/ceph-*/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> 55a76349b758d68945e5028784c59f24  
> /var/lib/ceph/osd/ceph-17/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> 55a76349b758d68945e5028784c59f24  
> /var/lib/ceph/osd/ceph-18/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}
> 55a76349b758d68945e5028784c59f24  
> /var/lib/ceph/osd/ceph-22/current/10.2d8_head/DIR_8/DIR_D/DIR_2/DIR_4/DIR_4/DIR_A/{object.name}


So is the object actually inconsistent?
Is rados somehow behind on something, not showing the third inconsistent PG?
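
Next step on my end will probably just be manually kicking another deep scrub on 
the PGs in question and seeing whether the error count changes at all:

$ ceph pg deep-scrub 10.2d8
$ ceph pg deep-scrub 17.ec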

Appreciate any help.

Reed

> On Mar 2, 2017, at 9:21 AM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> Over the weekend, two inconsistent PG’s popped up in my cluster. This being 
> after having scrubs disabled for close to 6 weeks after a very long rebalance 
> after adding 33% more OSD’s, an OSD failing, increasing PG’s, etc.
> 
> It appears we came out the other end with 2 inconsistent PG’s and I’m trying 
> to resolve them, and not seeming to have much luck.
> Ubuntu 16.04, Jewel 10.2.5, 3x replicated pool for reference.
> 
>> $ ceph health detail
>> HEALTH_ERR 2 pgs inconsistent; 3 scrub errors; 
>> noout,sortbitwise,require_jewel_osds flag(s) set
>> pg 10.7bd is active+clean+inconsistent, acting [8,23,17]
>> pg 10.2d8 is active+clean+inconsistent, acting [18,17,22]
>> 3 scrub errors
> 
>> $ rados list-inconsistent-pg objects
>> ["10.2d8","10.7bd”]
> 
> Pretty straight forward, 2 PG’s with inconsistent copies. Lets dig deeper.
> 
>> $ rados list-inconsistent-obj 10

[ceph-users] Ceph PG repair

2017-03-02 Thread Reed Dier
Over the weekend, two inconsistent PG’s popped up in my cluster. This being 
after having scrubs disabled for close to 6 weeks after a very long rebalance 
after adding 33% more OSD’s, an OSD failing, increasing PG’s, etc.

It appears we came out the other end with 2 inconsistent PG’s and I’m trying to 
resolve them, and not seeming to have much luck.
Ubuntu 16.04, Jewel 10.2.5, 3x replicated pool for reference.

> $ ceph health detail
> HEALTH_ERR 2 pgs inconsistent; 3 scrub errors; 
> noout,sortbitwise,require_jewel_osds flag(s) set
> pg 10.7bd is active+clean+inconsistent, acting [8,23,17]
> pg 10.2d8 is active+clean+inconsistent, acting [18,17,22]
> 3 scrub errors

> $ rados list-inconsistent-pg objects
> ["10.2d8","10.7bd”]

Pretty straight forward, 2 PG’s with inconsistent copies. Lets dig deeper.

> $ rados list-inconsistent-obj 10.2d8 --format=json-pretty
> {
> "epoch": 21094,
> "inconsistents": [
> {
> "object": {
> "name": “object.name",
> "nspace": “namespace.name",
> "locator": "",
> "snap": "head"
> },
> "errors": [],
> "shards": [
> {
> "osd": 17,
> "size": 15913,
> "omap_digest": "0x",
> "data_digest": "0xa6798e03",
> "errors": []
> },
> {
> "osd": 18,
> "size": 15913,
> "omap_digest": "0x",
> "data_digest": "0xa6798e03",
> "errors": []
> },
> {
> "osd": 22,
> "size": 15913,
> "omap_digest": "0x",
> "data_digest": "0xa6798e03",
> "errors": [
> "data_digest_mismatch_oi"
> ]
> }
> ]
> }
> ]
> }

> $ rados list-inconsistent-obj 10.7bd --format=json-pretty
> {
> "epoch": 21070,
> "inconsistents": [
> {
> "object": {
> "name": “object2.name",
> "nspace": “namespace.name",
> "locator": "",
> "snap": "head"
> },
> "errors": [
> "read_error"
> ],
> "shards": [
> {
> "osd": 8,
> "size": 27691,
> "omap_digest": "0x",
> "data_digest": "0x9ce36903",
> "errors": []
> },
> {
> "osd": 17,
> "size": 27691,
> "omap_digest": "0x",
> "data_digest": "0x9ce36903",
> "errors": []
> },
> {
> "osd": 23,
> "size": 27691,
> "errors": [
> "read_error"
> ]
> }
> ]
> }
> ]
> }


So we have one PG (10.7bd) with a read error on osd.23, which is known and 
scheduled for replacement.
We also have a data digest mismatch on PG 10.2d8 on osd.22, which I have been 
attempting to repair with no real tangible results.

> $ ceph pg repair 10.2d8
> instructing pg 10.2d8 on osd.18 to repair

I’ve run the ceph pg repair command multiple times, and each time it instructs 
osd.18 to repair the PG.
Is this to say that osd.18 is the acting primary for the copies, and it’s being 
told to backfill the known-good copy of the PG over the agreed-upon wrong 
version on osd.22?

> $ zgrep 'ERR' /var/log/ceph/*
> /var/log/ceph/ceph-osd.18.log.7.gz:2017-02-23 20:45:21.561164 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 recorded data digest 0x7fa9879c != on 
> disk 0xa6798e03 on 10:1b42251f:{object.name}:head
> /var/log/ceph/ceph-osd.18.log.7.gz:2017-02-23 20:45:21.561225 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : deep-scrub 10.2d8 
> 10:1b42251f:{object.name}:head on disk size (15913) does not match object 
> info size (10280) adjusted for ondisk to (10280)
> /var/log/ceph/ceph-osd.18.log.7.gz:2017-02-23 21:05:59.935815 7fc8dfeb8700 -1 
> log_channel(cluster) log [ERR] : 10.2d8 deep-scrub 2 errors


> $ ceph pg 10.2d8 query
> {
> "state": "active+clean+inconsistent",
> "snap_trimq": "[]",
> "epoch": 21746,
> "up": [
> 18,
> 17,
> 22
> ],
> "acting": [
> 18,
> 17,
> 22
> ],
> "actingbackfill": [
> "17",
> "18",
> "22"
> ],

However, no recovery io ever occurs, and the PG never goes active+clean. Not 
seeing anything exciting in the logs of the OSD’s nor the mon’s.

I’ve found a few articles and mailing list entries that 

[ceph-users] Backfill/recovery prioritization

2017-02-01 Thread Reed Dier
Have a smallish cluster that has been expanding, with almost a 50% increase in 
the number of OSDs (16->24).

This has caused some issues with data integrity and cluster performance as we 
have increased PG count, and added OSDs.

8x nodes with 3x drives, connected over 2x10G.

My problem is that I have PG’s that have become grossly undersized 
(size=3, min_size=2), in some cases down to just 1 copy, which has created a 
deadlock of io backed up behind the PGs that don’t have enough copies.

It has been backfilling and recovering at a steady pace, but it seems that all 
backfills are weighted equally, and the more serious PG’s could be at the front 
of the queue, or at the very end of the queue, with no apparent rhyme or reason.

This has been exacerbated by a failing, but not yet failed, OSD, which I have 
marked out but left up, in an attempt to let it move its data off of itself 
gracefully and not take on new io.

I guess my question would be “is there a way to get the most important/critical 
recovery/backfill operations completed ahead of less important/critical 
backfill/recovery.” i.e., tackle the 1 copy PG’s that are blocking io, ahead of 
the less used PG’s that have 2 copies and backfilling to make their 3rd.
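
Best I’ve managed so far is just eyeballing which PGs are in the worst shape, 
e.g.:

$ ceph health detail | grep undersized
$ ceph pg dump_stuck undersized degraded

but that still doesn’t give me a way to bump those to the front of the queue.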

Thanks,

Reed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD create with SSD journal

2017-01-11 Thread Reed Dier
Interesting, I feel silly having not checked ownership of the dev device.

Will chown before next deploy and report back for sake of possibly helping 
someone else down the line.
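
Presumably just the below, plus setting the journal partition’s typecode so the 
ceph udev rules re-apply the ownership across reboots (typecode quoted from 
memory, worth double checking):

$ sudo chown ceph:ceph /dev/nvme0n1p5
$ sudo sgdisk --typecode=5:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/nvme0n1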

Thanks,

Reed
> On Jan 11, 2017, at 3:07 PM, Stillwell, Bryan J <bryan.stillw...@charter.com> 
> wrote:
> 
> On 1/11/17, 10:31 AM, "ceph-users on behalf of Reed Dier"
> <ceph-users-boun...@lists.ceph.com on behalf of reed.d...@focusvq.com>
> wrote:
> 
>>> 2017-01-03 12:10:23.514577 7f1d821f2800  0 ceph version 10.2.5
>>> (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 19754
>>> 2017-01-03 12:10:23.517465 7f1d821f2800  1
>>> filestore(/var/lib/ceph/tmp/mnt.WaQmjK) mkfs in
>>> /var/lib/ceph/tmp/mnt.WaQmjK
>>> 2017-01-03 12:10:23.517494 7f1d821f2800  1
>>> filestore(/var/lib/ceph/tmp/mnt.WaQmjK) mkfs fsid is already set to
>>> 644058d7-e1b0-4abe-92e2-43b17d75148e
>>> 2017-01-03 12:10:23.517499 7f1d821f2800  1
>>> filestore(/var/lib/ceph/tmp/mnt.WaQmjK) write_version_stamp 4
>>> 2017-01-03 12:10:23.517678 7f1d821f2800  0
>>> filestore(/var/lib/ceph/tmp/mnt.WaQmjK) backend xfs (magic 0x58465342)
>>> 2017-01-03 12:10:23.519898 7f1d821f2800  1
>>> filestore(/var/lib/ceph/tmp/mnt.WaQmjK) leveldb db exists/created
>>> 2017-01-03 12:10:23.520035 7f1d821f2800 -1
>>> filestore(/var/lib/ceph/tmp/mnt.WaQmjK) mkjournal error creating journal
>>> on /var/lib/ceph/tmp/mnt.WaQmjK/journal: (13) Permission denied
>>> 2017-01-03 12:10:23.520049 7f1d821f2800 -1 OSD::mkfs: ObjectStore::mkfs
>>> failed with error -13
>>> 2017-01-03 12:10:23.520100 7f1d821f2800 -1 ESC[0;31m ** ERROR: error
>>> creating empty object store in /var/lib/ceph/tmp/mnt.WaQmjK: (13)
>>> Permission deniedESC[0m
>> 
>> I needed up creating the OSD¹s with on-disk journals, then going back and
>> moving the journals to the NVMe partition as intended, but hoping to do
>> this all in one fell swoop, so hoping there may be some pointers on
>> something I may be doing incorrectly with ceph-deploy for the external
>> journal location. Adding a handful of OSD¹s soon, and would like to do it
>> correctly from the start.
> 
> What's the ownership of the journal device (/dev/nvme0n1p5)?
> 
> It should be owned by ceph:ceph or you'll get the permission denied errors
> message.
> 
> Bryan
> 
> E-MAIL CONFIDENTIALITY NOTICE: 
> The contents of this e-mail message and any attachments are intended solely 
> for the addressee(s) and may contain confidential and/or legally privileged 
> information. If you are not the intended recipient of this message or if this 
> message has been addressed to you in error, please immediately alert the 
> sender by reply e-mail and then delete this message and any attachments. If 
> you are not the intended recipient, you are notified that any use, 
> dissemination, distribution, copying, or storage of this message or any 
> attachment is strictly prohibited.
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD create with SSD journal

2017-01-11 Thread Reed Dier
So I was attempting to add an OSD to my ceph-cluster (running Jewel 10.2.5), 
using ceph-deploy (1.5.35), on Ubuntu.

I have 2 OSD’s on this node, attempting to add third.

The first two OSD’s I created with on-disk journals, then later moved them to 
partitions on the NVMe system disk (Intel P3600).
I carved out 3 8GB partitions on the NVMe disk for journaling purposes, only 
using two originally, with one left for one more OSD when the time came.

> [2017-01-03 12:03:53,667][node25][DEBUG ] /dev/sda :
> [2017-01-03 12:03:53,667][node25][DEBUG ]  /dev/sda2 ceph journal
> [2017-01-03 12:03:53,668][node25][DEBUG ]  /dev/sda1 ceph data, active, 
> cluster ceph, osd.1, journal /dev/nvme0n1p2
> [2017-01-03 12:03:53,668][node25][DEBUG ] /dev/sdb :
> [2017-01-03 12:03:53,668][node25][DEBUG ]  /dev/sdb2 ceph journal
> [2017-01-03 12:03:53,668][node25][DEBUG ]  /dev/sdb1 ceph data, active, 
> cluster ceph, osd.9, journal /dev/nvme0n1p4

When attempting to add the new OSD (disk /dev/sdc, journal /dev/nvme0n1p5), I 
zapped SDC, then I attempted to prepare the OSD using:
> ceph-deploy --username root osd prepare node25:sdc:/dev/nvme0n1p5

When ceph-deploy finished, I had 1 down, 1 out.
> [2017-01-03 12:08:34,229][node25][INFO  ] checking OSD status...
> [2017-01-03 12:08:34,229][node25][DEBUG ] find the location of an executable
> [2017-01-03 12:08:34,232][node25][INFO  ] Running command: /usr/bin/ceph 
> --cluster=ceph osd stat --format=json
> [2017-01-03 12:08:34,397][node25][WARNING] there is 1 OSD down
> [2017-01-03 12:08:34,397][node25][WARNING] there is 1 OSD out
> [2017-01-03 12:08:34,398][ceph_deploy.osd][DEBUG ] Host node25 is now ready 
> for osd use.

However, when I tried to activate, it would fail.

For the life of me, I’m not seeing those logs sadly.

I then tried to create, instead of prepare-activate, after zapping SDC again.
> ceph-deploy --username root osd create node25:sdc:/dev/nvme0n1p5


> [2017-01-03 12:10:26,362][node25][INFO  ] checking OSD status...
> [2017-01-03 12:10:26,363][node25][DEBUG ] find the location of an executable
> [2017-01-03 12:10:26,365][node25][INFO  ] Running command: /usr/bin/ceph 
> --cluster=ceph osd stat --format=json
> [2017-01-03 12:10:26,630][node25][WARNING] there is 1 OSD down
> [2017-01-03 12:10:26,631][ceph_deploy.osd][DEBUG ] Host node25 is now ready 
> for osd use.

This obviously brought the OSD in, but down, and I was unable to bring it up.

It appears that this failed to create the filesystem for the journal in the 
temporary directory, and thus punted before creating the final OSD directory 
in /var/lib/ceph/osd/

> 2017-01-03 12:08:30.680705 7fedbed08800  0 set uid:gid to 64045:64045 
> (ceph:ceph)
> 2017-01-03 12:08:30.680732 7fedbed08800  0 ceph version 10.2.5 
> (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 10894
> 2017-01-03 12:08:30.684279 7fedbed08800  1 
> filestore(/var/lib/ceph/tmp/mnt.00hwsQ) mkfs in /var/lib/ceph/tmp/mnt.00hwsQ
> 2017-01-03 12:08:30.684311 7fedbed08800  1 
> filestore(/var/lib/ceph/tmp/mnt.00hwsQ) mkfs fsid is already set to 
> f0bfe11f-b89d-44dd-88ca-1d62251a03e9
> 2017-01-03 12:08:30.684336 7fedbed08800  1 
> filestore(/var/lib/ceph/tmp/mnt.00hwsQ) write_version_stamp 4
> 2017-01-03 12:08:30.685565 7fedbed08800  0 
> filestore(/var/lib/ceph/tmp/mnt.00hwsQ) backend xfs (magic 0x58465342)
> 2017-01-03 12:08:30.687479 7fedbed08800  1 
> filestore(/var/lib/ceph/tmp/mnt.00hwsQ) leveldb db exists/created
> 2017-01-03 12:08:30.687775 7fedbed08800 -1 
> filestore(/var/lib/ceph/tmp/mnt.00hwsQ) mkjournal error creating journal on 
> /var/lib/ceph/tmp/mnt.00hwsQ/journal: (13) Permission denied
> 2017-01-03 12:08:30.687801 7fedbed08800 -1 OSD::mkfs: ObjectStore::mkfs 
> failed with error -13
> 2017-01-03 12:08:30.687859 7fedbed08800 -1 ESC[0;31m ** ERROR: error creating 
> empty object store in /var/lib/ceph/tmp/mnt.00hwsQ: (13) Permission 
> deniedESC[0m
> 2017-01-03 12:08:31.563884 7f6541787800  0 set uid:gid to 64045:64045 
> (ceph:ceph)
> 2017-01-03 12:08:31.563919 7f6541787800  0 ceph version 10.2.5 
> (c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 10977
> 2017-01-03 12:08:31.567261 7f6541787800  1 
> filestore(/var/lib/ceph/tmp/mnt.kRftwx) mkfs in /var/lib/ceph/tmp/mnt.kRftwx
> 2017-01-03 12:08:31.567294 7f6541787800  1 
> filestore(/var/lib/ceph/tmp/mnt.kRftwx) mkfs fsid is already set to 
> f0bfe11f-b89d-44dd-88ca-1d62251a03e9
> 2017-01-03 12:08:31.567298 7f6541787800  1 
> filestore(/var/lib/ceph/tmp/mnt.kRftwx) write_version_stamp 4
> 2017-01-03 12:08:31.567561 7f6541787800  0 
> filestore(/var/lib/ceph/tmp/mnt.kRftwx) backend xfs (magic 0x58465342)
> 2017-01-03 12:08:31.594423 7f6541787800  1 
> filestore(/var/lib/ceph/tmp/mnt.kRftwx) leveldb db exists/created
> 2017-01-03 12:08:31.594553 7f6541787800 -1 
> filestore(/var/lib/ceph/tmp/mnt.kRftwx) mkjournal error creating journal on 
> /var/lib/ceph/tmp/mnt.kRftwx/journal: (13) Permission denied
> 2017-01-03 12:08:31.594572 7f6541787800 

Re: [ceph-users] High load on OSD processes

2016-12-09 Thread Reed Dier
I don’t think there is a graceful path to downgrade.

There is a hot fix upstream I believe. My understanding is the build is being 
tested for release.

Francois Lafont posted in the other thread:

> Begin forwarded message:
> 
> From: Francois Lafont <francois.lafont.1...@gmail.com>
> Subject: Re: [ceph-users] 10.2.4 Jewel released
> Date: December 9, 2016 at 11:54:06 AM CST
> To: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
> Content-Type: text/plain; charset="us-ascii"
> 
> On 12/09/2016 06:39 PM, Alex Evonosky wrote:
> 
>> Sounds great.  May I asked what procedure you did to upgrade?
> 
> Of course. ;)
> 
> It's here: https://shaman.ceph.com/repos/ceph/wip-msgr-jewel-fix2/
> (I think this link was pointed by Greg Farnum or Sage Weil in a
> previous message).
> 
> Personally I use Ubuntu Trusty, so for me in the page above leads me
> to use this line in my "sources.list":
> 
> deb 
> http://3.chacra.ceph.com/r/ceph/wip-msgr-jewel-fix2/5d3c76c1c6e991649f0beedb80e6823606176d9e/ubuntu/trusty/flavors/default/
>  trusty main
> 
> And after that "apt-get update && apt-get upgrade" etc.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
This is obviously geared towards Ubuntu/Debian, though I’d assume there’s an rpm 
of the same build accessible.

Reed

> On Dec 9, 2016, at 4:43 PM, lewis.geo...@innoscale.net wrote:
> 
> Hi Reed,
> Yes, this was just installed yesterday and that is the version. I just 
> retested and it is exactly 15 minutes when the load starts to climb. 
>  
> So, just like Diego, do you know if there is a fix for this yet and when it 
> might be available on the repo? Should I try to install the prior minor 
> release version for now?
>  
> Thank you for the information.
>  
> Have a good day,
>  
> Lewis George
>  
>  
>  
> From: "Diego Castro" <diego.cas...@getupcloud.com>
> Sent: Friday, December 9, 2016 2:26 PM
> To: "Reed Dier" <reed.d...@focusvq.com>
> Cc: lewis.geo...@innoscale.net, ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] High load on OSD processes
>  
> Same here, is there any ETA to publish CentOS packages?
>  
>  
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>  
> 2016-12-09 18:59 GMT-03:00 Reed Dier <reed.d...@focusvq.com 
> <mailto:reed.d...@focusvq.com>>:
> Assuming you deployed within the last 48 hours, I’m going to bet you are 
> using v10.2.4 which has an issue that causes high cpu utilization.
>  
> Should see a large ramp up in load average after 15 minutes exactly.
>  
> See mailing list thread here: 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg34390.html 
> <https://www.mail-archive.com/ceph-users@lists.ceph.com/msg34390.html>
>  
> Reed
>  
>  
>> On Dec 9, 2016, at 3:25 PM, lewis.geo...@innoscale.net 
>> <mailto:lewis.geo...@innoscale.net> wrote:
>> Hello,
>> I am testing out a new node setup for us and I have configured a node in a 
>> single node cluster. It has 24 OSDs. Everything looked okay during the 
>> initial build and I was able to run the 'rados bench' on it just fine. 
>> However, if I just let the cluster sit and run for a few minutes without 
>> anything happening, the load starts to go up quickly. Each OSD device ends 
>> up using 130% CPU, with the load on the box hitting 550.00. No operations 
>> are going on, nothing shows up in the logs as happening or wrong. If I 
>> restart the OSD processes, the load stays down for a few minutes(almost at 
>> nothing) and then just jumps back up again.
>>  
>> Any idea what could cause this or a direction I can look to check it?
>>  
>> Have a good day,
>>  
>> Lewis George
>>  
>>  
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High load on OSD processes

2016-12-09 Thread Reed Dier
Assuming you deployed within the last 48 hours, I’m going to bet you are using 
v10.2.4 which has an issue that causes high cpu utilization.

Should see a large ramp up in load average after 15 minutes exactly.
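
Quick way to confirm what the OSDs are actually running, via the admin socket on 
an OSD node:

$ sudo ceph daemon osd.0 version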

See mailing list thread here: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg34390.html 


Reed


> On Dec 9, 2016, at 3:25 PM, lewis.geo...@innoscale.net wrote:
> 
> Hello,
> I am testing out a new node setup for us and I have configured a node in a 
> single node cluster. It has 24 OSDs. Everything looked okay during the 
> initial build and I was able to run the 'rados bench' on it just fine. 
> However, if I just let the cluster sit and run for a few minutes without 
> anything happening, the load starts to go up quickly. Each OSD device ends up 
> using 130% CPU, with the load on the box hitting 550.00. No operations are 
> going on, nothing shows up in the logs as happening or wrong. If I restart 
> the OSD processes, the load stays down for a few minutes(almost at nothing) 
> and then just jumps back up again.
>  
> Any idea what could cause this or a direction I can look to check it?
>  
> Have a good day,
>  
> Lewis George
>  
>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate OSD Journal to SSD

2016-12-02 Thread Reed Dier

> On Dec 1, 2016, at 6:26 PM, Christian Balzer <ch...@gol.com> wrote:
> 
> On Thu, 1 Dec 2016 18:06:38 -0600 Reed Dier wrote:
> 
>> Apologies if this has been asked dozens of times before, but most answers 
>> are from pre-Jewel days, and want to double check that the methodology still 
>> holds.
>> 
> It does.
> 
>> Currently have 16 OSD’s across 8 machines with on-disk journals, created 
>> using ceph-deploy.
>> 
>> These machines have NVMe storage (Intel P3600 series) for the system volume, 
>> and I am thinking about carving out a partition for SSD journals for the 
>> OSDs. The system volume doesn’t see heavy use, so the NVMe should have 
>> plenty of I/O headroom to support the OSD journaling, and the P3600 should 
>> have the endurance to handle the added write wear.
>> 
> Slight disconnect there, money for a NVMe (which size?) and on disk
> journals? ^_-

NVMe was already in place before the ceph project began. 400GB P3600, with 
~275GB available space after swap partition.

>> From what I’ve read, you need a partition per OSD journal, so with the 
>> probability of a third (and final) OSD being added to each node, I should 
>> create 3 partitions, each ~8GB in size (is this a good value? 8TB OSD’s, is 
>> the journal size based on size of data or number of objects, or something 
>> else?).
>> 
> Journal size is unrelated to the OSD per se, with default parameters and
> HDDs for OSDs a size of 10GB would be more than adequate, the default of
> 5GB would do as well.

I was under the impression that it was agnostic to either metric, but figured I 
should ask while I had the chance.

>> So:
>> {create partitions}
>> set noout
>> service ceph stop osd.$i
>> ceph-osd -i osd.$i --flush-journal
>> rm -f rm -f /var/lib/ceph/osd//journal
> Typo and there should be no need for -f. ^_^
> 
>> ln -s  /var/lib/ceph/osd//journal /dev/
> Even though in your case with a single(?) NVMe there is little chance for
> confusion, ALWAYS reference to devices by their UUID or similar, I prefer
> the ID:
> ---
> lrwxrwxrwx   1 root root44 May 21  2015 journal -> 
> /dev/disk/by-id/wwn-0x55cd2e404b73d570-part4
> —

Correct, would reference by UUID.
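
For the archive, the sequence with those corrections folded in would look roughly like this. It is only a sketch: the osd id, by-id path, and partition number are placeholders, the symlink lives in the OSD directory and points at the journal partition, and on Jewel the journal partition also needs to be owned by ceph:ceph (or covered by a udev rule), since ceph-disk is not managing it here:

  ceph osd set noout
  service ceph stop osd.$i
  ceph-osd -i $i --flush-journal
  rm /var/lib/ceph/osd/ceph-$i/journal
  ln -s /dev/disk/by-id/wwn-0xXXXXXXXXXXXXXXXX-partN /var/lib/ceph/osd/ceph-$i/journal
  chown ceph:ceph /dev/disk/by-id/wwn-0xXXXXXXXXXXXXXXXX-partN
  ceph-osd -i $i --mkjournal
  service ceph start osd.$i
  ceph osd unset noout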

Thanks again for the sanity check.

Reed

> 
>> ceph-osd -i osd.$i --mkjournal
>> service ceph start osd.$i
>> ceph osd unset noout
>> 
>> Does this logic appear to hold up?
>> 
> Yup.
> 
> Christian
> 
>> Appreciate the help.
>> 
>> Thanks,
>> 
>> Reed
> 
> -- 
> Christian Balzer           Network/Systems Engineer
> ch...@gol.com              Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrate OSD Journal to SSD

2016-12-01 Thread Reed Dier
Apologies if this has been asked dozens of times before, but most answers are 
from pre-Jewel days, and want to double check that the methodology still holds.

Currently have 16 OSD’s across 8 machines with on-disk journals, created using 
ceph-deploy.

These machines have NVMe storage (Intel P3600 series) for the system volume, 
and I am thinking about carving out a partition for SSD journals for the OSDs. 
The system volume doesn’t see heavy use, so the NVMe should have plenty of I/O 
headroom to support the OSD journaling, and the P3600 should have the endurance 
to handle the added write wear.

From what I’ve read, you need a partition per OSD journal, so with the 
probability of a third (and final) OSD being added to each node, I should 
create 3 partitions, each ~8GB in size (is this a good value? 8TB OSD’s, is the 
journal size based on size of data or number of objects, or something else?).
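
For reference, the "{create partitions}" step below could look roughly like this; the device name, partition numbers, and 10G size are placeholders rather than a recommendation:

  sgdisk --new=4:0:+10G --change-name=4:"ceph journal" /dev/nvme0n1
  sgdisk --new=5:0:+10G --change-name=5:"ceph journal" /dev/nvme0n1
  sgdisk --new=6:0:+10G --change-name=6:"ceph journal" /dev/nvme0n1
  partprobe /dev/nvme0n1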

So:
{create partitions}
set noout
service ceph stop osd.$i
ceph-osd -i osd.$i --flush-journal
rm -f rm -f /var/lib/ceph/osd//journal
ln -s  /var/lib/ceph/osd//journal /dev/
ceph-osd -i osd.$i --mkjournal
service ceph start osd.$i
ceph osd unset noout

Does this logic appear to hold up?

Appreciate the help.

Thanks,

Reed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS in existing pool namespace

2016-10-27 Thread Reed Dier
Looking to add CephFS into our Ceph cluster (10.2.3), and trying to plan for 
that addition.

Currently only using RADOS on a single replicated, non-EC, pool, no RBD or RGW, 
and segmenting logically in namespaces.

No auth scoping at this time, but likely something we will be moving to in the 
future as our Ceph cluster grows in size and use.

The main question at hand is bringing CephFS, by way of the kernel driver, into 
our cluster. We are trying to be more efficient with our PG enumeration, and are 
questioning whether creating a namespace in the existing pool, versus a 
completely separate pool, buys us efficiency or just unwanted complexity.

On top of that, how does the cephfs metadata pool/namespace factor into that? 
Is this even feasible?

If that isn't feasible, how do others plan their pg_num for separate cephfs 
data and metadata pools, compared to a standard object pool?

Hopefully someone has some experience with this and can comment.

TL;DR - is there a way to specify cephfs_data and cephfs_metadata ‘pools’ as a 
namespace, rather than entire pools?
$ ceph fs new <fs_name> <metadata_pool> <data_pool> --metadata-namespace <metadata_ns> --data-namespace <data_ns>
where <metadata_pool> is the name of the pool where metadata is stored, and 
<metadata_ns> the namespace within the aforementioned pool.
<data_pool> and <data_ns> are analogous with the metadata side.
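
Failing that, the fallback is plain separate pools. A sketch of what that looks like, with pool names and pg_num values purely illustrative (sized by the usual OSDs x 100 / replica-count rule of thumb, rounded to a power of two, with metadata getting a small share):

  ceph osd pool create cephfs_metadata 64
  ceph osd pool create cephfs_data 256
  ceph fs new cephfs cephfs_metadata cephfs_data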

Thanks,

Reed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-21 Thread Reed Dier

> On Oct 19, 2016, at 7:54 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:
> 
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> As others mentioned, not a good choice, but also not the (main) cause of
> your problems.
> 
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning of scrubbing.
>> 
> The former is bound to kill performance, if you care that much for your
> data but can't guarantee constant power (UPS, dual PSUs, etc), consider
> using a BBU caching controller.

I wanted to comment on this small bolded bit: in the early days of my ceph 
cluster, while testing resiliency to power failure (worst case scenario), I 
would lose an OSD to leveldb corruption whenever the on-disk write cache was 
enabled on the drives, even with a BBU.

With a BBU and no disk-level cache, the OSD would come back with no data loss, 
but performance would be significantly degraded (an xfsaild process at 99% 
iowait, cured by zapping the disk and recreating the OSD).

For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, with 
the OSD set as a single RAID0 VD. On disk journaling.

There was a noticeable hit to write performance after disabling write caching 
at the disk layer, but write-back caching at the controller layer offset enough 
of it that the data security was an acceptable trade-off.

It was a tough way to learn how important this is: the data center was struck 
by lightning two weeks after the initial ceph cluster install, and one phase of 
power was knocked out for 15 minutes, taking half of the non-dual-PSU nodes 
with it.

Just want to make sure that people learn from that painful experience.

Reed

> 
> The later I venture you did because performance was abysmal with scrubbing
> enabled.
> Which is always a good indicator that your cluster needs tuning, improving.
> 
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  
> Memory is fine, CPU I can't tell from the model number and I'm not
> inclined to look up or guess, but that usually only becomes a bottleneck
> when dealing with all SSD setup and things requiring the lowest latency
> possible.
> 
> 
>> Since I am running cephfs, I have tiering setup.
> That should read "on top of EC pools", and as John said, not a good idea
> at all, both EC pools and cache-tiering.
> 
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
> 
> This isn't a Seagate, you mean Samsung. And that's a consumer model,
> ill suited for this task, even with the DC level SSDs below as journals.
> 
> And as such a replication of 2 is also ill advised, I've seen these SSDs
> die w/o ANY warning whatsoever and long before their (abysmal) endurance
> was exhausted.
> 
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
> 
> Those are fine. If you re-do you cluster, don't put more than 4-5 journals
> on them.
> 
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>> 
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>> 
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 
>> osd.0 is the cache ssd
>> 
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>> 
>> Device:   rrqm/s  wrqm/s    r/s      w/s   rMB/s   wMB/s  avgrq-sz  avgqu-sz    await  r_await  w_await  svctm  %util
>> sdb
>>             0.00    0.33   9.00    84.33    0.96   20.11    462.40     75.92   397.56   125.67   426.58  10.70  99.90
>>             0.00    0.67  30.00    87.33    5.96   21.03    471.20     67.86   910.95    87.00  1193.99   8.27  97.07
>>             0.00   16.67  33.00   289.33    4.21   18.80    146.20     29.83    88.99    93.91    88.43   3.10  99.83
>>             0.00    7.33   7.67   261.67    1.92   19.63    163.81    117.42   331.97   182.04   336.36   3.71 100.00
>> 
>> 
>> If I look 

Re: [ceph-users] OSD won't come back "UP"

2016-10-07 Thread Reed Dier
Resolved.

Apparently it took the OSD almost 2.5 hours to fully boot.

Had not seen this behavior before, but it eventually booted itself back into 
the crush map.

Bookend log stamps below.

> 2016-10-07 21:33:39.241720 7f3d59a97800  0 set uid:gid to 64045:64045 
> (ceph:ceph)

> 2016-10-07 23:53:29.617038 7f3d59a97800  0 osd.0 4360 done with init, 
> starting boot process

I had noticed that there was a consistent read operation on the “down/out” osd 
tied to that osd’s PID, which led me to believe it was doing something with its 
time.
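
For anyone who hits this, a couple of ways to confirm the daemon is still working through init rather than hung (osd id is a placeholder):

  tail -f /var/log/ceph/ceph-osd.0.log   # init messages should keep appearing
  iotop -o                               # the osd PID should show steady reads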

Also for reference, this was a 26% full 8TB disk.
> Filesystem      1K-blocks        Used   Available  Use%  Mounted on
> /dev/sda1      7806165996  1953556296  5852609700   26%  /var/lib/ceph/osd/ceph-0

Reed


> On Oct 7, 2016, at 7:33 PM, Reed Dier <reed.d...@focusvq.com> wrote:
> 
> Attempting to adjust parameters of some of my recovery options, I restarted a 
> single osd in the cluster with the following syntax:
> 
>> sudo restart ceph-osd id=0
> 
> 
> The osd restarts without issue, status shows running with the PID.
> 
>> sudo status ceph-osd id=0
>> ceph-osd (ceph/0) start/running, process 2685
> 
> 
> The osd marked itself down cleanly.
> 
>> 2016-10-07 19:36:20.872883 mon.0 10.0.1.249:6789/0 1475867 : cluster [INF] 
>> osd.0 marked itself down
> 
>> 2016-10-07 19:36:21.590874 mon.0 10.0.1.249:6789/0 1475869 : cluster [INF] 
>> osdmap e4361: 16 osds: 15 up, 16 in
> 
> The mon’s show this from one of many subsequent attempts to restart the osd.
> 
>> 2016-10-07 19:58:16.222949 mon.1 [INF] from='client.? 10.0.1.25:0/324114592' 
>> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
>> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
>> 2016-10-07 19:58:16.223626 mon.0 [INF] from='client.6557620 :/0' 
>> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
>> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
> 
> mon logs show this when grepping for the osd.0 in the mon log
> 
>> 2016-10-07 19:36:20.872882 7fd39aced700  0 log_channel(cluster) log [INF] : 
>> osd.0 marked itself down
>> 2016-10-07 19:36:27.698708 7fd39aced700  0 log_channel(audit) log [INF] : 
>> from='client.6554095 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
>> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
>> 7.2701}]: dispatch
>> 2016-10-07 19:36:27.706374 7fd39aced700  0 mon.core@0(leader).osd e4363 
>> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
>> {host=node24,root=default}
>> 2016-10-07 19:39:30.515494 7fd39aced700  0 log_channel(audit) log [INF] : 
>> from='client.6554587 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
>> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
>> 7.2701}]: dispatch
>> 2016-10-07 19:39:30.515618 7fd39aced700  0 mon.core@0(leader).osd e4363 
>> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
>> {host=node24,root=default}
>> 2016-10-07 19:41:59.714517 7fd39b4ee700  0 log_channel(cluster) log [INF] : 
>> osd.0 out (down for 338.148761)
> 
> 
> Everything running latest Jewel release
> 
>> ceph --version
>> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> 
> Any help with this is extremely appreciated. Hoping someone has dealt with 
> this before.
> 
> Reed Dier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD won't come back "UP"

2016-10-07 Thread Reed Dier
Attempting to adjust parameters of some of my recovery options, I restarted a 
single osd in the cluster with the following syntax:

> sudo restart ceph-osd id=0


The osd restarts without issue, status shows running with the PID.

> sudo status ceph-osd id=0
> ceph-osd (ceph/0) start/running, process 2685


The osd marked itself down cleanly.

> 2016-10-07 19:36:20.872883 mon.0 10.0.1.249:6789/0 1475867 : cluster [INF] 
> osd.0 marked itself down

> 2016-10-07 19:36:21.590874 mon.0 10.0.1.249:6789/0 1475869 : cluster [INF] 
> osdmap e4361: 16 osds: 15 up, 16 in

The mon’s show this from one of many subsequent attempts to restart the osd.

> 2016-10-07 19:58:16.222949 mon.1 [INF] from='client.? 10.0.1.25:0/324114592' 
> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch
> 2016-10-07 19:58:16.223626 mon.0 [INF] from='client.6557620 :/0' 
> entity='osd.0' cmd=[{"prefix": "osd crush create-or-move", "args": 
> ["host=node24", "root=default"], "id": 0, "weight": 7.2701}]: dispatch

mon logs show this when grepping for the osd.0 in the mon log

> 2016-10-07 19:36:20.872882 7fd39aced700  0 log_channel(cluster) log [INF] : 
> osd.0 marked itself down
> 2016-10-07 19:36:27.698708 7fd39aced700  0 log_channel(audit) log [INF] : 
> from='client.6554095 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
> 7.2701}]: dispatch
> 2016-10-07 19:36:27.706374 7fd39aced700  0 mon.core@0(leader).osd e4363 
> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
> {host=node24,root=default}
> 2016-10-07 19:39:30.515494 7fd39aced700  0 log_channel(audit) log [INF] : 
> from='client.6554587 :/0' entity='osd.0' cmd=[{"prefix": "osd crush 
> create-or-move", "args": ["host=node24", "root=default"], "id": 0, "weight": 
> 7.2701}]: dispatch
> 2016-10-07 19:39:30.515618 7fd39aced700  0 mon.core@0(leader).osd e4363 
> create-or-move crush item name 'osd.0' initial_weight 7.2701 at location 
> {host=node24,root=default}
> 2016-10-07 19:41:59.714517 7fd39b4ee700  0 log_channel(cluster) log [INF] : 
> osd.0 out (down for 338.148761)


Everything running latest Jewel release

> ceph --version
> ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)

Any help with this is extremely appreciated. Hoping someone has dealt with this 
before.
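
For completeness, the daemon can also be queried directly while it sits in this state (osd id as above; the second command runs on the OSD host via the admin socket):

  ceph osd tree | grep 'osd\.0'
  ceph daemon osd.0 status     # reports its internal state, e.g. booting/active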

Reed Dier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recovery/Backfill Speedup

2016-10-04 Thread Reed Dier
Attempting to expand our small ceph cluster currently.

Have 8 nodes, 3 mons, and went from a single 8TB disk per node to 2x 8TB disks 
per node, and the rebalancing process is excruciatingly slow.

The pool was at 576 PGs before the expansion, and we wanted to let the rebalance 
finish before increasing the PG count for the single pool, as well as the 
replication size.

I have stopped scrubs for the time being and set the client and recovery io 
priorities to equal values so that client io is not burying the recovery io. I 
have also increased the number of recovery threads per osd.

> [osd]
> osd_recovery_threads = 5
> filestore_max_sync_interval = 30
> osd_client_op_priority = 32
> osd_recovery_op_priority = 32

Also, this is 10G networking we are working with, and recovery io typically 
hovers between 0 and 35 MB/s, though it is very bursty.
Disks are 8TB 7.2k SAS disks behind an LSI 3108 controller, configured as 
individual RAID0 VD’s, with pdcache disabled, but BBU backed write back caching 
enabled at the controller level.

Have thought about increasing the ‘osd_max_backfills’ as well as 
‘osd_recovery_max_active’, and possibly ‘osd_recovery_max_chunk’ to attempt to 
speed it up, but will hopefully get some insight from the community here.
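
For reference, those would be runtime injections along these lines; the values here are only a starting point and can be dialed back the same way:

  ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 10'
  ceph tell osd.* injectargs '--osd-recovery-max-chunk 16777216'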

ceph -s about 4 days in:

>  health HEALTH_WARN
> 255 pgs backfill_wait
> 4 pgs backfilling
> 385 pgs degraded
> 129 pgs recovery_wait
> 388 pgs stuck unclean
> 274 pgs undersized
> recovery 165319973/681597074 objects degraded (24.255%)
> recovery 298607229/681597074 objects misplaced (43.810%)
> noscrub,nodeep-scrub,sortbitwise flag(s) set
>  monmap e1: 3 mons at 
> {core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0}
> election epoch 190, quorum 0,1,2 core,dev,db
>  osdmap e4226: 16 osds: 16 up, 16 in; 303 remapped pgs
> flags noscrub,nodeep-scrub,sortbitwise
>   pgmap v1583742: 576 pgs, 2 pools, 6426 GB data, 292 Mobjects
> 15301 GB used, 101 TB / 116 TB avail
> 165319973/681597074 objects degraded (24.255%)
> 298607229/681597074 objects misplaced (43.810%)
>  249 active+undersized+degraded+remapped+wait_backfill
>  188 active+clean
>   85 active+recovery_wait+degraded
>   22 active+recovery_wait+degraded+remapped
>   22 active+recovery_wait+undersized+degraded+remapped
>3 active+remapped+wait_backfill
>3 active+undersized+degraded+remapped+backfilling
>3 active+degraded+remapped+wait_backfill
>1 active+degraded+remapped+backfilling
> recovery io 9361 kB/s, 415 objects/s
>   client io 597 kB/s rd, 62 op/s rd, 0 op/s wr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing a failed OSD

2016-09-14 Thread Reed Dier
Hi Jim,

This is pretty fresh in my mind so hopefully I can help you out here.

Firstly, the crush map will backfill any existing holes in the OSD enumeration. 
So assuming only one drive has been removed from the crush map, the replacement 
will come back with the same OSD number.

My steps for removing an OSD are run from the host node:

> ceph osd down osd.i
> ceph osd out osd.i
> stop ceph-osd id=i
> umount /var/lib/ceph/osd/ceph-i
> ceph osd crush remove osd.i
> ceph auth del osd.i
> ceph osd rm osd.i


From here, the disk is removed from the ceph cluster and crush map, and is 
ready for physical removal and replacement.

From there I deploy the new osd with ceph-deploy from my admin node using:

> ceph-deploy disk list nodei
> ceph-deploy disk zap nodei:sdX
> ceph-deploy --overwrite-conf osd prepare nodei:sdX


This will prepare the disk and insert it back into the crush map, bringing it 
back up and in. The OSD number should remain the same, as it will fill the gap 
left from the previous OSD removal.
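
After the prepare step finishes, a quick sanity check that the replacement landed back in the same slot and is taking data (osd id and hostname as in your own cluster):

  ceph osd tree     # the new osd should show up/in under the expected host
  ceph -s           # watch backfill move data onto the fresh disk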

Hopefully this helps,

Reed

> On Sep 14, 2016, at 11:00 AM, Jim Kilborn  wrote:
> 
> I am finishing testing our new cephfs cluster and wanted to document a failed 
> osd procedure.
> I noticed that when I pulled a drive, to simulate a failure, and run through 
> the replacement steps, the osd has to be removed from the crushmap in order 
> to initialize the new drive as the same osd number.
> 
> Is this correct that I have to remove it from the crushmap, then after the 
> osd is initialized, and mounted, add it back to the crush map? Is there no 
> way to have it reuse the same osd # without removing if from the crush map?
> 
> Thanks for taking the time….
> 
> 
> -  Jim
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD daemon randomly stops

2016-09-02 Thread Reed Dier
OSD has randomly stopped for some reason. Lots of recovery processes currently 
running on the ceph cluster. OSD log with assert below:

> -14> 2016-09-02 11:32:38.672460 7fcf65514700  5 -- op tracker -- seq: 1147, 
> time: 2016-09-02 11:32:38.672460, event: queued_for_pg, op: 
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>-13> 2016-09-02 11:32:38.672533 7fcf70d40700  5 -- op tracker -- seq: 
> 1147, time: 2016-09-02 11:32:38.672533, event: reached_pg, op: 
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>-12> 2016-09-02 11:32:38.672548 7fcf70d40700  5 -- op tracker -- seq: 
> 1147, time: 2016-09-02 11:32:38.672548, event: started, op: 
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>-11> 2016-09-02 11:32:38.672548 7fcf7cd58700  1 -- [].28:6800/27735 <== 
> mon.0 [].249:6789/0 60  pg_stats_ack(0 pgs tid 45) v1  4+0+0 (0 0 0) 
> 0x55a4443b1400 con 0x55a4434a4e80
>-10> 2016-09-02 11:32:38.672559 7fcf70d40700  1 -- [].28:6801/27735 --> 
> [].31:6801/2070838 -- osd_sub_op(unknown.0.0:0 7.d1 MIN [scrub-unreserve] v 
> 0'0 snapset=0=[]:[]) v12 -- ?+0 0x55a443aec100 con 0x55a443be0600
> -9> 2016-09-02 11:32:38.672571 7fcf70d40700  5 -- op tracker -- seq: 
> 1147, time: 2016-09-02 11:32:38.672571, event: done, op: 
> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
> -8> 2016-09-02 11:32:38.681929 7fcf7b555700  1 -- [].28:6801/27735 <== 
> osd.2 [].26:6801/9468 148  MBackfillReserve GRANT  pgid: 15.11, 
> query_epoch: 4235 v3  30+0+0 (3067148394 0 0) 0x55a4441f65a0 con 
> 0x55a4434ab200
> -7> 2016-09-02 11:32:38.682009 7fcf7b555700  5 -- op tracker -- seq: 
> 1148, time: 2016-09-02 11:32:38.682008, event: done, op: MBackfillReserve 
> GRANT  pgid: 15.11, query_epoch: 4235
> -6> 2016-09-02 11:32:38.682068 7fcf73545700  5 osd.4 pg_epoch: 4235 
> pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739 ec=732 
> les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233 pi=4002-4232/47 
> (log bound mismatch
> , actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0 
> active+undersized+degraded+remapped+wait_backfill] exit 
> Started/Primary/Active/WaitRemoteBackfillReserved 221.748180 6 0.56
> -5> 2016-09-02 11:32:38.682109 7fcf73545700  5 osd.4 pg_epoch: 4235 
> pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739 ec=732 
> les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233 pi=4002-4232/47 
> (log bound mismatch
> , actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0 
> active+undersized+degraded+remapped+wait_backfill] enter 
> Started/Primary/Active/Backfilling
> -4> 2016-09-02 11:32:38.682584 7fcf7b555700  1 -- [].28:6801/27735 <== 
> osd.6 [].30:6801/44406 171  osd pg remove(epoch 4235; pg6.19; ) v2  
> 30+0+0 (522063165 0 0) 0x55a44392f680 con 0x55a443bae100
> -3> 2016-09-02 11:32:38.682600 7fcf7b555700  5 -- op tracker -- seq: 
> 1149, time: 2016-09-02 11:32:38.682600, event: started, op: osd pg 
> remove(epoch 4235; pg6.19; )
> -2> 2016-09-02 11:32:38.682616 7fcf7b555700  5 osd.4 4235 
> queue_pg_for_deletion: 6.19
> -1> 2016-09-02 11:32:38.685425 7fcf7b555700  5 -- op tracker -- seq: 
> 1149, time: 2016-09-02 11:32:38.685421, event: done, op: osd pg remove(epoch 
> 4235; pg6.19; )
>  0> 2016-09-02 11:32:38.690487 7fcf6c537700 -1 osd/ReplicatedPG.cc: In 
> function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, 
> ThreadPool::TPHandle&)' thread 7fcf6c537700 time 2016-09-02 11:32:38.688536
> osd/ReplicatedPG.cc: 11345: FAILED assert(r >= 0)
> 
>  2016-09-02 11:32:38.711869 7fcf6c537700 -1 *** Caught signal (Aborted) **
>  in thread 7fcf6c537700 thread_name:tp_osd_recov
> 
>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (()+0x8ebb02) [0x55a402375b02]
>  2: (()+0x10330) [0x7fcfa2b51330]
>  3: (gsignal()+0x37) [0x7fcfa0bb3c37]
>  4: (abort()+0x148) [0x7fcfa0bb7028]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x265) [0x55a40246cf85]
>  6: (ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, 
> ThreadPool::TPHandle&)+0xad2) [0x55a401f4f482]
>  7: (ReplicatedPG::update_range(PG::BackfillInterval*, 
> ThreadPool::TPHandle&)+0x614) [0x55a401f4fac4]
>  8: (ReplicatedPG::recover_backfill(int, ThreadPool::TPHandle&, bool*)+0x337) 
> [0x55a401f6fc87]
>  9: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, 
> int*)+0x8a0) [0x55a401fa1160]
>  10: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x355) [0x55a401e31555]
>  11: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0xd) 
> [0x55a401e7a0dd]
>  12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x55a40245e18e]
>  13: (ThreadPool::WorkThread::entry()+0x10) [0x55a40245f070]
>  14: (()+0x8184) [0x7fcfa2b49184]
>  15: (clone()+0x6d) [0x7fcfa0c7737d]


Any help with this appreciated.
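
In the meantime, one way to capture more context on the next occurrence is to restart the OSD and then raise its debug levels at runtime; a sketch only, with the osd id taken from the log above and the levels just a suggestion:

  ceph tell osd.4 injectargs '--debug-osd 20 --debug-filestore 20'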

Thanks,


Re: [ceph-users] Slow Request on OSD

2016-09-02 Thread Reed Dier
Just to circle back to this:

Drives: Seagate ST8000NM0065
Controller: LSI 3108 RAID-on-Chip
At the time, no BBU on RoC controller.
Each OSD drive was configured as a single RAID0 VD.

What I believe to be the snake that bit us was the Seagate drives’ on-board 
caching.

Using storcli to manage the controller/drive, the pdcache value for /cx/vx was 
set to default, which in this case is on.

So now all of the VD’s have the pdcache value set to off.

At the time the controller’s write-cache setting was also set to write back, 
and has since been set to write-through until BBU’s are installed.
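
For anyone wanting to make the same change, the storcli invocations look roughly like this; the controller/VD numbers are placeholders for whatever yours are, and wrcache stays at wt until a healthy BBU is actually in place:

  /opt/MegaRAID/storcli/storcli64 /c0/v0 set pdcache=off
  /opt/MegaRAID/storcli/storcli64 /c0/v0 set wrcache=wt
  # once the BBU is installed and healthy:
  /opt/MegaRAID/storcli/storcli64 /c0/v0 set wrcache=wb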

Below is an example of our current settings in use post power-event:

> $ sudo /opt/MegaRAID/storcli/storcli64 /c0/v0 show all
> Controller = 0
> Status = Success
> Description = None
> 
> 
> /c0/v0 :
> ==
> 
> --
> DG/VD TYPE  State Access Consist Cache Cac sCC Size Name
> --
> 0/0   RAID0 Optl  RW Yes RWTD  -   ON  7.276 TB ceph1
> --
> 
> Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
> Optl=Optimal|RO=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
> Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
> AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
> Check Consistency
> 
> 
> PDs for VD 0 :
> 
> 
> ---
> EID:Slt DID State DG Size Intf Med SED PI SeSz ModelSp
> ---
> 252:0 9 Onln   0 7.276 TB SAS  HDD N   N  4 KB ST8000NM0065 U
> ---
> 
> EID-Enclosure Device ID|Slt-Slot No.|DID-Device ID|DG-DriveGroup
> DHS-Dedicated Hot Spare|UGood-Unconfigured Good|GHS-Global Hotspare
> UBad-Unconfigured Bad|Onln-Online|Offln-Offline|Intf-Interface
> Med-Media Type|SED-Self Encryptive Drive|PI-Protection Info
> SeSz-Sector Size|Sp-Spun|U-Up|D-Down|T-Transition|F-Foreign
> UGUnsp-Unsupported|UGShld-UnConfigured shielded|HSPShld-Hotspare shielded
> CFShld-Configured shielded|Cpybck-CopyBack|CBShld-Copyback Shielded
> 
> 
> VD0 Properties :
> ==
> Strip Size = 256 KB
> Number of Blocks = 1953374208
> VD has Emulated PD = No
> Span Depth = 1
> Number of Drives Per Span = 1
> Write Cache(initial setting) = WriteThrough
> Disk Cache Policy = Disabled
> Encryption = None
> Data Protection = Disabled
> Active Operations = None
> Exposed to OS = Yes
> Creation Date = 17-06-2016
> Creation Time = 02:49:02 PM
> Emulation type = default
> Cachebypass size = Cachebypass-64k
> Cachebypass Mode = Cachebypass Intelligent
> Is LD Ready for OS Requests = Yes
> SCSI NAA Id = 600304801bb4c0001ef6ca5ea0fcb283


Hopefully this is a much safer configuration, and can help anyone else avoid 
the same kind of destructive issues.

The only downside of this configuration is the hit to write I/O from 
less-than-optimal write scheduling compared to cached writes. We hope to enable 
write-back at the controller level after the BBUs are installed.

Thanks,

Reed

> On Sep 1, 2016, at 6:21 AM, Cloud List <cloud-l...@sg.or.id> wrote:
> 
> 
> 
> On Thu, Sep 1, 2016 at 3:50 PM, Nick Fisk <n...@fisk.me.uk 
> <mailto:n...@fisk.me.uk>> wrote:
> > > Op 31 augustus 2016 om 23:21 schreef Reed Dier <reed.d...@focusvq.com 
> > > <mailto:reed.d...@focusvq.com>>:
> > >
> > >
> > > Multiple XFS corruptions, multiple leveldb issues. Looked to be result of 
> > > write cache settings which have been adjusted now.
> 
> Reed, I realise that you are probably very busy attempting recovery at the 
> moment, but when things calm down, I think it would be very beneficial to the 
> list if you could expand on what settings caused this to happen. It might 
> just stop this happening to someone else in the future.
> 
> Agree with Nick, when things settle down and (hopefully) all the data is 
> recovered, appreciate if Reed can share what kinid of write cache settings 
> can cause this problem and what adjustment was made to prevent this kind of 
> problem from happening.
> 
> Thank you.
> 
> -ip-

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

