Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread Reed Dier
Appreciate the input.

Wasn’t sure if ceph-volume was the one setting these bits of metadata or 
something else.

Appreciate the help guys.

Thanks,

Reed

> The fix is in core Ceph (the OSD/BlueStore code), not ceph-volume. :) 
> journal_rotational is still a thing in BlueStore; it represents the combined 
> WAL+DB devices.
> -Greg 
> On Jun 4, 2018, at 11:53 AM, Alfredo Deza  wrote:
> 
> ceph-volume doesn't do anything here with the device metadata, and is
> something that bluestore has as an internal mechanism. Unsure if there
> is anything
> one can do to change this on the OSD itself (vs. injecting args)




Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread Alfredo Deza
On Mon, Jun 4, 2018 at 12:37 PM, Reed Dier  wrote:
> Hi Caspar,
>
> David is correct, in that the issue I was having was with SSD OSDs with NVMe
> bluefs_db reporting as HDD, creating an artificial throttle based on what
> David was mentioning: a prevention to keep spinning rust from thrashing. Not
> sure if the journal_rotational bit should be 1, but either way it shouldn’t
> affect you, since yours are HDD OSDs. Curious how these OSDs were deployed, per
> the below part of the message.
>
> Copying Alfredo, as I’m not sure if something changed with respect to
> ceph-volume in 12.2.2 (when this originally happened) to 12.2.5 (I’m sure
> plenty did), because I recently had an NVMe drive fail on me unexpectedly
> (curse you Micron), and had to nuke and redo some SSD OSDs, and it was my
> first time deploying with ceph-deploy after the ceph-disk deprecation. The
> new OSD’s appear to report correctly wrt to the rotational status, where the
> others did not. So that appears to be working correctly, just wanted to
> provide some positive feedback there. Not sure if there’s an easy way to
> change those metadata tags on the OSDs, so that I don’t have to inject the
> args every time I need to reweight. Also feels like journal_rotational
> wouldn’t be a thing in bluestore?

ceph-volume doesn't do anything here with the device metadata, and is
something that bluestore has as an internal mechanism. Unsure if there
is anything
one can do to change this on the OSD itself (vs. injecting args)

>
> ceph osd metadata |grep ‘id\|model\|type\|rotational’
>
> "id": 63,
>
> "bluefs_db_model": "MTFDHAX1T2MCF-1AN1ZABYY",
>
> "bluefs_db_rotational": "0",
>
> "bluefs_db_type": "nvme",
>
> "bluefs_slow_model": "",
>
> "bluefs_slow_rotational": "0",
>
> "bluefs_slow_type": "ssd",
>
> "bluestore_bdev_model": "",
>
> "bluestore_bdev_rotational": "0",
>
> "bluestore_bdev_type": "ssd",
>
> "journal_rotational": "1",
>
> "rotational": "0"
>
> "id": 64,
>
> "bluefs_db_model": "INTEL SSDPED1D960GAY",
>
> "bluefs_db_rotational": "0",
>
> "bluefs_db_type": "nvme",
>
> "bluefs_slow_model": "",
>
> "bluefs_slow_rotational": "0",
>
> "bluefs_slow_type": "ssd",
>
> "bluestore_bdev_model": "",
>
> "bluestore_bdev_rotational": "0",
>
> "bluestore_bdev_type": "ssd",
>
> "journal_rotational": "0",
>
> "rotational": "0"
>
>
> osd.63 was deployed using ceph-volume lvm in 12.2.2, and osd.64 was
> redeployed using ceph-deploy in 12.2.5 with the ceph-volume backend.
>
> Reed
>
> On Jun 4, 2018, at 8:16 AM, David Turner  wrote:
>
> I don't believe this really applies to you. The problem here was with an SSD
> OSD that was incorrectly labeled as an HDD OSD by Ceph. The fix was to
> inject a recovery sleep setting of 0 for those OSDs to speed up recovery. The
> sleep is needed on HDDs to avoid thrashing, but the bug was that SSDs were
> being incorrectly identified as HDDs, and SSDs don't have a problem with
> thrashing.
>
> You can try increasing osd_max_backfills. Watch your disk utilization as you
> do this so that you don't accidentally kill your client io by setting that
> too high, assuming that still needs priority.
>
> On Mon, Jun 4, 2018, 3:55 AM Caspar Smit  wrote:
>>
>> Hi Reed,
>>
>> "Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
>> bluestore opened the floodgates."
>>
>> What exactly did you change/inject here?
>>
>> We have a cluster with 10TB SATA HDD's which each have a 100GB SSD based
>> block.db
>>
>> Looking at ceph osd metadata for each of those:
>>
>> "bluefs_db_model": "SAMSUNG MZ7KM960",
>> "bluefs_db_rotational": "0",
>> "bluefs_db_type": "ssd",
>> "bluefs_slow_model": "ST1NM0086-2A",
>> "bluefs_slow_rotational": "1",
>> "bluefs_slow_type": "hdd",
>> "bluestore_bdev_rotational": "1",
>> "bluestore_bdev_type": "hdd",
>> "default_device_class": "hdd",
>> "journal_rotational": "1",
>> "osd_objectstore": "bluestore",
>> "rotational": "1"
>>
>> Looks to me like I'm hitting the same issue, right?
>>
>> ps. An upgrade of Ceph is planned in the near future, but for now I would
>> like to use the workaround if applicable to me.
>>
>> Thank you in advance.
>>
>> Kind regards,
>> Caspar Smit
>>
>> 2018-02-26 23:22 GMT+01:00 Reed Dier :
>>>
>>> Quick turn around,
>>>
>>> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
>>> bluestore opened the floodgates.
>>>
>>> pool objects-ssd id 20
>>>   recovery io 1512 MB/s, 21547 objects/s
>>>
>>> pool fs-metadata-ssd id 16
>>>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>>>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
>>>
>>>
>>> Graph of performance jump. Extremely marked.
>>> https://imgur.com/a/LZR9R
>>>
>>> So at least we now have the gun to go with the smoke.

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread Gregory Farnum
On Mon, Jun 4, 2018 at 9:38 AM Reed Dier  wrote:

> Copying Alfredo, as I’m not sure if something changed with respect to
> ceph-volume in 12.2.2 (when this originally happened) to 12.2.5 (I’m sure
> plenty did), because I recently had an NVMe drive fail on me unexpectedly
> (curse you Micron), and had to nuke and redo some SSD OSDs, and it was my
> first time deploying with ceph-deploy after the ceph-disk deprecation. The
> new OSD’s appear to report correctly wrt to the rotational status, where
> the others did not. So that appears to be working correctly, just wanted to
> provide some positive feedback there. Not sure if there’s an easy way to
> change those metadata tags on the OSDs, so that I don’t have to inject the
> args every time I need to reweight. Also feels like journal_rotational
> wouldn’t be a thing in bluestore?
>
>
The fix is in core Ceph (the OSD/BlueStore code), not ceph-volume. :)
journal_rotational is still a thing in BlueStore; it represents the
combined WAL+DB devices.
-Greg


Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread Reed Dier
Hi Caspar,

David is correct, in that the issue I was having was with SSD OSDs with NVMe 
bluefs_db reporting as HDD, creating an artificial throttle based on what David 
was mentioning: a prevention to keep spinning rust from thrashing. Not sure if 
the journal_rotational bit should be 1, but either way it shouldn’t affect you, 
since yours are HDD OSDs. Curious how these OSDs were deployed, per the below 
part of the message.

Copying Alfredo, as I’m not sure if something changed with respect to 
ceph-volume in 12.2.2 (when this originally happened) to 12.2.5 (I’m sure 
plenty did), because I recently had an NVMe drive fail on me unexpectedly 
(curse you Micron), and had to nuke and redo some SSD OSDs, and it was my first 
time deploying with ceph-deploy after the ceph-disk deprecation. The new OSD’s 
appear to report correctly wrt to the rotational status, where the others did 
not. So that appears to be working correctly, just wanted to provide some 
positive feedback there. Not sure if there’s an easy way to change those 
metadata tags on the OSDs, so that I don’t have to inject the args every time I 
need to reweight. Also feels like journal_rotational wouldn’t be a thing in 
bluestore?

> ceph osd metadata |grep ‘id\|model\|type\|rotational’
> "id": 63,
> "bluefs_db_model": "MTFDHAX1T2MCF-1AN1ZABYY",
> "bluefs_db_rotational": "0",
> "bluefs_db_type": "nvme",
> "bluefs_slow_model": "",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_model": "",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_type": "ssd",
> "journal_rotational": "1",
> "rotational": "0"
> "id": 64,
> "bluefs_db_model": "INTEL SSDPED1D960GAY",
> "bluefs_db_rotational": "0",
> "bluefs_db_type": "nvme",
> "bluefs_slow_model": "",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_model": "",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_type": "ssd",
> "journal_rotational": "0",
> "rotational": "0"


osd.63 was deployed using ceph-volume lvm in 12.2.2, and osd.64 was 
redeployed using ceph-deploy in 12.2.5 with the ceph-volume backend.

Reed

> On Jun 4, 2018, at 8:16 AM, David Turner  wrote:
> 
> I don't believe this really applies to you. The problem here was with an SSD 
> OSD that was incorrectly labeled as an HDD OSD by Ceph. The fix was to inject 
> a recovery sleep setting of 0 for those OSDs to speed up recovery. The sleep is 
> needed on HDDs to avoid thrashing, but the bug was that SSDs were being 
> incorrectly identified as HDDs, and SSDs don't have a problem with thrashing.
> 
> You can try increasing osd_max_backfills. Watch your disk utilization as you 
> do this so that you don't accidentally kill your client io by setting that 
> too high, assuming that still needs priority.
> 
> On Mon, Jun 4, 2018, 3:55 AM Caspar Smit  > wrote:
> Hi Reed,
> 
> "Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
> bluestore opened the floodgates."
> 
> What exactly did you change/inject here?
> 
> We have a cluster with 10TB SATA HDD's which each have a 100GB SSD based 
> block.db
> 
> Looking at ceph osd metadata for each of those:
> 
> "bluefs_db_model": "SAMSUNG MZ7KM960",
> "bluefs_db_rotational": "0",
> "bluefs_db_type": "ssd",
> "bluefs_slow_model": "ST1NM0086-2A",
> "bluefs_slow_rotational": "1",
> "bluefs_slow_type": "hdd",
> "bluestore_bdev_rotational": "1",
> "bluestore_bdev_type": "hdd",
> "default_device_class": "hdd",
> "journal_rotational": "1",
> "osd_objectstore": "bluestore",
> "rotational": "1"
> 
> Looks to me like I'm hitting the same issue, right?
> 
> ps. An upgrade of Ceph is planned in the near future, but for now I would like 
> to use the workaround if applicable to me.
> 
> Thank you in advance.
> 
> Kind regards,
> Caspar Smit
> 
> 2018-02-26 23:22 GMT+01:00 Reed Dier  >:
> Quick turn around,
> 
> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
> bluestore opened the floodgates.
> 
>> pool objects-ssd id 20
>>   recovery io 1512 MB/s, 21547 objects/s
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
> 
> Graph of performance jump. Extremely marked.
> https://imgur.com/a/LZR9R 
> 
> So at least we now have the gun to go with the smoke.
> 
> Thanks for the help and appreciate you pointing me in some directions that I 
> was able to use to figure out the issue.
> 
> Adding to ceph.conf for future OSD conversions.
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 26, 2018, at 4:12 PM, Reed Dier > 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread David Turner
I don't believe this really applies to you. The problem here was with an
SSD OSD that was incorrectly labeled as an HDD OSD by Ceph. The fix was to
inject a recovery sleep setting of 0 for those OSDs to speed up recovery. The
sleep is needed on HDDs to avoid thrashing, but the bug was that SSDs were
being incorrectly identified as HDDs, and SSDs don't have a problem with
thrashing.

You can try increasing osd_max_backfills. Watch your disk utilization as
you do this so that you don't accidentally kill your client io by setting
that too high, assuming that still needs priority.
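
As a rough sketch of what I mean (the value of 4 is only an example and iostat
comes from the sysstat package; tune to your own hardware and priorities):

    # raise backfill concurrency on all OSDs at runtime (illustrative value)
    ceph tell osd.* injectargs '--osd_max_backfills 4'
    # watch per-disk utilization while the backfills run
    iostat -x 5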

On Mon, Jun 4, 2018, 3:55 AM Caspar Smit  wrote:

> Hi Reed,
>
> "Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
> bluestore opened the floodgates."
>
> What exactly did you change/inject here?
>
> We have a cluster with 10TB SATA HDD's which each have a 100GB SSD based
> block.db
>
> Looking at ceph osd metadata for each of those:
>
> "bluefs_db_model": "SAMSUNG MZ7KM960",
> "bluefs_db_rotational": "0",
> "bluefs_db_type": "ssd",
> "bluefs_slow_model": "ST1NM0086-2A",
> "bluefs_slow_rotational": "1",
> "bluefs_slow_type": "hdd",
> "bluestore_bdev_rotational": "1",
> "bluestore_bdev_type": "hdd",
> "default_device_class": "hdd",
> *"journal_rotational": "1",*
> "osd_objectstore": "bluestore",
> "rotational": "1"
>
> Looks to me like I'm hitting the same issue, right?
>
> ps. An upgrade of Ceph is planned in the near future, but for now I would
> like to use the workaround if applicable to me.
>
> Thank you in advance.
>
> Kind regards,
> Caspar Smit
>
> 2018-02-26 23:22 GMT+01:00 Reed Dier :
>
>> Quick turn around,
>>
>> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
>> bluestore opened the floodgates.
>>
>> pool objects-ssd id 20
>>   recovery io 1512 MB/s, 21547 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
>>
>>
>> Graph of performance jump. Extremely marked.
>> https://imgur.com/a/LZR9R
>>
>> So at least we now have the gun to go with the smoke.
>>
>> Thanks for the help and appreciate you pointing me in some directions
>> that I was able to use to figure out the issue.
>>
>> Adding to ceph.conf for future OSD conversions.
>>
>> Thanks,
>>
>> Reed
>>
>>
>> On Feb 26, 2018, at 4:12 PM, Reed Dier  wrote:
>>
>> For the record, I am not seeing a demonstrative fix by injecting the
>> value of 0 into the OSDs running.
>>
>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require
>> restart)
>>
>>
>> If it does indeed need to be restarted, I will need to wait for the
>> current backfills to finish their process as restarting an OSD would bring
>> me under min_size.
>>
>> However, doing config show on the osd daemon appears to have taken the
>> value of 0.
>>
>> ceph daemon osd.24 config show | grep recovery_sleep
>> "osd_recovery_sleep": "0.00",
>> "osd_recovery_sleep_hdd": "0.10",
>> "osd_recovery_sleep_hybrid": "0.00",
>> "osd_recovery_sleep_ssd": "0.00",
>>
>>
>> I may take the restart as an opportunity to also move to 12.2.3 at the
>> same time, since it is not expected that that should affect this issue.
>>
>> I could also attempt to change osd_recovery_sleep_hdd as well, since
>> these are ssd osd’s, it shouldn’t make a difference, but its a free move.
>>
>> Thanks,
>>
>> Reed
>>
>> On Feb 26, 2018, at 3:42 PM, Gregory Farnum  wrote:
>>
>> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier  wrote:
>>
>>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an
>>> interim solution to getting the metadata configured correctly.
>>>
>>
>> Yes, that's a good workaround as long as you don't have any actual hybrid
>> OSDs (or aren't worried about them sleeping...I'm not sure if that setting
>> came from experience or not).
>>
>>
>>>
>>> For reference, here is the complete metadata for osd.24, bluestore SATA
>>> SSD with NVMe block.db.
>>>
>>> {
>>> "id": 24,
>>> "arch": "x86_64",
>>> "back_addr": "",
>>> "back_iface": "bond0",
>>> "bluefs": "1",
>>> "bluefs_db_access_mode": "blk",
>>> "bluefs_db_block_size": "4096",
>>> "bluefs_db_dev": "259:0",
>>> "bluefs_db_dev_node": "nvme0n1",
>>> "bluefs_db_driver": "KernelDevice",
>>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>>> "bluefs_db_rotational": "0",
>>> "bluefs_db_serial": " ",
>>> "bluefs_db_size": "16000221184",
>>> "bluefs_db_type": "nvme",
>>> "bluefs_single_shared_device": "0",
>>> "bluefs_slow_access_mode": "blk",
>>> "bluefs_slow_block_size": "4096",
>>> "bluefs_slow_dev": "253:8",
>>> "bluefs_slow_dev_node": "dm-8",
>>> 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-06-04 Thread Caspar Smit
Hi Reed,

"Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
bluestore opened the floodgates."

What exactly did you change/inject here?

We have a cluster with 10TB SATA HDD's which each have a 100GB SSD based
block.db

Looking at ceph osd metadata for each of those:

"bluefs_db_model": "SAMSUNG MZ7KM960",
"bluefs_db_rotational": "0",
"bluefs_db_type": "ssd",
"bluefs_slow_model": "ST1NM0086-2A",
"bluefs_slow_rotational": "1",
"bluefs_slow_type": "hdd",
"bluestore_bdev_rotational": "1",
"bluestore_bdev_type": "hdd",
"default_device_class": "hdd",
*"journal_rotational": "1",*
"osd_objectstore": "bluestore",
"rotational": "1"

Looks to me like I'm hitting the same issue, right?

ps. An upgrade of Ceph is planned in the near future, but for now I would
like to use the workaround if applicable to me.

Thank you in advance.

Kind regards,
Caspar Smit

2018-02-26 23:22 GMT+01:00 Reed Dier :

> Quick turn around,
>
> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
> bluestore opened the floodgates.
>
> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
>
>
> Graph of performance jump. Extremely marked.
> https://imgur.com/a/LZR9R
>
> So at least we now have the gun to go with the smoke.
>
> Thanks for the help and appreciate you pointing me in some directions that
> I was able to use to figure out the issue.
>
> Adding to ceph.conf for future OSD conversions.
>
> Thanks,
>
> Reed
>
>
> On Feb 26, 2018, at 4:12 PM, Reed Dier  wrote:
>
> For the record, I am not seeing a demonstrative fix by injecting the value
> of 0 into the OSDs running.
>
> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require
> restart)
>
>
> If it does indeed need to be restarted, I will need to wait for the
> current backfills to finish their process as restarting an OSD would bring
> me under min_size.
>
> However, doing config show on the osd daemon appears to have taken the
> value of 0.
>
> ceph daemon osd.24 config show | grep recovery_sleep
> "osd_recovery_sleep": "0.00",
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.00",
> "osd_recovery_sleep_ssd": "0.00",
>
>
> I may take the restart as an opportunity to also move to 12.2.3 at the
> same time, since it is not expected that that should affect this issue.
>
> I could also attempt to change osd_recovery_sleep_hdd as well, since these
> are ssd osd’s, it shouldn’t make a difference, but its a free move.
>
> Thanks,
>
> Reed
>
> On Feb 26, 2018, at 3:42 PM, Gregory Farnum  wrote:
>
> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier  wrote:
>
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an
>> interim solution to getting the metadata configured correctly.
>>
>
> Yes, that's a good workaround as long as you don't have any actual hybrid
> OSDs (or aren't worried about them sleeping...I'm not sure if that setting
> came from experience or not).
>
>
>>
>> For reference, here is the complete metadata for osd.24, bluestore SATA
>> SSD with NVMe block.db.
>>
>> {
>> "id": 24,
>> "arch": "x86_64",
>> "back_addr": "",
>> "back_iface": "bond0",
>> "bluefs": "1",
>> "bluefs_db_access_mode": "blk",
>> "bluefs_db_block_size": "4096",
>> "bluefs_db_dev": "259:0",
>> "bluefs_db_dev_node": "nvme0n1",
>> "bluefs_db_driver": "KernelDevice",
>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>> "bluefs_db_rotational": "0",
>> "bluefs_db_serial": " ",
>> "bluefs_db_size": "16000221184",
>> "bluefs_db_type": "nvme",
>> "bluefs_single_shared_device": "0",
>> "bluefs_slow_access_mode": "blk",
>> "bluefs_slow_block_size": "4096",
>> "bluefs_slow_dev": "253:8",
>> "bluefs_slow_dev_node": "dm-8",
>> "bluefs_slow_driver": "KernelDevice",
>> "bluefs_slow_model": "",
>> "bluefs_slow_partition_path": "/dev/dm-8",
>> "bluefs_slow_rotational": "0",
>> "bluefs_slow_size": "1920378863616",
>> "bluefs_slow_type": "ssd",
>> "bluestore_bdev_access_mode": "blk",
>> "bluestore_bdev_block_size": "4096",
>> "bluestore_bdev_dev": "253:8",
>> "bluestore_bdev_dev_node": "dm-8",
>> "bluestore_bdev_driver": "KernelDevice",
>> "bluestore_bdev_model": "",
>> "bluestore_bdev_partition_path": "/dev/dm-8",
>> "bluestore_bdev_rotational": "0",
>> "bluestore_bdev_size": "1920378863616",
>> "bluestore_bdev_type": "ssd",
>> "ceph_version": "ceph version 12.2.2
>> 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-27 Thread David Turner
I have 2 different configurations that are incorrectly showing rotational
for the OSDs.  The [1]first is a server with disks behind controllers and
an NVME riser card.  It has 2 different OSD types, one with the block on an
HDD and WAL on the NVME as well as a pure NVME OSD.  The Hybrid OSD seems
to be showing the correct configuration, but the pure NVME OSD is
incorrectly showing up with a rotational journal.

The [2]second configuration I have is with a new server configuration
without a controller and new NVME disks in 2.5" form factor.  It is also
showing a rotational journal. What I find most interesting between all of
these is that it doesn't appear that journal_rotational is being used by
the hybrid OSDs, while it's gumming up the works for the pure flash OSDs.
This seems to match what the others in this thread have seen.


[1] HDD + NVME WAL
"bluefs_db_rotational": "1",
"bluefs_wal_rotational": "0",
"bluestore_bdev_rotational": "1",
"journal_rotational": "0",
"rotational": "1"
Pure NVME
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"

[2]No controller NVME
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"

On Mon, Feb 26, 2018 at 5:54 PM Oliver Freyermuth <
freyerm...@physik.uni-bonn.de> wrote:

> Am 26.02.2018 um 23:29 schrieb Gregory Farnum:
> >
> >
> > On Mon, Feb 26, 2018 at 2:23 PM Reed Dier  > wrote:
> >
> > Quick turn around,
> >
> > Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s
> on bluestore opened the floodgates.
> >
> >
> > Oh right, the OSD does not (think it can) have anything it can really do
> if you've got a rotational journal and an SSD main device, and since
> BlueStore was misreporting itself as having a rotational journal the OSD
> falls back to the hard drive settings. Sorry I didn't work through that
> ahead of time; glad this works around it for you!
> > -Greg
>
> To chime in, this also helps for me! Replication is much faster now.
> It's a bit strange though that for my metadata-OSDs I see the following
> with iostat now:
> Device:  rrqm/s  wrqm/s      r/s     w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sdb        0,00    0,00  1333,00  301,40  143391,20  42861,60    227,92     21,05  13,78     8,88    35,44   0,59  96,64
> sda        0,00    0,00  1283,40  258,20  139004,80    876,00    181,47      7,18   4,66     5,11     2,40   0,54  83,32
> (MDS should be doing nothing on it)
> while on the OSDs to which things are backfilled I see:
> Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sda        0,00    0,00  47,20  458,20  367,20  1628,00      7,90      0,92   1,82     6,95     1,29   1,18  59,86
> sdb        0,00    0,00  48,20  589,00  375,20  1892,00      7,12      0,40   0,63     0,78     0,62   0,59  37,32
>
> So it seems the "sending" OSDs are now finally taken to their limit (they
> read and write a lot), but the receiving side is rather bored.
> Maybe this strange effect (many writes when actually reading stuff for
> backfilling) is normal for metadata => RocksDB?
>
> In any case, glad this "rotational" issue is in the queue to be fixed in a
> future release ;-).
>
> Cheers,
> Oliver
>
> >
> >
> >
> >> pool objects-ssd id 20
> >>   recovery io 1512 MB/s, 21547 objects/s
> >>
> >> pool fs-metadata-ssd id 16
> >>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
> >>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
> >
> > Graph of performance jump. Extremely marked.
> > https://imgur.com/a/LZR9R
> >
> > So at least we now have the gun to go with the smoke.
> >
> > Thanks for the help and appreciate you pointing me in some
> directions that I was able to use to figure out the issue.
> >
> > Adding to ceph.conf for future OSD conversions.
> >
> > Thanks,
> >
> > Reed
> >
> >
> >> On Feb 26, 2018, at 4:12 PM, Reed Dier  > wrote:
> >>
> >> For the record, I am not seeing a demonstrative fix by injecting
> the value of 0 into the OSDs running.
> >>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may
> require restart)
> >>
> >> If it does indeed need to be restarted, I will need to wait for the
> current backfills to finish their process as restarting an OSD would bring
> me under min_size.
> >>
> >> However, doing config show on the osd daemon appears to have taken
> the value of 0.
> >>
> >>> ceph daemon osd.24 config show | grep recovery_sleep
> >>> "osd_recovery_sleep": "0.00",
> >>> "osd_recovery_sleep_hdd": "0.10",
> >>> 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Oliver Freyermuth
Am 26.02.2018 um 23:29 schrieb Gregory Farnum:
> 
> 
> On Mon, Feb 26, 2018 at 2:23 PM Reed Dier  > wrote:
> 
> Quick turn around,
> 
> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
> bluestore opened the floodgates.
> 
> 
> Oh right, the OSD does not (think it can) have anything it can really do if 
> you've got a rotational journal and an SSD main device, and since BlueStore 
> was misreporting itself as having a rotational journal the OSD falls back to 
> the hard drive settings. Sorry I didn't work through that ahead of time; glad 
> this works around it for you!
> -Greg

To chime in, this also helps for me! Replication is much faster now. 
It's a bit strange though that for my metadata-OSDs I see the following with 
iostat now:
Device:  rrqm/s  wrqm/s      r/s     w/s      rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdb        0,00    0,00  1333,00  301,40  143391,20  42861,60    227,92     21,05  13,78     8,88    35,44   0,59  96,64
sda        0,00    0,00  1283,40  258,20  139004,80    876,00    181,47      7,18   4,66     5,11     2,40   0,54  83,32
(MDS should be doing nothing on it)
while on the OSDs to which things are backfilled I see:
Device:  rrqm/s  wrqm/s    r/s     w/s   rkB/s    wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sda        0,00    0,00  47,20  458,20  367,20  1628,00      7,90      0,92   1,82     6,95     1,29   1,18  59,86
sdb        0,00    0,00  48,20  589,00  375,20  1892,00      7,12      0,40   0,63     0,78     0,62   0,59  37,32

So it seems the "sending" OSDs are now finally taken to their limit (they read 
and write a lot), but the receiving side is rather bored. 
Maybe this strange effect (many writes when actually reading stuff for 
backfilling) is normal for metadata => RocksDB? 

In any case, glad this "rotational" issue is in the queue to be fixed in a
future release ;-).

Cheers,
Oliver

>  
> 
> 
>> pool objects-ssd id 20
>>   recovery io 1512 MB/s, 21547 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
> 
> Graph of performance jump. Extremely marked.
> https://imgur.com/a/LZR9R
> 
> So at least we now have the gun to go with the smoke.
> 
> Thanks for the help and appreciate you pointing me in some directions 
> that I was able to use to figure out the issue.
> 
> Adding to ceph.conf for future OSD conversions.
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 26, 2018, at 4:12 PM, Reed Dier > > wrote:
>>
>> For the record, I am not seeing a demonstrative fix by injecting the 
>> value of 0 into the OSDs running.
>>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may 
>>> require restart)
>>
>> If it does indeed need to be restarted, I will need to wait for the 
>> current backfills to finish their process as restarting an OSD would bring 
>> me under min_size.
>>
>> However, doing config show on the osd daemon appears to have taken the 
>> value of 0.
>>
>>> ceph daemon osd.24 config show | grep recovery_sleep
>>>     "osd_recovery_sleep": "0.00",
>>>     "osd_recovery_sleep_hdd": "0.10",
>>>     "osd_recovery_sleep_hybrid": "0.00",
>>>     "osd_recovery_sleep_ssd": "0.00",
>>
>> I may take the restart as an opportunity to also move to 12.2.3 at the 
>> same time, since it is not expected that that should affect this issue.
>>
>> I could also attempt to change osd_recovery_sleep_hdd as well, since 
>> these are ssd osd’s, it shouldn’t make a difference, but its a free move.
>>
>> Thanks,
>>
>> Reed
>>
>>> On Feb 26, 2018, at 3:42 PM, Gregory Farnum >> > wrote:
>>>
>>> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier >> > wrote:
>>>
>>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an 
>>> interim solution to getting the metadata configured correctly.
>>>
>>>
>>> Yes, that's a good workaround as long as you don't have any actual 
>>> hybrid OSDs (or aren't worried about them sleeping...I'm not sure if that 
>>> setting came from experience or not).
>>>  
>>>
>>>
>>> For reference, here is the complete metadata for osd.24, bluestore 
>>> SATA SSD with NVMe block.db.
>>>
 {
         "id": 24,
         "arch": "x86_64",
         "back_addr": "",
         "back_iface": "bond0",
         "bluefs": "1",
         "bluefs_db_access_mode": "blk",
         "bluefs_db_block_size": "4096",
         "bluefs_db_dev": "259:0",
         

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 2:23 PM Reed Dier  wrote:

> Quick turn around,
>
> Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on
> bluestore opened the floodgates.
>

Oh right, the OSD does not (think it can) have anything it can really do if
you've got a rotational journal and an SSD main device, and since BlueStore
was misreporting itself as having a rotational journal the OSD falls back
to the hard drive settings. Sorry I didn't work through that ahead of time;
glad this works around it for you!
-Greg


>
> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr
>
>
> Graph of performance jump. Extremely marked.
> https://imgur.com/a/LZR9R
>
> So at least we now have the gun to go with the smoke.
>
> Thanks for the help and appreciate you pointing me in some directions that
> I was able to use to figure out the issue.
>
> Adding to ceph.conf for future OSD conversions.
>
> Thanks,
>
> Reed
>
>
> On Feb 26, 2018, at 4:12 PM, Reed Dier  wrote:
>
> For the record, I am not seeing a demonstrative fix by injecting the value
> of 0 into the OSDs running.
>
> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require
> restart)
>
>
> If it does indeed need to be restarted, I will need to wait for the
> current backfills to finish their process as restarting an OSD would bring
> me under min_size.
>
> However, doing config show on the osd daemon appears to have taken the
> value of 0.
>
> ceph daemon osd.24 config show | grep recovery_sleep
> "osd_recovery_sleep": "0.00",
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.00",
> "osd_recovery_sleep_ssd": "0.00",
>
>
> I may take the restart as an opportunity to also move to 12.2.3 at the
> same time, since it is not expected that that should affect this issue.
>
> I could also attempt to change osd_recovery_sleep_hdd as well, since these
> are ssd osd’s, it shouldn’t make a difference, but its a free move.
>
> Thanks,
>
> Reed
>
> On Feb 26, 2018, at 3:42 PM, Gregory Farnum  wrote:
>
> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier  wrote:
>
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an
>> interim solution to getting the metadata configured correctly.
>>
>
> Yes, that's a good workaround as long as you don't have any actual hybrid
> OSDs (or aren't worried about them sleeping...I'm not sure if that setting
> came from experience or not).
>
>
>>
>> For reference, here is the complete metadata for osd.24, bluestore SATA
>> SSD with NVMe block.db.
>>
>> {
>> "id": 24,
>> "arch": "x86_64",
>> "back_addr": "",
>> "back_iface": "bond0",
>> "bluefs": "1",
>> "bluefs_db_access_mode": "blk",
>> "bluefs_db_block_size": "4096",
>> "bluefs_db_dev": "259:0",
>> "bluefs_db_dev_node": "nvme0n1",
>> "bluefs_db_driver": "KernelDevice",
>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>> "bluefs_db_rotational": "0",
>> "bluefs_db_serial": " ",
>> "bluefs_db_size": "16000221184",
>> "bluefs_db_type": "nvme",
>> "bluefs_single_shared_device": "0",
>> "bluefs_slow_access_mode": "blk",
>> "bluefs_slow_block_size": "4096",
>> "bluefs_slow_dev": "253:8",
>> "bluefs_slow_dev_node": "dm-8",
>> "bluefs_slow_driver": "KernelDevice",
>> "bluefs_slow_model": "",
>> "bluefs_slow_partition_path": "/dev/dm-8",
>> "bluefs_slow_rotational": "0",
>> "bluefs_slow_size": "1920378863616",
>> "bluefs_slow_type": "ssd",
>> "bluestore_bdev_access_mode": "blk",
>> "bluestore_bdev_block_size": "4096",
>> "bluestore_bdev_dev": "253:8",
>> "bluestore_bdev_dev_node": "dm-8",
>> "bluestore_bdev_driver": "KernelDevice",
>> "bluestore_bdev_model": "",
>> "bluestore_bdev_partition_path": "/dev/dm-8",
>> "bluestore_bdev_rotational": "0",
>> "bluestore_bdev_size": "1920378863616",
>> "bluestore_bdev_type": "ssd",
>> "ceph_version": "ceph version 12.2.2
>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
>> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
>> "default_device_class": "ssd",
>> "distro": "ubuntu",
>> "distro_description": "Ubuntu 16.04.3 LTS",
>> "distro_version": "16.04",
>> "front_addr": "",
>> "front_iface": "bond0",
>> "hb_back_addr": "",
>> "hb_front_addr": "",
>> "hostname": “host00",
>> "journal_rotational": "1",
>> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
Quick turn around,

Changing/injecting osd_recovery_sleep_hdd into the running SSD OSD’s on 
bluestore opened the floodgates.

> pool objects-ssd id 20
>   recovery io 1512 MB/s, 21547 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 6494 keys/s, 271 objects/s
>   client io 82325 B/s rd, 68146 B/s wr, 1 op/s rd, 0 op/s wr

Graph of performance jump. Extremely marked.
https://imgur.com/a/LZR9R 

So at least we now have the gun to go with the smoke.

Thanks for the help and appreciate you pointing me in some directions that I 
was able to use to figure out the issue.

Adding to ceph.conf for future OSD conversions.
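
Roughly this, for anyone following along (a sketch, on the assumption that
these hosts carry no true hybrid OSDs, so zeroing both sleeps is safe):

    [osd]
    # work around journal_rotational being misreported as 1 on flash OSDs
    osd_recovery_sleep_hdd = 0
    osd_recovery_sleep_hybrid = 0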

Thanks,

Reed


> On Feb 26, 2018, at 4:12 PM, Reed Dier  wrote:
> 
> For the record, I am not seeing a demonstrative fix by injecting the value of 
> 0 into the OSDs running.
>> osd_recovery_sleep_hybrid = '0.00' (not observed, change may require 
>> restart)
> 
> If it does indeed need to be restarted, I will need to wait for the current 
> backfills to finish their process as restarting an OSD would bring me under 
> min_size.
> 
> However, doing config show on the osd daemon appears to have taken the value 
> of 0.
> 
>> ceph daemon osd.24 config show | grep recovery_sleep
>> "osd_recovery_sleep": "0.00",
>> "osd_recovery_sleep_hdd": "0.10",
>> "osd_recovery_sleep_hybrid": "0.00",
>> "osd_recovery_sleep_ssd": "0.00",
> 
> 
> I may take the restart as an opportunity to also move to 12.2.3 at the same 
> time, since it is not expected that that should affect this issue.
> 
> I could also attempt to change osd_recovery_sleep_hdd as well, since these 
> are ssd osd’s, it shouldn’t make a difference, but its a free move.
> 
> Thanks,
> 
> Reed
> 
>> On Feb 26, 2018, at 3:42 PM, Gregory Farnum > > wrote:
>> 
>> On Mon, Feb 26, 2018 at 12:26 PM Reed Dier > > wrote:
>> I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
>> solution to getting the metadata configured correctly.
>> 
>> Yes, that's a good workaround as long as you don't have any actual hybrid 
>> OSDs (or aren't worried about them sleeping...I'm not sure if that setting 
>> came from experience or not).
>>  
>> 
>> For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
>> with NVMe block.db.
>> 
>>> {
>>> "id": 24,
>>> "arch": "x86_64",
>>> "back_addr": "",
>>> "back_iface": "bond0",
>>> "bluefs": "1",
>>> "bluefs_db_access_mode": "blk",
>>> "bluefs_db_block_size": "4096",
>>> "bluefs_db_dev": "259:0",
>>> "bluefs_db_dev_node": "nvme0n1",
>>> "bluefs_db_driver": "KernelDevice",
>>> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>>> "bluefs_db_partition_path": "/dev/nvme0n1p4",
>>> "bluefs_db_rotational": "0",
>>> "bluefs_db_serial": " ",
>>> "bluefs_db_size": "16000221184",
>>> "bluefs_db_type": "nvme",
>>> "bluefs_single_shared_device": "0",
>>> "bluefs_slow_access_mode": "blk",
>>> "bluefs_slow_block_size": "4096",
>>> "bluefs_slow_dev": "253:8",
>>> "bluefs_slow_dev_node": "dm-8",
>>> "bluefs_slow_driver": "KernelDevice",
>>> "bluefs_slow_model": "",
>>> "bluefs_slow_partition_path": "/dev/dm-8",
>>> "bluefs_slow_rotational": "0",
>>> "bluefs_slow_size": "1920378863616",
>>> "bluefs_slow_type": "ssd",
>>> "bluestore_bdev_access_mode": "blk",
>>> "bluestore_bdev_block_size": "4096",
>>> "bluestore_bdev_dev": "253:8",
>>> "bluestore_bdev_dev_node": "dm-8",
>>> "bluestore_bdev_driver": "KernelDevice",
>>> "bluestore_bdev_model": "",
>>> "bluestore_bdev_partition_path": "/dev/dm-8",
>>> "bluestore_bdev_rotational": "0",
>>> "bluestore_bdev_size": "1920378863616",
>>> "bluestore_bdev_type": "ssd",
>>> "ceph_version": "ceph version 12.2.2 
>>> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
>>> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
>>> "default_device_class": "ssd",
>>> "distro": "ubuntu",
>>> "distro_description": "Ubuntu 16.04.3 LTS",
>>> "distro_version": "16.04",
>>> "front_addr": "",
>>> "front_iface": "bond0",
>>> "hb_back_addr": "",
>>> "hb_front_addr": "",
>>> "hostname": “host00",
>>> "journal_rotational": "1",
>>> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 
>>> UTC 2018",
>>> "kernel_version": "4.13.0-26-generic",
>>> "mem_swap_kb": "124999672",
>>> "mem_total_kb": "131914008",
>>> "os": "Linux",
>>> "osd_data": "/var/lib/ceph/osd/ceph-24",
>>> "osd_objectstore": "bluestore",

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 12:26 PM Reed Dier  wrote:

> I will try to set the hybrid sleeps to 0 on the affected OSDs as an
> interim solution to getting the metadata configured correctly.
>

Yes, that's a good workaround as long as you don't have any actual hybrid
OSDs (or aren't worried about them sleeping...I'm not sure if that setting
came from experience or not).


>
> For reference, here is the complete metadata for osd.24, bluestore SATA
> SSD with NVMe block.db.
>
> {
> "id": 24,
> "arch": "x86_64",
> "back_addr": "",
> "back_iface": "bond0",
> "bluefs": "1",
> "bluefs_db_access_mode": "blk",
> "bluefs_db_block_size": "4096",
> "bluefs_db_dev": "259:0",
> "bluefs_db_dev_node": "nvme0n1",
> "bluefs_db_driver": "KernelDevice",
> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> "bluefs_db_partition_path": "/dev/nvme0n1p4",
> "bluefs_db_rotational": "0",
> "bluefs_db_serial": " ",
> "bluefs_db_size": "16000221184",
> "bluefs_db_type": "nvme",
> "bluefs_single_shared_device": "0",
> "bluefs_slow_access_mode": "blk",
> "bluefs_slow_block_size": "4096",
> "bluefs_slow_dev": "253:8",
> "bluefs_slow_dev_node": "dm-8",
> "bluefs_slow_driver": "KernelDevice",
> "bluefs_slow_model": "",
> "bluefs_slow_partition_path": "/dev/dm-8",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_size": "1920378863616",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_access_mode": "blk",
> "bluestore_bdev_block_size": "4096",
> "bluestore_bdev_dev": "253:8",
> "bluestore_bdev_dev_node": "dm-8",
> "bluestore_bdev_driver": "KernelDevice",
> "bluestore_bdev_model": "",
> "bluestore_bdev_partition_path": "/dev/dm-8",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_size": "1920378863616",
> "bluestore_bdev_type": "ssd",
> "ceph_version": "ceph version 12.2.2
> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
> "default_device_class": "ssd",
> "distro": "ubuntu",
> "distro_description": "Ubuntu 16.04.3 LTS",
> "distro_version": "16.04",
> "front_addr": "",
> "front_iface": "bond0",
> "hb_back_addr": "",
> "hb_front_addr": "",
> "hostname": “host00",
> "journal_rotational": "1",
> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44
> UTC 2018",
> "kernel_version": "4.13.0-26-generic",
> "mem_swap_kb": "124999672",
> "mem_total_kb": "131914008",
> "os": "Linux",
> "osd_data": "/var/lib/ceph/osd/ceph-24",
> "osd_objectstore": "bluestore",
> "rotational": "0"
> }
>
>
> So it looks like it guessed(?) the bluestore_bdev_type/default_device_class
> correctly (though it may have been an inherited value?), and bluefs_db_type
> was correctly set to nvme as well.
>
> So I’m not sure why journal_rotational is still showing 1.
> Maybe something in the ceph-volume lvm piece that isn’t correctly setting
> that flag on OSD creation?
> Also seems like the journal_rotational field should have been deprecated
> in bluestore as bluefs_db_rotational should cover that, and if there were a
> WAL partition as well, I assume there would be something to the tune of
> bluefs_wal_rotational or something like that, and journal would never be
> used for bluestore?
>

Thanks to both of you for helping diagnose this issue. I created a ticket
and have a PR up to fix it: http://tracker.ceph.com/issues/23141,
https://github.com/ceph/ceph/pull/20602

Until that gets backported into another Luminous release you'll need to do
some kind of workaround though. :/
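
For example, something along these lines, depending on which sleep setting the
mislabeled OSDs are actually picking up (just a sketch; a runtime injection may
still need an OSD restart to fully take effect):

    ceph tell osd.* injectargs '--osd_recovery_sleep_hybrid 0 --osd_recovery_sleep_hdd 0'
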
-Greg


>
> Appreciate the help.
>
> Thanks,
> Reed
>
> On Feb 26, 2018, at 1:28 PM, Gregory Farnum  wrote:
>
> On Mon, Feb 26, 2018 at 11:21 AM Reed Dier  wrote:
>
>> The ‘good perf’ that I reported below was the result of beginning 5 new
>> bluestore conversions which results in a leading edge of ‘good’
>> performance, before trickling off.
>>
>> This performance lasted about 20 minutes, where it backfilled a small set
>> of PGs off of non-bluestore OSDs.
>>
>> Current performance is now hovering around:
>>
>> pool objects-ssd id 20
>>   recovery io 14285 kB/s, 202 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
>>
>>
>> What are you referencing when you talk about recovery ops per second?
>>
>> These are recovery ops as reported by ceph -s or via stats exported via
>> influx plugin in mgr, and via local collectd collection.
>>
>> Also, what are the values for osd_recovery_sleep_hdd
>> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
>> that your BlueStore SSD OSDs are correctly reporting both themselves and
>> their journals as non-rotational?

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
I will try to set the hybrid sleeps to 0 on the affected OSDs as an interim 
solution to getting the metadata configured correctly.
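
Concretely, something like this per affected OSD (osd.24 is just one example
id; either the admin socket on the OSD host or injectargs should do):

    ceph daemon osd.24 config set osd_recovery_sleep_hybrid 0
    # or remotely:
    ceph tell osd.24 injectargs '--osd_recovery_sleep_hybrid 0'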

For reference, here is the complete metadata for osd.24, bluestore SATA SSD 
with NVMe block.db.

> {
> "id": 24,
> "arch": "x86_64",
> "back_addr": "",
> "back_iface": "bond0",
> "bluefs": "1",
> "bluefs_db_access_mode": "blk",
> "bluefs_db_block_size": "4096",
> "bluefs_db_dev": "259:0",
> "bluefs_db_dev_node": "nvme0n1",
> "bluefs_db_driver": "KernelDevice",
> "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> "bluefs_db_partition_path": "/dev/nvme0n1p4",
> "bluefs_db_rotational": "0",
> "bluefs_db_serial": " ",
> "bluefs_db_size": "16000221184",
> "bluefs_db_type": "nvme",
> "bluefs_single_shared_device": "0",
> "bluefs_slow_access_mode": "blk",
> "bluefs_slow_block_size": "4096",
> "bluefs_slow_dev": "253:8",
> "bluefs_slow_dev_node": "dm-8",
> "bluefs_slow_driver": "KernelDevice",
> "bluefs_slow_model": "",
> "bluefs_slow_partition_path": "/dev/dm-8",
> "bluefs_slow_rotational": "0",
> "bluefs_slow_size": "1920378863616",
> "bluefs_slow_type": "ssd",
> "bluestore_bdev_access_mode": "blk",
> "bluestore_bdev_block_size": "4096",
> "bluestore_bdev_dev": "253:8",
> "bluestore_bdev_dev_node": "dm-8",
> "bluestore_bdev_driver": "KernelDevice",
> "bluestore_bdev_model": "",
> "bluestore_bdev_partition_path": "/dev/dm-8",
> "bluestore_bdev_rotational": "0",
> "bluestore_bdev_size": "1920378863616",
> "bluestore_bdev_type": "ssd",
> "ceph_version": "ceph version 12.2.2 
> (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)",
> "cpu": "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz",
> "default_device_class": "ssd",
> "distro": "ubuntu",
> "distro_description": "Ubuntu 16.04.3 LTS",
> "distro_version": "16.04",
> "front_addr": "",
> "front_iface": "bond0",
> "hb_back_addr": "",
> "hb_front_addr": "",
> "hostname": “host00",
> "journal_rotational": "1",
> "kernel_description": "#29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 
> 2018",
> "kernel_version": "4.13.0-26-generic",
> "mem_swap_kb": "124999672",
> "mem_total_kb": "131914008",
> "os": "Linux",
> "osd_data": "/var/lib/ceph/osd/ceph-24",
> "osd_objectstore": "bluestore",
> "rotational": "0"
> }


So it looks like it guessed(?) the bluestore_bdev_type/default_device_class 
correctly (though it may have been an inherited value?), and bluefs_db_type was 
correctly set to nvme as well.

So I’m not sure why journal_rotational is still showing 1.
Maybe something in the ceph-volume lvm piece that isn’t correctly setting that 
flag on OSD creation?
Also seems like the journal_rotational field should have been deprecated in 
bluestore as bluefs_db_rotational should cover that, and if there were a WAL 
partition as well, I assume there would be something to the tune of 
bluefs_wal_rotational or something like that, and journal would never be used 
for bluestore?

Appreciate the help.

Thanks,
Reed

> On Feb 26, 2018, at 1:28 PM, Gregory Farnum  wrote:
> 
> On Mon, Feb 26, 2018 at 11:21 AM Reed Dier  > wrote:
> The ‘good perf’ that I reported below was the result of beginning 5 new 
> bluestore conversions which results in a leading edge of ‘good’ performance, 
> before trickling off.
> 
> This performance lasted about 20 minutes, where it backfilled a small set of 
> PGs off of non-bluestore OSDs.
> 
> Current performance is now hovering around:
>> pool objects-ssd id 20
>>   recovery io 14285 kB/s, 202 objects/s
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
> 
>> What are you referencing when you talk about recovery ops per second?
> 
> These are recovery ops as reported by ceph -s or via stats exported via 
> influx plugin in mgr, and via local collectd collection.
> 
>> Also, what are the values for osd_recovery_sleep_hdd and 
>> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
>> your BlueStore SSD OSDs are correctly reporting both themselves and their 
>> journals as non-rotational?
> 
> This yields more interesting results.
> Pasting results for 3 sets of OSDs in this order
>  {0}hdd+nvme block.db
> {24}ssd+nvme block.db
> {59}ssd+nvme journal
> 
>> ceph osd metadata | grep 'id\|rotational'
>> "id": 0,
>> "bluefs_db_rotational": "0",
>> "bluefs_slow_rotational": "1",
>> "bluestore_bdev_rotational": "1",
>> 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 11:21 AM Reed Dier  wrote:

> The ‘good perf’ that I reported below was the result of beginning 5 new
> bluestore conversions which results in a leading edge of ‘good’
> performance, before trickling off.
>
> This performance lasted about 20 minutes, where it backfilled a small set
> of PGs off of non-bluestore OSDs.
>
> Current performance is now hovering around:
>
> pool objects-ssd id 20
>   recovery io 14285 kB/s, 202 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr
>
>
> What are you referencing when you talk about recovery ops per second?
>
> These are recovery ops as reported by ceph -s or via stats exported via
> influx plugin in mgr, and via local collectd collection.
>
> Also, what are the values for osd_recovery_sleep_hdd
> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
> that your BlueStore SSD OSDs are correctly reporting both themselves and
> their journals as non-rotational?
>
>
> This yields more interesting results.
> Pasting results for 3 sets of OSDs in this order
>  {0}hdd+nvme block.db
> {24}ssd+nvme block.db
> {59}ssd+nvme journal
>
> ceph osd metadata | grep 'id\|rotational'
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "1",
> "bluestore_bdev_rotational": "1",
> *"journal_rotational": "1",*
> "rotational": “1"
>
> "id": 24,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "0",
> "bluestore_bdev_rotational": "0",
> *"journal_rotational": "1",*
> "rotational": “0"
>
> "id": 59,
> "journal_rotational": "0",
> "rotational": “0"
>
>
> I wonder if it matters/is correct to see "journal_rotational": “1” for the
> bluestore OSD’s {0,24} with nvme block.db.
>
> Hope this may be helpful in determining the root cause.
>

If you have an SSD main store and a hard drive ("rotational") journal, the
OSD will insert recovery sleeps from the osd_recovery_sleep_hybrid config
option. By default that is .025 (seconds).

I believe you can override the setting (I'm not sure how), but you really
want to correct that flag at the OS layer. Generally when we see this
there's a RAID card or something between the solid-state device and the
host which is lying about the state of the world.
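
(As a sketch of what I mean by the OS layer, with sdX standing in for whichever
device is affected; the sysfs change does not survive a reboot, so a udev rule
would be needed to make it stick:)

    cat /sys/block/sdX/queue/rotational      # 1 means the kernel thinks it spins
    echo 0 | sudo tee /sys/block/sdX/queue/rotational
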
-Greg


>
> If it helps, all of the OSD’s were originally deployed with ceph-deploy,
> but are now being redone with ceph-volume locally on each host.
>
> Thanks,
>
> Reed
>
> On Feb 26, 2018, at 1:00 PM, Gregory Farnum  wrote:
>
> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier  wrote:
>
>> After my last round of backfills completed, I started 5 more bluestore
>> conversions, which helped me recognize a very specific pattern of
>> performance.
>>
>> pool objects-ssd id 20
>>   recovery io 757 MB/s, 10845 objects/s
>>
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
>>
>>
>> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
>> When backfilling from bluestore SSD OSD’s, they appear to be throttled at
>> the aforementioned <20 ops per OSD.
>>
>
> Wait, is that the current state? What are you referencing when you talk
> about recovery ops per second?
>
> Also, what are the values for osd_recovery_sleep_hdd
> and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
> that your BlueStore SSD OSDs are correctly reporting both themselves and
> their journals as non-rotational?
> -Greg
>
>
>>
>> This would corroborate why the first batch of SSD’s I migrated to
>> bluestore were all at “full” speed, as all of the OSD’s they were
>> backfilling from were filestore based, compared to increasingly bluestore
>> backfill targets, leading to increasingly long backfill times as I move
>> from one host to the next.
>>
>> Looking at the recovery settings, the recovery_sleep and
>> recovery_sleep_ssd values across bluestore or filestore OSDs are showing as
>> 0 values, which means no sleep/throttle if I am reading everything
>> correctly.
>>
>> sudo ceph daemon osd.73 config show | grep recovery
>> "osd_allow_recovery_below_min_size": "true",
>> "osd_debug_skip_full_check_in_recovery": "false",
>> "osd_force_recovery_pg_log_entries_factor": "1.30",
>> "osd_min_recovery_priority": "0",
>> "osd_recovery_cost": "20971520",
>> "osd_recovery_delay_start": "0.00",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_recovery_max_active": "35",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_max_omap_entries_per_chunk": "64000",
>> "osd_recovery_max_single_start": "1",
>> "osd_recovery_op_priority": "3",
>> "osd_recovery_op_warn_multiple": "16",
>> "osd_recovery_priority": "5",
>> 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
The ‘good perf’ that I reported below was the result of beginning 5 new 
bluestore conversions which results in a leading edge of ‘good’ performance, 
before trickling off.

This performance lasted about 20 minutes, where it backfilled a small set of 
PGs off of non-bluestore OSDs.

Current performance is now hovering around:
> pool objects-ssd id 20
>   recovery io 14285 kB/s, 202 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 262 keys/s, 12 objects/s
>   client io 412 kB/s rd, 67593 B/s wr, 5 op/s rd, 0 op/s wr

> What are you referencing when you talk about recovery ops per second?

These are recovery ops as reported by ceph -s or via stats exported via influx 
plugin in mgr, and via local collectd collection.

> Also, what are the values for osd_recovery_sleep_hdd and 
> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
> your BlueStore SSD OSDs are correctly reporting both themselves and their 
> journals as non-rotational?

This yields more interesting results.
Pasting results for 3 sets of OSDs in this order
 {0}hdd+nvme block.db
{24}ssd+nvme block.db
{59}ssd+nvme journal

> ceph osd metadata | grep 'id\|rotational'
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "1",
> "bluestore_bdev_rotational": "1",
> "journal_rotational": "1",
> "rotational": “1"
> "id": 24,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": “0"
> "id": 59,
> "journal_rotational": "0",
> "rotational": “0"

I wonder if it matters/is correct to see "journal_rotational": “1” for the 
bluestore OSD’s {0,24} with nvme block.db.

Hope this may be helpful in determining the root cause.

If it helps, all of the OSD’s were originally deployed with ceph-deploy, but 
are now being redone with ceph-volume locally on each host.
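
(For the redo, each OSD is roughly a ceph-volume invocation like the one below;
the data and block.db device paths are placeholders rather than my actual
layout:)

    ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1pY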

Thanks,

Reed

> On Feb 26, 2018, at 1:00 PM, Gregory Farnum  wrote:
> 
> On Mon, Feb 26, 2018 at 9:12 AM Reed Dier  > wrote:
> After my last round of backfills completed, I started 5 more bluestore 
> conversions, which helped me recognize a very specific pattern of performance.
> 
>> pool objects-ssd id 20
>>   recovery io 757 MB/s, 10845 objects/s
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
> 
> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
> When backfilling from bluestore SSD OSD’s, they appear to be throttled at the 
> aforementioned <20 ops per OSD.
> 
> Wait, is that the current state? What are you referencing when you talk about 
> recovery ops per second?
> 
> Also, what are the values for osd_recovery_sleep_hdd and 
> osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata" that 
> your BlueStore SSD OSDs are correctly reporting both themselves and their 
> journals as non-rotational?
> -Greg
>  
> 
> This would corroborate why the first batch of SSD’s I migrated to bluestore 
> were all at “full” speed, as all of the OSD’s they were backfilling from were 
> filestore based, compared to increasingly bluestore backfill targets, leading 
> to increasingly long backfill times as I move from one host to the next.
> 
> Looking at the recovery settings, the recovery_sleep and recovery_sleep_ssd 
> values across bluestore or filestore OSDs are showing as 0 values, which 
> means no sleep/throttle if I am reading everything correctly.
> 
>> sudo ceph daemon osd.73 config show | grep recovery
>> "osd_allow_recovery_below_min_size": "true",
>> "osd_debug_skip_full_check_in_recovery": "false",
>> "osd_force_recovery_pg_log_entries_factor": "1.30",
>> "osd_min_recovery_priority": "0",
>> "osd_recovery_cost": "20971520",
>> "osd_recovery_delay_start": "0.00",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_recovery_max_active": "35",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_max_omap_entries_per_chunk": "64000",
>> "osd_recovery_max_single_start": "1",
>> "osd_recovery_op_priority": "3",
>> "osd_recovery_op_warn_multiple": "16",
>> "osd_recovery_priority": "5",
>> "osd_recovery_retry_interval": "30.00",
>> "osd_recovery_sleep": "0.00",
>> "osd_recovery_sleep_hdd": "0.10",
>> "osd_recovery_sleep_hybrid": "0.025000",
>> "osd_recovery_sleep_ssd": "0.00",
>> "osd_recovery_thread_suicide_timeout": "300",
>> "osd_recovery_thread_timeout": "30",
>> "osd_scrub_during_recovery": "false",
> 
> 
> As far as I know, the device class is configured correctly; it all shows as 
> ssd/hdd correctly in ceph osd tree.
> 
> So hopefully this may be enough of a smoking gun to help narrow down where 
> this may be stemming from.

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Gregory Farnum
On Mon, Feb 26, 2018 at 9:12 AM Reed Dier  wrote:

> After my last round of backfills completed, I started 5 more bluestore
> conversions, which helped me recognize a very specific pattern of
> performance.
>
> pool objects-ssd id 20
>   recovery io 757 MB/s, 10845 objects/s
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr
>
>
> The “non-throttled” backfills are only coming from filestore SSD OSD’s.
> When backfilling from bluestore SSD OSD’s, they appear to be throttled at
> the aforementioned <20 ops per OSD.
>

Wait, is that the current state? What are you referencing when you talk
about recovery ops per second?

Also, what are the values for osd_recovery_sleep_hdd
and osd_recovery_sleep_hybrid, and can you validate via "ceph osd metadata"
that your BlueStore SSD OSDs are correctly reporting both themselves and
their journals as non-rotational?
-Greg


>
> This would corroborate why the first batch of SSD’s I migrated to
> bluestore were all at “full” speed, as all of the OSD’s they were
> backfilling from were filestore based, compared to increasingly bluestore
> backfill targets, leading to increasingly long backfill times as I move
> from one host to the next.
>
> Looking at the recovery settings, the recovery_sleep and
> recovery_sleep_ssd values across bluestore or filestore OSDs are showing as
> 0 values, which means no sleep/throttle if I am reading everything
> correctly.
>
> sudo ceph daemon osd.73 config show | grep recovery
> "osd_allow_recovery_below_min_size": "true",
> "osd_debug_skip_full_check_in_recovery": "false",
> "osd_force_recovery_pg_log_entries_factor": "1.30",
> "osd_min_recovery_priority": "0",
> "osd_recovery_cost": "20971520",
> "osd_recovery_delay_start": "0.00",
> "osd_recovery_forget_lost_objects": "false",
> "osd_recovery_max_active": "35",
> "osd_recovery_max_chunk": "8388608",
> "osd_recovery_max_omap_entries_per_chunk": "64000",
> "osd_recovery_max_single_start": "1",
> "osd_recovery_op_priority": "3",
> "osd_recovery_op_warn_multiple": "16",
> "osd_recovery_priority": "5",
> "osd_recovery_retry_interval": "30.00",
> *"osd_recovery_sleep": "0.00",*
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.025000",
> *"osd_recovery_sleep_ssd": "0.00",*
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_thread_timeout": "30",
> "osd_scrub_during_recovery": "false",
>
>
> As far as I know, the device class is configured correctly; it all shows as
> ssd/hdd correctly in ceph osd tree.
>
> So hopefully this may be enough of a smoking gun to help narrow down where
> this may be stemming from.
>
> Thanks,
>
> Reed
>
> On Feb 23, 2018, at 10:04 AM, David Turner  wrote:
>
> Here is a [1] link to a ML thread tracking some slow backfilling on
> bluestore.  It came down to the backfill sleep setting for them.  Maybe it
> will help.
>
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html
>
> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier  wrote:
>
>> Probably unrelated, but I do keep seeing this odd negative objects
>> degraded message on the fs-metadata pool:
>>
>> pool fs-metadata-ssd id 16
>>   -34/3 objects degraded (-1133.333%)
>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
>>
>>
>> Don’t mean to clutter the ML/thread, but it did seem odd. Maybe it’s a 
>> culprit? Maybe it’s some weird sampling interval issue that’s been solved in 
>> 12.2.3?
>>
>> Thanks,
>>
>> Reed
>>
>>
>> On Feb 23, 2018, at 8:26 AM, Reed Dier  wrote:
>>
>> Below is ceph -s
>>
>>   cluster:
>> id: {id}
>> health: HEALTH_WARN
>> noout flag(s) set
>> 260610/1068004947 objects misplaced (0.024%)
>> Degraded data redundancy: 23157232/1068004947 objects
>> degraded (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>
>>   services:
>> mon: 3 daemons, quorum mon02,mon01,mon03
>> mgr: mon03(active), standbys: mon02
>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>  flags noout
>>
>>   data:
>> pools:   5 pools, 5316 pgs
>> objects: 339M objects, 46627 GB
>> usage:   154 TB used, 108 TB / 262 TB avail
>> pgs: 23157232/1068004947 objects degraded (2.168%)
>>  260610/1068004947 objects misplaced (0.024%)
>>  4984 active+clean
>>  183  active+undersized+degraded+remapped+backfilling
>>  145  active+undersized+degraded+remapped+backfill_wait
>>  3active+remapped+backfill_wait
>>  1active+remapped+backfilling
>>
>>   io:
>> client:   8428 kB/s rd, 47905 B/s 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-26 Thread Reed Dier
After my last round of backfills completed, I started 5 more bluestore 
conversions, which helped me recognize a very specific pattern of performance.

> pool objects-ssd id 20
>   recovery io 757 MB/s, 10845 objects/s
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 36265 keys/s, 1633 objects/s
>   client io 2544 kB/s rd, 36788 B/s wr, 1 op/s rd, 0 op/s wr

The “non-throttled” backfills are only coming from filestore SSD OSD’s.
When backfilling from bluestore SSD OSD’s, they appear to be throttled at the 
aforementioned <20 ops per OSD.
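
For reference, the per-pool recovery figures above look like the output of ceph osd pool stats; if anyone wants to watch them over time while testing, something along these lines works (the interval is arbitrary):

watch -n 5 ceph osd pool stats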

This would corroborate why the first batch of SSD’s I migrated to bluestore 
were all at “full” speed, as all of the OSD’s they were backfilling from were 
filestore based, compared to increasingly bluestore backfill targets, leading 
to increasingly long backfill times as I move from one host to the next.

Looking at the recovery settings, the recovery_sleep and recovery_sleep_ssd 
values across bluestore or filestore OSDs are showing as 0 values, which means 
no sleep/throttle if I am reading everything correctly.

> sudo ceph daemon osd.73 config show | grep recovery
> "osd_allow_recovery_below_min_size": "true",
> "osd_debug_skip_full_check_in_recovery": "false",
> "osd_force_recovery_pg_log_entries_factor": "1.30",
> "osd_min_recovery_priority": "0",
> "osd_recovery_cost": "20971520",
> "osd_recovery_delay_start": "0.00",
> "osd_recovery_forget_lost_objects": "false",
> "osd_recovery_max_active": "35",
> "osd_recovery_max_chunk": "8388608",
> "osd_recovery_max_omap_entries_per_chunk": "64000",
> "osd_recovery_max_single_start": "1",
> "osd_recovery_op_priority": "3",
> "osd_recovery_op_warn_multiple": "16",
> "osd_recovery_priority": "5",
> "osd_recovery_retry_interval": "30.00",
> "osd_recovery_sleep": "0.00",
> "osd_recovery_sleep_hdd": "0.10",
> "osd_recovery_sleep_hybrid": "0.025000",
> "osd_recovery_sleep_ssd": "0.00",
> "osd_recovery_thread_suicide_timeout": "300",
> "osd_recovery_thread_timeout": "30",
> "osd_scrub_during_recovery": "false",


As far as I know, the device class is configured correctly; it all shows as 
ssd/hdd correctly in ceph osd tree.
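
For completeness, the class assignments can also be read straight from the CRUSH map rather than eyeballed in the tree; these should be read-only commands on the Luminous CLI:

ceph osd crush class ls         # device classes currently defined, e.g. hdd / ssd
ceph osd crush class ls-osd ssd # OSD ids assigned to the ssd class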

So hopefully this may be enough of a smoking gun to help narrow down where this 
may be stemming from.

Thanks,

Reed

> On Feb 23, 2018, at 10:04 AM, David Turner  wrote:
> 
> Here is a [1] link to a ML thread tracking some slow backfilling on 
> bluestore.  It came down to the backfill sleep setting for them.  Maybe it 
> will help.
> 
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html 
> 
> On Fri, Feb 23, 2018 at 10:46 AM Reed Dier wrote:
> Probably unrelated, but I do keep seeing this odd negative objects degraded 
> message on the fs-metadata pool:
> 
>> pool fs-metadata-ssd id 16
>>   -34/3 objects degraded (-1133.333%)
>>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
> 
> Don’t mean to clutter the ML/thread, but it did seem odd. Maybe it’s a 
> culprit? Maybe it’s some weird sampling interval issue that’s been solved in 
> 12.2.3?
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 23, 2018, at 8:26 AM, Reed Dier wrote:
>> 
>> Below is ceph -s
>> 
>>>   cluster:
>>> id: {id}
>>> health: HEALTH_WARN
>>> noout flag(s) set
>>> 260610/1068004947 objects misplaced (0.024%)
>>> Degraded data redundancy: 23157232/1068004947 objects degraded 
>>> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>>> 
>>>   services:
>>> mon: 3 daemons, quorum mon02,mon01,mon03
>>> mgr: mon03(active), standbys: mon02
>>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>>  flags noout
>>> 
>>>   data:
>>> pools:   5 pools, 5316 pgs
>>> objects: 339M objects, 46627 GB
>>> usage:   154 TB used, 108 TB / 262 TB avail
>>> pgs: 23157232/1068004947 objects degraded (2.168%)
>>>  260610/1068004947 objects misplaced (0.024%)
>>>  4984 active+clean
>>>  183  active+undersized+degraded+remapped+backfilling
>>>  145  active+undersized+degraded+remapped+backfill_wait
>>>  3active+remapped+backfill_wait
>>>  1active+remapped+backfilling
>>> 
>>>   io:
>>> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>>> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>> 
>> Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
>> fs-metadata pool at 32 PG.
>> 
>>> Are you sure the recovery is actually going slower, or are the individual 
>>> ops larger or more expensive?

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread David Turner
Here is a [1] link to a ML thread tracking some slow backfilling on
bluestore.  It came down to the backfill sleep setting for them.  Maybe it
will help.

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg40256.html

On Fri, Feb 23, 2018 at 10:46 AM Reed Dier  wrote:

> Probably unrelated, but I do keep seeing this odd negative objects
> degraded message on the fs-metadata pool:
>
> pool fs-metadata-ssd id 16
>   -34/3 objects degraded (-1133.333%)
>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr
>
>
> Don’t mean to clutter the ML/thread, but it did seem odd. Maybe it’s a
> culprit? Maybe it’s some weird sampling interval issue that’s been solved in
> 12.2.3?
>
> Thanks,
>
> Reed
>
>
> On Feb 23, 2018, at 8:26 AM, Reed Dier  wrote:
>
> Below is ceph -s
>
>   cluster:
> id: {id}
> health: HEALTH_WARN
> noout flag(s) set
> 260610/1068004947 objects misplaced (0.024%)
> Degraded data redundancy: 23157232/1068004947 objects degraded
> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>
>   services:
> mon: 3 daemons, quorum mon02,mon01,mon03
> mgr: mon03(active), standbys: mon02
> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>  flags noout
>
>   data:
> pools:   5 pools, 5316 pgs
> objects: 339M objects, 46627 GB
> usage:   154 TB used, 108 TB / 262 TB avail
> pgs: 23157232/1068004947 objects degraded (2.168%)
>  260610/1068004947 objects misplaced (0.024%)
>  4984 active+clean
>  183  active+undersized+degraded+remapped+backfilling
>  145  active+undersized+degraded+remapped+backfill_wait
>  3active+remapped+backfill_wait
>  1active+remapped+backfilling
>
>   io:
> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
>
>
> Also the two pools on the SSDs, are the objects pool at 4096 PG, and the
> fs-metadata pool at 32 PG.
>
> Are you sure the recovery is actually going slower, or are the individual
> ops larger or more expensive?
>
> The objects should not vary wildly in size.
> Even if they were differing in size, the SSDs are roughly idle in their
> current state of backfilling when examining wait in iotop, or atop, or
> sysstat/iostat.
>
> This compares to when I was fully saturating the SATA backplane with over
> 1000MB/s of writes to multiple disks when the backfills were going “full
> speed.”
>
> Here is a breakdown of recovery io by pool:
>
> pool objects-ssd id 20
>   recovery io 6779 kB/s, 92 objects/s
>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>
> pool cephfs-hdd id 17
>   recovery io 40542 kB/s, 158 objects/s
>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
>
>
> So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client
> traffic at the moment, which seems conspicuous to me.
>
> Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops,
> with one OSD occasionally spiking up to 300-500 for a few minutes. Stats
> being pulled by both local CollectD instances on each node, as well as the
> Influx plugin in MGR as we evaluate that against collectd.
>
> Thanks,
>
> Reed
>
>
> On Feb 22, 2018, at 6:21 PM, Gregory Farnum  wrote:
>
> What's the output of "ceph -s" while this is happening?
>
> Is there some identifiable difference between these two states, like you
> get a lot of throughput on the data pools but then metadata recovery is
> slower?
>
> Are you sure the recovery is actually going slower, or are the individual
> ops larger or more expensive?
>
> My WAG is that recovering the metadata pool, composed mostly of
> directories stored in omap objects, is going much slower for some reason.
> You can adjust the cost of those individual ops some by
> changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm
> not sure which way you want to go or indeed if this has anything to do with
> the problem you're seeing. (eg, it could be that reading out the omaps is
> expensive, so you can get higher recovery op numbers by turning down the
> number of entries per request, but not actually see faster backfilling
> because you have to issue more requests.)
> -Greg
>
> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier  wrote:
>
>> Hi all,
>>
>> I am running into an odd situation that I cannot easily explain.
>> I am currently in the midst of destroy and rebuild of OSDs from filestore
>> to bluestore.
>> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing
>> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>>
>> My path to 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread Reed Dier
Probably unrelated, but I do keep seeing this odd negative objects degraded 
message on the fs-metadata pool:

> pool fs-metadata-ssd id 16
>   -34/3 objects degraded (-1133.333%)
>   recovery io 0 B/s, 89 keys/s, 2 objects/s
>   client io 51289 B/s rd, 101 kB/s wr, 0 op/s rd, 0 op/s wr

Don’t mean to clutter the ML/thread, but it did seem odd. Maybe it’s a 
culprit? Maybe it’s some weird sampling interval issue that’s been solved in 
12.2.3?

Thanks,

Reed


> On Feb 23, 2018, at 8:26 AM, Reed Dier  wrote:
> 
> Below is ceph -s
> 
>>   cluster:
>> id: {id}
>> health: HEALTH_WARN
>> noout flag(s) set
>> 260610/1068004947 objects misplaced (0.024%)
>> Degraded data redundancy: 23157232/1068004947 objects degraded 
>> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
>> 
>>   services:
>> mon: 3 daemons, quorum mon02,mon01,mon03
>> mgr: mon03(active), standbys: mon02
>> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
>> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>>  flags noout
>> 
>>   data:
>> pools:   5 pools, 5316 pgs
>> objects: 339M objects, 46627 GB
>> usage:   154 TB used, 108 TB / 262 TB avail
>> pgs: 23157232/1068004947 objects degraded (2.168%)
>>  260610/1068004947 objects misplaced (0.024%)
>>  4984 active+clean
>>  183  active+undersized+degraded+remapped+backfilling
>>  145  active+undersized+degraded+remapped+backfill_wait
>>  3active+remapped+backfill_wait
>>  1active+remapped+backfilling
>> 
>>   io:
>> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
>> recovery: 37057 kB/s, 50 keys/s, 217 objects/s
> 
> Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
> fs-metadata pool at 32 PG.
> 
>> Are you sure the recovery is actually going slower, or are the individual 
>> ops larger or more expensive?
> 
> The objects should not vary wildly in size.
> Even if they were differing in size, the SSDs are roughly idle in their 
> current state of backfilling when examining wait in iotop, or atop, or 
> sysstat/iostat.
> 
> This compares to when I was fully saturating the SATA backplane with over 
> 1000MB/s of writes to multiple disks when the backfills were going “full 
> speed.”
> 
> Here is a breakdown of recovery io by pool:
> 
>> pool objects-ssd id 20
>>   recovery io 6779 kB/s, 92 objects/s
>>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
>> 
>> pool fs-metadata-ssd id 16
>>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
>> 
>> pool cephfs-hdd id 17
>>   recovery io 40542 kB/s, 158 objects/s
>>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr
> 
> So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client 
> traffic at the moment, which seems conspicuous to me.
> 
> Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops, with 
> one OSD occasionally spiking up to 300-500 for a few minutes. Stats being 
> pulled by both local CollectD instances on each node, as well as the Influx 
> plugin in MGR as we evaluate that against collectd.
> 
> Thanks,
> 
> Reed
> 
> 
>> On Feb 22, 2018, at 6:21 PM, Gregory Farnum wrote:
>> 
>> What's the output of "ceph -s" while this is happening?
>> 
>> Is there some identifiable difference between these two states, like you get 
>> a lot of throughput on the data pools but then metadata recovery is slower?
>> 
>> Are you sure the recovery is actually going slower, or are the individual 
>> ops larger or more expensive?
>> 
>> My WAG is that recovering the metadata pool, composed mostly of directories 
>> stored in omap objects, is going much slower for some reason. You can adjust 
>> the cost of those individual ops some by changing 
>> osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure 
>> which way you want to go or indeed if this has anything to do with the 
>> problem you're seeing. (eg, it could be that reading out the omaps is 
>> expensive, so you can get higher recovery op numbers by turning down the 
>> number of entries per request, but not actually see faster backfilling 
>> because you have to issue more requests.)
>> -Greg
>> 
>> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier wrote:
>> Hi all,
>> 
>> I am running into an odd situation that I cannot easily explain.
>> I am currently in the midst of destroy and rebuild of OSDs from filestore to 
>> bluestore.
>> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
>> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>> 
>> My path to replacing the OSDs is to set the noout, norecover, norebalance 
>> flag, destroy the OSD, create the OSD back, (iterate n times, all within a 
>> single 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-23 Thread Reed Dier
Below is ceph -s

>   cluster:
> id: {id}
> health: HEALTH_WARN
> noout flag(s) set
> 260610/1068004947 objects misplaced (0.024%)
> Degraded data redundancy: 23157232/1068004947 objects degraded 
> (2.168%), 332 pgs unclean, 328 pgs degraded, 328 pgs undersized
> 
>   services:
> mon: 3 daemons, quorum mon02,mon01,mon03
> mgr: mon03(active), standbys: mon02
> mds: cephfs-1/1/1 up  {0=mon03=up:active}, 1 up:standby
> osd: 74 osds: 74 up, 74 in; 332 remapped pgs
>  flags noout
> 
>   data:
> pools:   5 pools, 5316 pgs
> objects: 339M objects, 46627 GB
> usage:   154 TB used, 108 TB / 262 TB avail
> pgs: 23157232/1068004947 objects degraded (2.168%)
>  260610/1068004947 objects misplaced (0.024%)
>  4984 active+clean
>  183  active+undersized+degraded+remapped+backfilling
>  145  active+undersized+degraded+remapped+backfill_wait
>  3active+remapped+backfill_wait
>  1active+remapped+backfilling
> 
>   io:
> client:   8428 kB/s rd, 47905 B/s wr, 130 op/s rd, 0 op/s wr
> recovery: 37057 kB/s, 50 keys/s, 217 objects/s

Also the two pools on the SSDs, are the objects pool at 4096 PG, and the 
fs-metadata pool at 32 PG.

> Are you sure the recovery is actually going slower, or are the individual ops 
> larger or more expensive?

The objects should not vary wildly in size.
Even if they were differing in size, the SSDs are roughly idle in their current 
state of backfilling when examining wait in iotop, or atop, or sysstat/iostat.
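
For anyone wanting to reproduce that check, a plain sysstat view per device is enough (a sketch; the interval and filtering are arbitrary):

iostat -dx 5
# low %util and await on the SSDs while backfilling indicates they are not I/O bound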

This compares to when I was fully saturating the SATA backplane with over 
1000MB/s of writes to multiple disks when the backfills were going “full speed.”

Here is a breakdown of recovery io by pool:

> pool objects-ssd id 20
>   recovery io 6779 kB/s, 92 objects/s
>   client io 3071 kB/s rd, 50 op/s rd, 0 op/s wr
> 
> pool fs-metadata-ssd id 16
>   recovery io 0 B/s, 28 keys/s, 2 objects/s
>   client io 109 kB/s rd, 67455 B/s wr, 1 op/s rd, 0 op/s wr
> 
> pool cephfs-hdd id 17
>   recovery io 40542 kB/s, 158 objects/s
>   client io 10056 kB/s rd, 142 op/s rd, 0 op/s wr

So the 24 HDD’s are outperforming the 50 SSD’s for recovery and client traffic 
at the moment, which seems conspicuous to me.

Most of the OSD’s with recovery ops to the SSDs are reporting 8-12 ops, with 
one OSD occasionally spiking up to 300-500 for a few minutes. Stats being 
pulled by both local CollectD instances on each node, as well as the Influx 
plugin in MGR as we evaluate that against collectd.

Thanks,

Reed


> On Feb 22, 2018, at 6:21 PM, Gregory Farnum  wrote:
> 
> What's the output of "ceph -s" while this is happening?
> 
> Is there some identifiable difference between these two states, like you get 
> a lot of throughput on the data pools but then metadata recovery is slower?
> 
> Are you sure the recovery is actually going slower, or are the individual ops 
> larger or more expensive?
> 
> My WAG is that recovering the metadata pool, composed mostly of directories 
> stored in omap objects, is going much slower for some reason. You can adjust 
> the cost of those individual ops some by changing 
> osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm not sure 
> which way you want to go or indeed if this has anything to do with the 
> problem you're seeing. (eg, it could be that reading out the omaps is 
> expensive, so you can get higher recovery op numbers by turning down the 
> number of entries per request, but not actually see faster backfilling 
> because you have to issue more requests.)
> -Greg
> 
> On Wed, Feb 21, 2018 at 2:57 PM Reed Dier wrote:
> Hi all,
> 
> I am running into an odd situation that I cannot easily explain.
> I am currently in the midst of destroy and rebuild of OSDs from filestore to 
> bluestore.
> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
> 
> My path to replacing the OSDs is to set the noout, norecover, norebalance 
> flag, destroy the OSD, create the OSD back, (iterate n times, all within a 
> single failure domain), unset the flags, and let it go. It finishes, rinse, 
> repeat.
> 
> For the SSD OSDs, they are SATA SSDs (Samsung SM863a), 10 to a node, with 2 
> NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions for 
> block.db (previously filestore journals).
> 2x10GbE networking between the nodes. SATA backplane caps out at around 10 
> Gb/s as it has 2x 6 Gb/s controllers. Luminous 12.2.2.
> 
> When the flags are unset, recovery starts and I see a very large rush of 
> traffic, however, after the first machine completed, the performance tapered 
> off at a rapid pace and trickles. Comparatively, I’m getting 100-200 recovery 
> ops on 3 HDDs, backfilling from 21 other HDDs, whereas I’m 

Re: [ceph-users] SSD Bluestore Backfills Slow

2018-02-22 Thread Gregory Farnum
What's the output of "ceph -s" while this is happening?

Is there some identifiable difference between these two states, like you
get a lot of throughput on the data pools but then metadata recovery is
slower?

Are you sure the recovery is actually going slower, or are the individual
ops larger or more expensive?

My WAG is that recovering the metadata pool, composed mostly of directories
stored in omap objects, is going much slower for some reason. You can
adjust the cost of those individual ops some by
changing osd_recovery_max_omap_entries_per_chunk (default: 8096), but I'm
not sure which way you want to go or indeed if this has anything to do with
the problem you're seeing. (eg, it could be that reading out the omaps is
expensive, so you can get higher recovery op numbers by turning down the
number of entries per request, but not actually see faster backfilling
because you have to issue more requests.)
-Greg
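
If anyone wants to experiment with that knob, it can be changed on the fly and reverted the same way; the value below is purely illustrative:

ceph tell osd.* injectargs '--osd_recovery_max_omap_entries_per_chunk 1024'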

On Wed, Feb 21, 2018 at 2:57 PM Reed Dier  wrote:

> Hi all,
>
> I am running into an odd situation that I cannot easily explain.
> I am currently in the midst of destroy and rebuild of OSDs from filestore
> to bluestore.
> With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing
> unexpected behavior. The HDDs and SSDs are set in crush accordingly.
>
> My path to replacing the OSDs is to set the noout, norecover, norebalance
> flag, destroy the OSD, create the OSD back, (iterate n times, all within a
> single failure domain), unset the flags, and let it go. It finishes, rinse,
> repeat.
>
> For the SSD OSDs, they are SATA SSDs (Samsung SM863a), 10 to a node, with
> 2 NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions
> for block.db (previously filestore journals).
> 2x10GbE networking between the nodes. SATA backplane caps out at around 10
> Gb/s as it has 2x 6 Gb/s controllers. Luminous 12.2.2.
>
> When the flags are unset, recovery starts and I see a very large rush of
> traffic, however, after the first machine completed, the performance
> tapered off at a rapid pace and trickles. Comparatively, I’m getting
> 100-200 recovery ops on 3 HDDs, backfilling from 21 other HDDs, whereas
> I’m getting 150-250 recovery ops on 5 SSDs, backfilling from 40 other SSDs.
> Every once in a while I will see a spike up to 500, 1000, or even 2000 ops
> on the SSDs, often a few hundred recovery ops from one OSD, and 8-15 ops
> from the others that are backfilling.
>
> This is a far cry from the more than 15-30k recovery ops that it started
> off recovering with 1-3k recovery ops from a single OSD to the backfilling
> OSD(s). And an even farther cry from the >15k recovery ops I was sustaining
> for over an hour or more before. I was able to rebuild a 1.9T SSD (1.1T
> used) in a little under an hour, and I could do about 5 at a time and still
> keep it at roughly an hour to backfill all of them, but then I hit a
> roadblock after the first machine, when I tried to do 10 at a time (single
> machine). I am now still experiencing the same thing on the third node,
> while doing 5 OSDs at a time.
>
> The pools associated with these SSDs are cephfs-metadata, as well as a
> pure rados object pool we use for our own internal applications. Both are
> size=3, min_size=2.
>
> It appears I am not the first to run into this, but it looks like there
> was no resolution: https://www.spinics.net/lists/ceph-users/msg41493.html
>
> Recovery parameters for the OSDs match what was in the previous thread,
> sans the osd conf block listed. And current osd_max_backfills = 30 and
> osd_recovery_max_active = 35. Very little activity on the OSDs during this
> period, so there should not be any contention for iops on the SSDs.
>
> The only oddity that I can attribute to things is that we had a few
> periods of time where the disk load on one of the mons was high enough to
> cause the mon to drop out of quorum for a brief amount of time, a few
> times. But I wouldn’t think backfills would just get throttled due to mons
> flapping.
>
> Hopefully someone has some experience or can steer me in a path to improve
> the performance of the backfills so that I’m not stuck in backfill
> purgatory longer than I need to be.
>
> Linking an imgur album with some screen grabs of the recovery ops over
> time for the first machine, versus the second and third machines to
> demonstrate the delta between them.
> https://imgur.com/a/OJw4b
>
> Also including a ceph osd df of the SSDs, highlighted in red are the OSDs
> currently backfilling. Could this possibly be PG overdose? I don’t ever run
> into ‘stuck activating’ PGs, it’s just painfully slow backfills, like they
> are being throttled by ceph, that are causing me to worry. Drives aren’t
> worn, <30 P/E cycles on the drives, so plenty of life left in them.
>
> Thanks,
> Reed
>
> $ ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
> 24   ssd 1.76109  1.0 1803G 1094G  708G 60.69 1.08 260
> 25   ssd 1.76109  1.0 1803G 1136G  

[ceph-users] SSD Bluestore Backfills Slow

2018-02-21 Thread Reed Dier
Hi all,

I am running into an odd situation that I cannot easily explain.
I am currently in the midst of destroy and rebuild of OSDs from filestore to 
bluestore.
With my HDDs, I am seeing expected behavior, but with my SSDs I am seeing 
unexpected behavior. The HDDs and SSDs are set in crush accordingly.

My path to replacing the OSDs is to set the noout, norecover, norebalance flag, 
destroy the OSD, create the OSD back, (iterate n times, all within a single 
failure domain), unset the flags, and let it go. It finishes, rinse, repeat.
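
Roughly, each cycle looks like this (a sketch only; the OSD id and device paths are placeholders, and the exact ceph-volume flags depend on your layout):

ceph osd set noout && ceph osd set norecover && ceph osd set norebalance
systemctl stop ceph-osd@24                 # placeholder OSD id
ceph osd destroy 24 --yes-i-really-mean-it # marks the id destroyed so it can be reused
ceph-volume lvm zap /dev/sdc               # wipe the old data device
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1
ceph osd unset norebalance && ceph osd unset norecover && ceph osd unset noout

Depending on the ceph-volume version, --osd-id 24 can be passed to lvm create so the rebuilt OSD reuses the destroyed id instead of taking a new one.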

For the SSD OSDs, they are SATA SSDs (Samsung SM863a), 10 to a node, with 2 
NVMe drives (Intel P3700), 5 SATA SSDs to 1 NVMe drive, 16G partitions for 
block.db (previously filestore journals).
2x10GbE networking between the nodes. SATA backplane caps out at around 10 Gb/s 
as it has 2x 6 Gb/s controllers. Luminous 12.2.2.

When the flags are unset, recovery starts and I see a very large rush of 
traffic, however, after the first machine completed, the performance tapered 
off at a rapid pace and trickles. Comparatively, I’m getting 100-200 recovery 
ops on 3 HDDs, backfilling from 21 other HDDs, whereas I’m getting 150-250 
recovery ops on 5 SSDs, backfilling from 40 other SSDs. Every once in a while I 
will see a spike up to 500, 1000, or even 2000 ops on the SSDs, often a few 
hundred recovery ops from one OSD, and 8-15 ops from the others that are 
backfilling.

This is a far cry from the more than 15-30k recovery ops that it started off 
recovering with 1-3k recovery ops from a single OSD to the backfilling OSD(s). 
And an even farther cry from the >15k recovery ops I was sustaining for over an 
hour or more before. I was able to rebuild a 1.9T SSD (1.1T used) in a little 
under an hour, and I could do about 5 at a time and still keep it at roughly an 
hour to backfill all of them, but then I hit a roadblock after the first 
machine, when I tried to do 10 at a time (single machine). I am now still 
experiencing the same thing on the third node, while doing 5 OSDs at a time. 

The pools associated with these SSDs are cephfs-metadata, as well as a pure 
rados object pool we use for our own internal applications. Both are size=3, 
min_size=2.

It appears I am not the first to run into this, but it looks like there was no 
resolution: https://www.spinics.net/lists/ceph-users/msg41493.html 


Recovery parameters for the OSDs match what was in the previous thread, sans 
the osd conf block listed. And current osd_max_backfills = 30 and 
osd_recovery_max_active = 35. Very little activity on the OSDs during this 
period, so there should not be any contention for iops on the SSDs.
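
Those values can be confirmed against a live daemon over its admin socket (run on the OSD’s host; osd.24 is just an example id):

ceph daemon osd.24 config get osd_max_backfills
ceph daemon osd.24 config get osd_recovery_max_active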

The only oddity that I can attribute to things is that we had a few periods of 
time where the disk load on one of the mons was high enough to cause the mon to 
drop out of quorum for a brief amount of time, a few times. But I wouldn’t 
think backfills would just get throttled due to mons flapping.

Hopefully someone has some experience or can steer me in a path to improve the 
performance of the backfills so that I’m not stuck in backfill purgatory longer 
than I need to be.

Linking an imgur album with some screen grabs of the recovery ops over time for 
the first machine, versus the second and third machines to demonstrate the 
delta between them.
https://imgur.com/a/OJw4b 

Also including a ceph osd df of the SSDs, highlighted in red are the OSDs 
currently backfilling. Could this possibly be PG overdose? I don’t ever run 
into ‘stuck activating’ PGs, its just painfully slow backfills, like they are 
being throttled by ceph, that are causing me to worry. Drives aren’t worn, <30 
P/E cycles on the drives, so plenty of life left in them.
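
For what it’s worth, that wear figure can be pulled from SMART; a sketch assuming smartmontools and a placeholder device path (attribute names vary by vendor):

sudo smartctl -A /dev/sdc | egrep -i 'wear|written'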

Thanks,
Reed

> $ ceph osd df
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
> 24   ssd 1.76109  1.0 1803G 1094G  708G 60.69 1.08 260
> 25   ssd 1.76109  1.0 1803G 1136G  667G 63.01 1.12 271
> 26   ssd 1.76109  1.0 1803G 1018G  785G 56.46 1.01 243
> 27   ssd 1.76109  1.0 1803G 1065G  737G 59.10 1.05 253
> 28   ssd 1.76109  1.0 1803G 1026G  776G 56.94 1.02 245
> 29   ssd 1.76109  1.0 1803G 1132G  671G 62.79 1.12 270
> 30   ssd 1.76109  1.0 1803G  944G  859G 52.35 0.93 224
> 31   ssd 1.76109  1.0 1803G 1061G  742G 58.85 1.05 252
> 32   ssd 1.76109  1.0 1803G 1003G  799G 55.67 0.99 239
> 33   ssd 1.76109  1.0 1803G 1049G  753G 58.20 1.04 250
> 34   ssd 1.76109  1.0 1803G 1086G  717G 60.23 1.07 257
> 35   ssd 1.76109  1.0 1803G  978G  824G 54.26 0.97 232
> 36   ssd 1.76109  1.0 1803G 1057G  745G 58.64 1.05 252
> 37   ssd 1.76109  1.0 1803G 1025G  777G 56.88 1.01 244
> 38   ssd 1.76109  1.0 1803G 1047G  756G 58.06 1.04 250
> 39   ssd 1.76109  1.0 1803G 1031G  771G 57.20 1.02 246
> 40   ssd 1.76109  1.0 1803G 1029G  774G 57.07 1.02 245
> 41   ssd 1.76109  1.0 1803G 1033G  770G 57.28 1.02 245
> 42   ssd