Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-24 Thread Mazzystr
Hi Sage, thanks for chiming in.  I can't imagine how busy you are.

Sorry guys.  I reprovisioned the offending osd right after this email and a
conversation on #ceph.  I do have the output from '/usr/bin/ceph daemon
osd.5 perf dump | /usr/bin/jq .' saved.  I'll be happy to add it to the
issue tracker.
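
In the meantime, here is a quick way to pull the same counters from every OSD
on a host (just a sketch, not official tooling; socket paths assume the default
/var/run/ceph layout and will differ for containerized OSDs):

  for sock in /var/run/ceph/ceph-osd.*.asok; do
      id=${sock##*ceph-osd.}; id=${id%.asok}
      # bluefs counters only; slow_used_bytes > 0 means spillover onto the data disk
      ceph daemon "$sock" perf dump 2>/dev/null | \
        jq -r --arg id "$id" '.bluefs | "osd.\($id) db_used=\(.db_used_bytes) slow_used=\(.slow_used_bytes)"'
  done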

/C


On Fri, Mar 22, 2019 at 7:01 PM Sage Weil  wrote:

> I have a ticket open for this:
>
> http://tracker.ceph.com/issues/38745
>
> Please comment there with the health warning you're seeing and any other
> details so we can figure out why it's happening.
>
> I wouldn't reprovision those OSDs yet, until we know why it happens.
> Also, it's likely that ceph-bluestore-tool can sort it out by
> adding/removing the db volume.
>
> Thanks!
> sage
>
>
> On Fri, 22 Mar 2019, Mazzystr wrote:
>
> > I am also seeing BlueFS spill since updating to Nautilus.  I also see
> > high slow_used_bytes and slow_total_bytes metrics.  It sure looks to me
> > that the only solution is to zap and rebuild the osd.  I had to manually
> > check 36 osds, some of them traditional processes and some containerized.
> > The lack of tooling here is underwhelming...  As soon as I rebuilt the osd
> > the "BlueFS spill..." warning went away.
> >
> > I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I
> > don't understand the spillover.
> >
> >
> > On Fri, Feb 15, 2019 at 12:33 PM David Turner 
> wrote:
> >
> > > The answer is probably going to be in how big your DB partition is vs
> how
> > > big your HDD disk is.  From your output it looks like you have a 6TB
> HDD
> > > with a 28GB Blocks.DB partition.  Even though the DB used size isn't
> > > currently full, I would guess that at some point since this OSD was
> created
> > > that it did fill up and what you're seeing is the part of the DB that
> > > spilled over to the data disk.  This is why the official recommendation
> > > (that is quite cautious, but cautious because some use cases will use
> this
> > > up) for a blocks.db partition is 4% of the data drive.  For your 6TB
> disks
> > > that's a recommendation of 240GB per DB partition.  Of course the
> actual
> > > size of the DB needed is dependent on your use case.  But pretty much
> every
> > > use case for a 6TB disk needs a bigger partition than 28GB.
> > >
> > > On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin 
> > > wrote:
> > >
> > >> Wrong metadata paste of osd.73 in previous message.
> > >>
> > >>
> > >> {
> > >>
> > >>  "id": 73,
> > >>  "arch": "x86_64",
> > >>  "back_addr": "10.10.10.6:6804/175338",
> > >>  "back_iface": "vlan3",
> > >>  "bluefs": "1",
> > >>  "bluefs_db_access_mode": "blk",
> > >>  "bluefs_db_block_size": "4096",
> > >>  "bluefs_db_dev": "259:22",
> > >>  "bluefs_db_dev_node": "nvme2n1",
> > >>  "bluefs_db_driver": "KernelDevice",
> > >>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> > >>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
> > >>  "bluefs_db_rotational": "0",
> > >>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
> > >>  "bluefs_db_size": "30064771072",
> > >>  "bluefs_db_type": "nvme",
> > >>  "bluefs_single_shared_device": "0",
> > >>  "bluefs_slow_access_mode": "blk",
> > >>  "bluefs_slow_block_size": "4096",
> > >>  "bluefs_slow_dev": "8:176",
> > >>  "bluefs_slow_dev_node": "sdl",
> > >>  "bluefs_slow_driver": "KernelDevice",
> > >>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
> > >>  "bluefs_slow_partition_path": "/dev/sdl2",
> > >>  "bluefs_slow_rotational": "1",
> > >>  "bluefs_slow_size": "6001069199360",
> > >>  "bluefs_slow_type": "hdd",
> > >>  "bluefs_wal_access_mode": "blk",
> > >>  "bluefs_wal_block_size": "4096",
> > >>  "bluefs_wal_dev": "259:22",
> > >>  "bluefs_wal_dev_node": "nvme2n1",
> > >>  "bluefs_wal_driver": "KernelDevice",
> > >>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
> > >>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
> > >>  "bluefs_wal_rotational": "0",
> > >>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
> > >>  "bluefs_wal_size": "1073741824",
> > >>  "bluefs_wal_type": "nvme",
> > >>  "bluestore_bdev_access_mode": "blk",
> > >>  "bluestore_bdev_block_size": "4096",
> > >>  "bluestore_bdev_dev": "8:176",
> > >>  "bluestore_bdev_dev_node": "sdl",
> > >>  "bluestore_bdev_driver": "KernelDevice",
> > >>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
> > >>  "bluestore_bdev_partition_path": "/dev/sdl2",
> > >>  "bluestore_bdev_rotational": "1",
> > >>  "bluestore_bdev_size": "6001069199360",
> > >>  "bluestore_bdev_type": "hdd",
> > >>  "ceph_version": "ceph version 12.2.10
> > >> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
> > >>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
> > >>  "default_device_class": "hdd",
> > >>  "distro": "centos",
> > >>  

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Konstantin Shalygin

On 3/23/19 12:20 AM, Mazzystr wrote:

inline...

On Fri, Mar 22, 2019 at 1:08 PM Konstantin Shalygin wrote:


On 3/22/19 11:57 PM, Mazzystr wrote:
> I am also seeing BlueFS spill since updating to Nautilus.  I also see
> high slow_used_bytes and slow_total_bytes metrics.  It sure looks to
> me that the only solution is to zap and rebuild the osd.  I had to
> manually check 36 osds, some of them traditional processes and some
> containerized.  The lack of tooling here is underwhelming...  As soon
> as I rebuilt the osd the "BlueFS spill..." warning went away.
>
> I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I
> don't understand the spillover.

Wow, it's something new. What is your upgrade path?


I keep current with the community releases.  All osds have been rebuilt as of 
Luminous.


Also, do you record cluster metrics, e.g. via Prometheus, to see the
difference between upgrades?

Unfortunately not.  I've only had Prometheus running for about two 
weeks, and I had it turned off for a couple of days for some unknown 
reason... :/


That is a pity, because it would have been good to see how the metrics behaved on a graph.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Mazzystr
inline...

On Fri, Mar 22, 2019 at 1:08 PM Konstantin Shalygin  wrote:

> On 3/22/19 11:57 PM, Mazzystr wrote:
> > I am also seeing BlueFS spill since updating to Nautilus.  I also see
> > high slow_used_bytes and slow_total_bytes metrics.  It sure looks to
> > me that the only solution is to zap and rebuild the osd.  I had to
> > manually check 36 osds, some of them traditional processes and some
> > containerized.  The lack of tooling here is underwhelming...  As soon
> > as I rebuilt the osd the "BlueFS spill..." warning went away.
> >
> > I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I
> > don't understand the spillover.
>
> Wow, it's something new. What is your upgrade path?
>
>
I keep current with the community releases.  All osds have been rebuilt as of
Luminous.



> Also, do you record cluster metrics, e.g. via Prometheus, to see the
> difference between upgrades?
>
>
Unfortunately not.  I've only had Prometheus running for about two weeks,
and I had it turned off for a couple of days for some unknown reason... :/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Konstantin Shalygin

On 3/22/19 11:57 PM, Mazzystr wrote:
I am also seeing BlueFS spill since updating to Nautilus.  I also see 
high slow_used_bytes and slow_total_bytes metrics.  It sure looks to 
me that the only solution is to zap and rebuild the osd.  I had to 
manually check 36 osds, some of them traditional processes and some 
containerized.  The lack of tooling here is underwhelming...  As soon 
as I rebuilt the osd the "BlueFS spill..." warning went away.


I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I 
don't understand the spillover.


Wow, it's something new. What is your upgrade path?

Also, do you record cluster metrics, e.g. via Prometheus, to see the 
difference between upgrades?



Thanks,

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Sage Weil
I have a ticket open for this:

http://tracker.ceph.com/issues/38745

Please comment there with the health warning you're seeing and any other 
details so we can figure out why it's happening.

I wouldn't reprovision those OSDs yet, until we know why it happens.  
Also, it's likely that ceph-bluestore-tool can sort it out by 
adding/removing the db volume.

Thanks!
sage


On Fri, 22 Mar 2019, Mazzystr wrote:

> I am also seeing BlueFS spill since updating to Nautilus.  I also see high
> slow_used_bytes and slow_total_bytes metrics.  It sure looks to me that the
> only solution is to zap and rebuild the osd.  I had to manually check 36
> osds, some of them traditional processes and some containerized.  The lack
> of tooling here is underwhelming...  As soon as I rebuilt the osd the
> "BlueFS spill..." warning went away.
> 
> I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I don't
> understand the spillover.
> 
> 
> On Fri, Feb 15, 2019 at 12:33 PM David Turner  wrote:
> 
> > The answer is probably going to be in how big your DB partition is vs how
> > big your HDD disk is.  From your output it looks like you have a 6TB HDD
> > with a 28GB Blocks.DB partition.  Even though the DB used size isn't
> > currently full, I would guess that at some point since this OSD was created
> > that it did fill up and what you're seeing is the part of the DB that
> > spilled over to the data disk.  This is why the official recommendation
> > (that is quite cautious, but cautious because some use cases will use this
> > up) for a blocks.db partition is 4% of the data drive.  For your 6TB disks
> > that's a recommendation of 240GB per DB partition.  Of course the actual
> > size of the DB needed is dependent on your use case.  But pretty much every
> > use case for a 6TB disk needs a bigger partition than 28GB.
> >
> > On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin 
> > wrote:
> >
> >> Wrong metadata paste of osd.73 in previous message.
> >>
> >>
> >> {
> >>
> >>  "id": 73,
> >>  "arch": "x86_64",
> >>  "back_addr": "10.10.10.6:6804/175338",
> >>  "back_iface": "vlan3",
> >>  "bluefs": "1",
> >>  "bluefs_db_access_mode": "blk",
> >>  "bluefs_db_block_size": "4096",
> >>  "bluefs_db_dev": "259:22",
> >>  "bluefs_db_dev_node": "nvme2n1",
> >>  "bluefs_db_driver": "KernelDevice",
> >>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
> >>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
> >>  "bluefs_db_rotational": "0",
> >>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
> >>  "bluefs_db_size": "30064771072",
> >>  "bluefs_db_type": "nvme",
> >>  "bluefs_single_shared_device": "0",
> >>  "bluefs_slow_access_mode": "blk",
> >>  "bluefs_slow_block_size": "4096",
> >>  "bluefs_slow_dev": "8:176",
> >>  "bluefs_slow_dev_node": "sdl",
> >>  "bluefs_slow_driver": "KernelDevice",
> >>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
> >>  "bluefs_slow_partition_path": "/dev/sdl2",
> >>  "bluefs_slow_rotational": "1",
> >>  "bluefs_slow_size": "6001069199360",
> >>  "bluefs_slow_type": "hdd",
> >>  "bluefs_wal_access_mode": "blk",
> >>  "bluefs_wal_block_size": "4096",
> >>  "bluefs_wal_dev": "259:22",
> >>  "bluefs_wal_dev_node": "nvme2n1",
> >>  "bluefs_wal_driver": "KernelDevice",
> >>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
> >>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
> >>  "bluefs_wal_rotational": "0",
> >>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
> >>  "bluefs_wal_size": "1073741824",
> >>  "bluefs_wal_type": "nvme",
> >>  "bluestore_bdev_access_mode": "blk",
> >>  "bluestore_bdev_block_size": "4096",
> >>  "bluestore_bdev_dev": "8:176",
> >>  "bluestore_bdev_dev_node": "sdl",
> >>  "bluestore_bdev_driver": "KernelDevice",
> >>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
> >>  "bluestore_bdev_partition_path": "/dev/sdl2",
> >>  "bluestore_bdev_rotational": "1",
> >>  "bluestore_bdev_size": "6001069199360",
> >>  "bluestore_bdev_type": "hdd",
> >>  "ceph_version": "ceph version 12.2.10
> >> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
> >>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
> >>  "default_device_class": "hdd",
> >>  "distro": "centos",
> >>  "distro_description": "CentOS Linux 7 (Core)",
> >>  "distro_version": "7",
> >>  "front_addr": "172.16.16.16:6803/175338",
> >>  "front_iface": "vlan4",
> >>  "hb_back_addr": "10.10.10.6:6805/175338",
> >>  "hb_front_addr": "172.16.16.16:6805/175338",
> >>  "hostname": "ceph-osd5",
> >>  "journal_rotational": "0",
> >>  "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
> >>  "kernel_version": "3.10.0-862.11.6.el7.x86_64",
> >>  "mem_swap_kb": "0",
> >>  "mem_total_kb": "65724256",
> >>  "os": "Linux",
> >>  

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-22 Thread Mazzystr
I am also seeing BlueFS spill since updating to Nautilus.  I also see high
slow_used_bytes and slow_total_bytes metrics.  It sure looks to me that the
only solution is to zap and rebuild the osd.  I had to manually check 36
osds, some of them traditional processes and some containerized.  The lack
of tooling here is underwhelming...  As soon as I rebuilt the osd the
"BlueFS spill..." warning went away.

I use 50GB db partitions on an NVMe with 3 or 6 TB spinning disks.  I don't
understand the spillover.
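
For anyone else hunting affected OSDs: the spillover is reported as a health
warning, so something like the following should list them cluster-wide without
walking 36 admin sockets (the exact wording of the warning depends on the
release):

  ceph health detail | grep -i spill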


On Fri, Feb 15, 2019 at 12:33 PM David Turner  wrote:

> The answer is probably going to be in how big your DB partition is vs how
> big your HDD disk is.  From your output it looks like you have a 6TB HDD
> with a 28GB Blocks.DB partition.  Even though the DB used size isn't
> currently full, I would guess that at some point since this OSD was created
> that it did fill up and what you're seeing is the part of the DB that
> spilled over to the data disk.  This is why the official recommendation
> (that is quite cautious, but cautious because some use cases will use this
> up) for a blocks.db partition is 4% of the data drive.  For your 6TB disks
> that's a recommendation of 240GB per DB partition.  Of course the actual
> size of the DB needed is dependent on your use case.  But pretty much every
> use case for a 6TB disk needs a bigger partition than 28GB.
>
> On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin 
> wrote:
>
>> Wrong metadata paste of osd.73 in previous message.
>>
>>
>> {
>>
>>  "id": 73,
>>  "arch": "x86_64",
>>  "back_addr": "10.10.10.6:6804/175338",
>>  "back_iface": "vlan3",
>>  "bluefs": "1",
>>  "bluefs_db_access_mode": "blk",
>>  "bluefs_db_block_size": "4096",
>>  "bluefs_db_dev": "259:22",
>>  "bluefs_db_dev_node": "nvme2n1",
>>  "bluefs_db_driver": "KernelDevice",
>>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
>>  "bluefs_db_rotational": "0",
>>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
>>  "bluefs_db_size": "30064771072",
>>  "bluefs_db_type": "nvme",
>>  "bluefs_single_shared_device": "0",
>>  "bluefs_slow_access_mode": "blk",
>>  "bluefs_slow_block_size": "4096",
>>  "bluefs_slow_dev": "8:176",
>>  "bluefs_slow_dev_node": "sdl",
>>  "bluefs_slow_driver": "KernelDevice",
>>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
>>  "bluefs_slow_partition_path": "/dev/sdl2",
>>  "bluefs_slow_rotational": "1",
>>  "bluefs_slow_size": "6001069199360",
>>  "bluefs_slow_type": "hdd",
>>  "bluefs_wal_access_mode": "blk",
>>  "bluefs_wal_block_size": "4096",
>>  "bluefs_wal_dev": "259:22",
>>  "bluefs_wal_dev_node": "nvme2n1",
>>  "bluefs_wal_driver": "KernelDevice",
>>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
>>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
>>  "bluefs_wal_rotational": "0",
>>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
>>  "bluefs_wal_size": "1073741824",
>>  "bluefs_wal_type": "nvme",
>>  "bluestore_bdev_access_mode": "blk",
>>  "bluestore_bdev_block_size": "4096",
>>  "bluestore_bdev_dev": "8:176",
>>  "bluestore_bdev_dev_node": "sdl",
>>  "bluestore_bdev_driver": "KernelDevice",
>>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
>>  "bluestore_bdev_partition_path": "/dev/sdl2",
>>  "bluestore_bdev_rotational": "1",
>>  "bluestore_bdev_size": "6001069199360",
>>  "bluestore_bdev_type": "hdd",
>>  "ceph_version": "ceph version 12.2.10
>> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
>>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
>>  "default_device_class": "hdd",
>>  "distro": "centos",
>>  "distro_description": "CentOS Linux 7 (Core)",
>>  "distro_version": "7",
>>  "front_addr": "172.16.16.16:6803/175338",
>>  "front_iface": "vlan4",
>>  "hb_back_addr": "10.10.10.6:6805/175338",
>>  "hb_front_addr": "172.16.16.16:6805/175338",
>>  "hostname": "ceph-osd5",
>>  "journal_rotational": "0",
>>  "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
>>  "kernel_version": "3.10.0-862.11.6.el7.x86_64",
>>  "mem_swap_kb": "0",
>>  "mem_total_kb": "65724256",
>>  "os": "Linux",
>>  "osd_data": "/var/lib/ceph/osd/ceph-73",
>>  "osd_objectstore": "bluestore",
>>  "rotational": "1"
>> }
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-03-17 Thread Konstantin Shalygin

Yes, I was in a similar situation initially where I had deployed my OSD's with 
25GB DB partitions and after 3GB DB used, everything else was going into slowDB 
on disk. From memory 29GB was just enough to make the DB fit on flash, but 30GB 
is a safe round figure to aim for. With a 30GB DB partition with most RBD type 
workloads all data should reside on flash even for fairly large disks running 
erasure coding.

Nick


Nick, thank you! After upgrading to 12.2.11 I expanded the blockDB, and for 
a week after compaction the slowDB has not been used [1].



{
  "gift_bytes": 0,
  "reclaim_bytes": 0,
  "db_total_bytes": 32212897792,
  "db_used_bytes": 6572474368,
  "wal_total_bytes": 1074589696,
  "wal_used_bytes": 528482304,
  "slow_total_bytes": 240043163648,
  "slow_used_bytes": 0,
  "num_files": 113,
  "log_bytes": 8683520,
  "log_compactions": 3,
  "logged_bytes": 203821056,
  "files_written_wal": 2,
  "files_written_sst": 1138,
  "bytes_written_wal": 121626085396,
  "bytes_written_sst": 47053353874
}

I also wrote a how-to on increasing the partition size for my case; maybe it 
will be useful for someone [2].


[1] https://ibb.co/tXGqbbt

[2] https://bit.ly/2UFVO9Z
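
For anyone who just wants the gist, the BlueFS side of such an expansion looks
roughly like this (a sketch only, not a reproduction of the linked how-to;
assumes the underlying block.db partition has already been enlarged and uses
osd.73 as the example id):

  systemctl stop ceph-osd@73
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-73
  systemctl start ceph-osd@73
  ceph tell osd.73 compact   # per the note above, a compaction afterwards cleared the slow usage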

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-25 Thread Nick Fisk
> -Original Message-
> From: Vitaliy Filippov 
> Sent: 23 February 2019 20:31
> To: n...@fisk.me.uk; Serkan Çoban 
> Cc: ceph-users 
> Subject: Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow 
> storage for db - why?
> 
> Numbers are easy to calculate from RocksDB parameters, however I also don't 
> understand why it's 3 -> 30 -> 300...
> 
> Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,
> L1 should be 10 GB, and L2 should be 100 GB?

From how I understand it, RocksDB levels increment by a factor of 10:
256MB + 2.56GB + 25.6GB = ~28-29GB

Although that is a greatly simplified way of looking at it, this link explains it 
in more detail:
https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
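
A rough sanity check of those figures, assuming RocksDB's default
leveled-compaction settings (level base 256MB, size multiplier 10; approximate
numbers, ignoring overheads):

  base_mb=256                      # max_bytes_for_level_base, as a round number
  l1_mb=$((base_mb * 10))          # 2560 MB
  l2_mb=$((l1_mb * 10))            # 25600 MB
  echo "DB must hold L0+L1:    ~$((base_mb + l1_mb)) MB   (the ~3GB figure)"
  echo "DB must hold L0+L1+L2: ~$((base_mb + l1_mb + l2_mb)) MB  (the ~28-30GB figure)"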


> 
> >>> These sizes are roughly 3GB,30GB,300GB. Anything in-between those
> >>> sizes are pointless. Only ~3GB of SSD will ever be used out of a
> > 28GB partition. Likewise a 240GB partition is also pointless as only
> > ~30GB will be used.
> >
> > Where did you get those numbers? I would like to read more if you can
> > point to a link.
> 
> --
> With best regards,
>Vitaliy Filippov

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-25 Thread Nick Fisk



> -Original Message-
> From: Konstantin Shalygin 
> Sent: 22 February 2019 14:23
> To: Nick Fisk 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow 
> storage for db - why?
> 
> Bluestore/RocksDB will only put the next level up size of DB on flash if the 
> whole size will fit.
> These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
> pointless. Only ~3GB of SSD will ever be used out of a
> 28GB partition. Likewise a 240GB partition is also pointless as only ~30GB 
> will be used.
> 
> I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB 
> disks. The 10TB's are about 75% full and use around 14GB,
> this is on mainly 3x Replica RBD(4MB objects)
> 
> Nick
> 
> Can you explain more? Do you mean that I should increase my 28GB to 30GB and 
> that will do the trick?
> How big is your db_slow? Should we keep an eye on it? Do you? How?

Yes, I was in a similar situation initially where I had deployed my OSD's with 
25GB DB partitions and after 3GB DB used, everything else was going into slowDB 
on disk. From memory 29GB was just enough to make the DB fit on flash, but 30GB 
is a safe round figure to aim for. With a 30GB DB partition with most RBD type 
workloads all data should reside on flash even for fairly large disks running 
erasure coding.

Nick

> 
> 
> k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-24 Thread Anthony D'Atri

> Date: Fri, 22 Feb 2019 16:26:34 -0800
> From: solarflow99 
> 
> 
> Aren't you undersized at only 30GB?  I thought you should have 4% of your
> OSDs

The 4% guidance is new.  Until relatively recently the oft-suggested and 
default value was 1%.

> From: "Vitaliy Filippov" 
> Numbers are easy to calculate from RocksDB parameters, however I also  
> don't understand why it's 3 -> 30 -> 300...
> 
> Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,  
> L1 should be 10 GB, and L2 should be 100 GB?

I’m very curious as well; one would think that in practice the size and usage 
of the OSD would be factors, something the docs imply.

This is an area where we could really use more concrete guidance.  Clusters 
especially using HDDs are often doing so for $/TB reasons.  Economics and 
available slots are constraints on how much faster WAL+DB storage can be 
provisioned.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-23 Thread Vitaliy Filippov
Numbers are easy to calculate from RocksDB parameters, however I also  
don't understand why it's 3 -> 30 -> 300...


Default memtables are 256 MB, there are 4 of them, so L0 should be 1 GB,  
L1 should be 10 GB, and L2 should be 100 GB?


These sizes are roughly 3GB,30GB,300GB. Anything in-between those  
sizes are pointless. Only ~3GB of SSD will ever be used out of a

28GB partition. Likewise a 240GB partition is also pointless as only
~30GB will be used.

Where did you get those numbers? I would like to read more if you can
point to a link.


--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-22 Thread solarflow99
Aren't you undersized at only 30GB?  I thought you should have 4% of your
OSDs


On Fri, Feb 22, 2019 at 3:10 PM Nick Fisk  wrote:

> >On 2/16/19 12:33 AM, David Turner wrote:
> >> The answer is probably going to be in how big your DB partition is vs
> >> how big your HDD disk is.  From your output it looks like you have a
> >> 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used
> >> size isn't currently full, I would guess that at some point since
> >> this OSD was created that it did fill up and what you're seeing is
> >> the part of the DB that spilled over to the data disk. This is why
> >> the official recommendation (that is quite cautious, but cautious
> >> because some use cases will use this up) for a blocks.db partition is
> >> 4% of the data drive.  For your 6TB disks that's a recommendation of
> >> 240GB per DB partition.  Of course the actual size of the DB needed
> >> is dependent on your use case.  But pretty much every use case for a
> >> 6TB disk needs a bigger partition than 28GB.
> >
> >
> >My current db size of osd.33 is 7910457344 bytes, and osd.73 is
> >2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
> >6388Mbyte (6.69% of db_total_bytes).
> >
> >Why is osd.33 not using slow storage in this case?
>
> Bluestore/RocksDB will only put the next level up size of DB on flash if
> the whole size will fit.
> These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes
> are pointless. Only ~3GB of SSD will ever be used out of a
> 28GB partition. Likewise a 240GB partition is also pointless as only ~30GB
> will be used.
>
> I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB
> disks. The 10TB's are about 75% full and use around 14GB,
> this is on mainly 3x Replica RBD(4MB objects)
>
> Nick
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-22 Thread Konstantin Shalygin

Bluestore/RocksDB will only put the next level up size of DB on flash if the 
whole size will fit.
These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
pointless. Only ~3GB of SSD will ever be used out of a
28GB partition. Likewise a 240GB partition is also pointless as only ~30GB will 
be used.

I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB 
disks. The 10TB's are about 75% full and use around 14GB,
this is on mainly 3x Replica RBD(4MB objects)

Nick


Can you explain more? Do you mean that I should increase my 28GB to 30GB 
and that will do the trick?


How big is your db_slow? Should we keep an eye on it? Do you? How?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-22 Thread Serkan Çoban
>Where did you get those numbers? I would like to read more if you can
point to a link.

Just found the link:
https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

On Fri, Feb 22, 2019 at 4:22 PM Serkan Çoban  wrote:
>
> >>These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
> >>pointless. Only ~3GB of SSD will ever be used out of a
> 28GB partition. Likewise a 240GB partition is also pointless as only
> ~30GB will be used.
>
> Where did you get those numbers? I would like to read more if you can
> point to a link.
>
> On Fri, Feb 22, 2019 at 2:30 PM Nick Fisk  wrote:
> >
> > >On 2/16/19 12:33 AM, David Turner wrote:
> > >> The answer is probably going to be in how big your DB partition is vs
> > >> how big your HDD disk is.  From your output it looks like you have a
> > >> 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used
> > >> size isn't currently full, I would guess that at some point since
> > >> this OSD was created that it did fill up and what you're seeing is
> > >> the part of the DB that spilled over to the data disk. This is why
> > >> the official recommendation (that is quite cautious, but cautious
> > >> because some use cases will use this up) for a blocks.db partition is
> > >> 4% of the data drive.  For your 6TB disks that's a recommendation of
> > >> 240GB per DB partition.  Of course the actual size of the DB needed
> > >> is dependent on your use case.  But pretty much every use case for a
> > >> 6TB disk needs a bigger partition than 28GB.
> > >
> > >
> > >My current db size of osd.33 is 7910457344 bytes, and osd.73 is
> > >2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
> > >6388Mbyte (6.69% of db_total_bytes).
> > >
> > >Why is osd.33 not using slow storage in this case?
> >
> > Bluestore/RocksDB will only put the next level up size of DB on flash if 
> > the whole size will fit.
> > These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
> > pointless. Only ~3GB of SSD will ever be used out of a
> > 28GB partition. Likewise a 240GB partition is also pointless as only ~30GB 
> > will be used.
> >
> > I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB 
> > disks. The 10TB's are about 75% full and use around 14GB,
> > this is on mainly 3x Replica RBD(4MB objects)
> >
> > Nick
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-22 Thread Serkan Çoban
>>These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
>>pointless. Only ~3GB of SSD will ever be used out of a
28GB partition. Likewise a 240GB partition is also pointless as only
~30GB will be used.

Where did you get those numbers? I would like to read more if you can
point to a link.

On Fri, Feb 22, 2019 at 2:30 PM Nick Fisk  wrote:
>
> >On 2/16/19 12:33 AM, David Turner wrote:
> >> The answer is probably going to be in how big your DB partition is vs
> >> how big your HDD disk is.  From your output it looks like you have a
> >> 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used
> >> size isn't currently full, I would guess that at some point since
> >> this OSD was created that it did fill up and what you're seeing is
> >> the part of the DB that spilled over to the data disk. This is why
> >> the official recommendation (that is quite cautious, but cautious
> >> because some use cases will use this up) for a blocks.db partition is
> >> 4% of the data drive.  For your 6TB disks that's a recommendation of
> >> 240GB per DB partition.  Of course the actual size of the DB needed
> >> is dependent on your use case.  But pretty much every use case for a
> >> 6TB disk needs a bigger partition than 28GB.
> >
> >
> >My current db size of osd.33 is 7910457344 bytes, and osd.73 is
> >2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
> >6388Mbyte (6.69% of db_total_bytes).
> >
> >Why is osd.33 not using slow storage in this case?
>
> Bluestore/RocksDB will only put the next level up size of DB on flash if the 
> whole size will fit.
> These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
> pointless. Only ~3GB of SSD will ever be used out of a
> 28GB partition. Likewise a 240GB partition is also pointless as only ~30GB 
> will be used.
>
> I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB 
> disks. The 10TB's are about 75% full and use around 14GB,
> this is on mainly 3x Replica RBD(4MB objects)
>
> Nick
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-22 Thread Nick Fisk
>On 2/16/19 12:33 AM, David Turner wrote:
>> The answer is probably going to be in how big your DB partition is vs 
>> how big your HDD disk is.  From your output it looks like you have a 
>> 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used 
>> size isn't currently full, I would guess that at some point since 
>> this OSD was created that it did fill up and what you're seeing is 
>> the part of the DB that spilled over to the data disk. This is why 
>> the official recommendation (that is quite cautious, but cautious 
>> because some use cases will use this up) for a blocks.db partition is 
>> 4% of the data drive.  For your 6TB disks that's a recommendation of 
>> 240GB per DB partition.  Of course the actual size of the DB needed 
>> is dependent on your use case.  But pretty much every use case for a 
>> 6TB disk needs a bigger partition than 28GB.
>
>
>My current db size of osd.33 is 7910457344 bytes, and osd.73 is
>2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
>6388Mbyte (6.69% of db_total_bytes).
>
>Why is osd.33 not using slow storage in this case?

Bluestore/RocksDB will only put the next level up size of DB on flash if the 
whole size will fit.
These sizes are roughly 3GB,30GB,300GB. Anything in-between those sizes are 
pointless. Only ~3GB of SSD will ever be used out of a
28GB partition. Likewise a 240GB partition is also pointless as only ~30GB will 
be used.

I'm currently running 30GB partitions on my cluster with a mix of 6,8,10TB 
disks. The 10TB's are about 75% full and use around 14GB,
this is on mainly 3x Replica RBD(4MB objects)
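
A quick way to see which side of that threshold an OSD falls on is to read the
provisioned DB size from its metadata (osd.73 here is only an example id, and
the 30GB cut-off is the rough figure above, not an exact constant):

  db_bytes=$(ceph osd metadata osd.73 | jq -r '.bluefs_db_size')
  echo "osd.73 block.db: $((db_bytes / 1024 / 1024 / 1024)) GiB"
  # below the ~30GB level boundary, only the ~3GB level set will ever sit on flash
  [ "$db_bytes" -lt $((30 * 1024 * 1024 * 1024)) ] && echo "partition is below the ~30GB boundary"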

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-20 Thread Igor Fedotov

You're right - WAL/DB expansion capability is present in Luminous+ releases.

But David meant volume migration stuff which appeared in Nautilus, see:

https://github.com/ceph/ceph/pull/23103
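
For reference, that Nautilus-era migration looks roughly like the sketch below;
the subcommand and option names should be verified against the release you
actually run, and osd.73 is only an example:

  systemctl stop ceph-osd@73
  # move BlueFS data that spilled onto the main (block) device back to block.db
  ceph-bluestore-tool bluefs-bdev-migrate \
      --path /var/lib/ceph/osd/ceph-73 \
      --devs-source /var/lib/ceph/osd/ceph-73/block \
      --dev-target /var/lib/ceph/osd/ceph-73/block.db
  systemctl start ceph-osd@73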


Thanks,

Igor

On 2/20/2019 9:22 AM, Konstantin Shalygin wrote:

On 2/19/19 11:46 PM, David Turner wrote:
I don't know that there's anything that can be done to resolve this 
yet without rebuilding the OSD. Based on a Nautilus tool being able 
to resize the DB device, I'm assuming that Nautilus is also capable 
of migrating the DB/WAL between devices.  That functionality would 
allow anyone to migrate their DB back off of their spinner which is 
what's happening to you.  I don't believe that sort of tooling exists 
yet, though, without compiling the Nautilus Beta tooling for yourself.


I think you are wrong there: initially the bluestore tool could only expand 
WAL/DB devices [1]. With the latest releases of Mimic and Luminous this 
should work fine.


And only master has received the feature for expanding the main device [2].



[1] 
https://github.com/ceph/ceph/commit/2184e3077caa9de5f21cc901d26f6ecfb76de9e1


[2] 
https://github.com/ceph/ceph/commit/d07c10dfc02e4cdeda288bf39b8060b10da5bbf9


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-19 Thread Konstantin Shalygin

On 2/19/19 11:46 PM, David Turner wrote:
I don't know that there's anything that can be done to resolve this 
yet without rebuilding the OSD.  Based on a Nautilus tool being able 
to resize the DB device, I'm assuming that Nautilus is also capable of 
migrating the DB/WAL between devices.  That functionality would allow 
anyone to migrate their DB back off of their spinner which is what's 
happening to you.  I don't believe that sort of tooling exists yet, 
though, without compiling the Nautilus Beta tooling for yourself.


I think you are wrong there: initially the bluestore tool could only expand 
WAL/DB devices [1]. With the latest releases of Mimic and Luminous this should 
work fine.


And only master has received the feature for expanding the main device [2].



[1] 
https://github.com/ceph/ceph/commit/2184e3077caa9de5f21cc901d26f6ecfb76de9e1


[2] 
https://github.com/ceph/ceph/commit/d07c10dfc02e4cdeda288bf39b8060b10da5bbf9


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-19 Thread David Turner
I don't know that there's anything that can be done to resolve this yet
without rebuilding the OSD.  Based on a Nautilus tool being able to resize
the DB device, I'm assuming that Nautilus is also capable of migrating the
DB/WAL between devices.  That functionality would allow anyone to migrate
their DB back off of their spinner which is what's happening to you.  I
don't believe that sort of tooling exists yet, though, without compiling
the Nautilus Beta tooling for yourself.

On Tue, Feb 19, 2019 at 12:03 AM Konstantin Shalygin  wrote:

> On 2/18/19 9:43 PM, David Turner wrote:
> > Do you have historical data from these OSDs to see when/if the DB used
> > on osd.73 ever filled up?  To account for this OSD using the slow
> > storage for DB, all we need to do is show that it filled up the fast
> > DB at least once.  If that happened, then something spilled over to
> > the slow storage and has been there ever since.
>
> Yes, I have. I also checked my JIRA records for what I was doing at those
> times and marked it on the timeline: [1]
>
> Another graph compares osd.(33|73) over the last year: [2]
>
>
> [1] https://ibb.co/F7smCxW
>
> [2] https://ibb.co/dKWWDzW
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-18 Thread Konstantin Shalygin

On 2/18/19 9:43 PM, David Turner wrote:
Do you have historical data from these OSDs to see when/if the DB used 
on osd.73 ever filled up?  To account for this OSD using the slow 
storage for DB, all we need to do is show that it filled up the fast 
DB at least once.  If that happened, then something spilled over to 
the slow storage and has been there ever since.


Yes, I have. I also checked my JIRA records for what I was doing at those 
times and marked it on the timeline: [1]


Another graph compares osd.(33|73) over the last year: [2]


[1] https://ibb.co/F7smCxW

[2] https://ibb.co/dKWWDzW

k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-18 Thread David Turner
Do you have historical data from these OSDs to see when/if the DB used on
osd.73 ever filled up?  To account for this OSD using the slow storage for
DB, all we need to do is show that it filled up the fast DB at least once.
If that happened, then something spilled over to the slow storage and has
been there ever since.

On Sat, Feb 16, 2019 at 1:50 AM Konstantin Shalygin  wrote:

> On 2/16/19 12:33 AM, David Turner wrote:
> > The answer is probably going to be in how big your DB partition is vs
> > how big your HDD disk is.  From your output it looks like you have a
> > 6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used size
> > isn't currently full, I would guess that at some point since this OSD
> > was created that it did fill up and what you're seeing is the part of
> > the DB that spilled over to the data disk. This is why the official
> > recommendation (that is quite cautious, but cautious because some use
> > cases will use this up) for a blocks.db partition is 4% of the data
> > drive.  For your 6TB disks that's a recommendation of 240GB per DB
> > partition.  Of course the actual size of the DB needed is dependent on
> > your use case.  But pretty much every use case for a 6TB disk needs a
> > bigger partition than 28GB.
>
>
> My current db size of osd.33 is 7910457344 bytes, and osd.73 is
> 2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs
> 6388Mbyte (6.69% of db_total_bytes).
>
> Why is osd.33 not using slow storage in this case?
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-15 Thread Konstantin Shalygin

On 2/16/19 12:33 AM, David Turner wrote:
The answer is probably going to be in how big your DB partition is vs 
how big your HDD disk is.  From your output it looks like you have a 
6TB HDD with a 28GB Blocks.DB partition.  Even though the DB used size 
isn't currently full, I would guess that at some point since this OSD 
was created that it did fill up and what you're seeing is the part of 
the DB that spilled over to the data disk. This is why the official 
recommendation (that is quite cautious, but cautious because some use 
cases will use this up) for a blocks.db partition is 4% of the data 
drive.  For your 6TB disks that's a recommendation of 240GB per DB 
partition.  Of course the actual size of the DB needed is dependent on 
your use case.  But pretty much every use case for a 6TB disk needs a 
bigger partition than 28GB.



My current db size of osd.33 is 7910457344 bytes, and osd.73 is 
2013265920+4685037568 bytes. 7544Mbyte (24.56% of db_total_bytes) vs 
6388Mbyte (6.69% of db_total_bytes).


Why is osd.33 not using slow storage in this case?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-15 Thread David Turner
The answer is probably going to be in how big your DB partition is vs how
big your HDD disk is.  From your output it looks like you have a 6TB HDD
with a 28GB Blocks.DB partition.  Even though the DB used size isn't
currently full, I would guess that at some point since this OSD was created
that it did fill up and what you're seeing is the part of the DB that
spilled over to the data disk.  This is why the official recommendation
(that is quite cautious, but cautious because some use cases will use this
up) for a blocks.db partition is 4% of the data drive.  For your 6TB disks
that's a recommendation of 240GB per DB partition.  Of course the actual
size of the DB needed is dependent on your use case.  But pretty much every
use case for a 6TB disk needs a bigger partition than 28GB.
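
Back-of-envelope for that 4% figure, using the bluefs_slow_size from the
metadata quoted below (decimal GB, to match the 240GB number above):

  hdd_bytes=6001069199360    # bluefs_slow_size of the 6TB data device
  echo "4% rule: ~$((hdd_bytes * 4 / 100 / 1000 / 1000 / 1000)) GB of block.db per OSD"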

On Thu, Feb 14, 2019 at 11:58 PM Konstantin Shalygin  wrote:

> Wrong metadata paste of osd.73 in previous message.
>
>
> {
>
>  "id": 73,
>  "arch": "x86_64",
>  "back_addr": "10.10.10.6:6804/175338",
>  "back_iface": "vlan3",
>  "bluefs": "1",
>  "bluefs_db_access_mode": "blk",
>  "bluefs_db_block_size": "4096",
>  "bluefs_db_dev": "259:22",
>  "bluefs_db_dev_node": "nvme2n1",
>  "bluefs_db_driver": "KernelDevice",
>  "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_db_partition_path": "/dev/nvme2n1p11",
>  "bluefs_db_rotational": "0",
>  "bluefs_db_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_db_size": "30064771072",
>  "bluefs_db_type": "nvme",
>  "bluefs_single_shared_device": "0",
>  "bluefs_slow_access_mode": "blk",
>  "bluefs_slow_block_size": "4096",
>  "bluefs_slow_dev": "8:176",
>  "bluefs_slow_dev_node": "sdl",
>  "bluefs_slow_driver": "KernelDevice",
>  "bluefs_slow_model": "TOSHIBA HDWE160 ",
>  "bluefs_slow_partition_path": "/dev/sdl2",
>  "bluefs_slow_rotational": "1",
>  "bluefs_slow_size": "6001069199360",
>  "bluefs_slow_type": "hdd",
>  "bluefs_wal_access_mode": "blk",
>  "bluefs_wal_block_size": "4096",
>  "bluefs_wal_dev": "259:22",
>  "bluefs_wal_dev_node": "nvme2n1",
>  "bluefs_wal_driver": "KernelDevice",
>  "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
>  "bluefs_wal_partition_path": "/dev/nvme2n1p12",
>  "bluefs_wal_rotational": "0",
>  "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
>  "bluefs_wal_size": "1073741824",
>  "bluefs_wal_type": "nvme",
>  "bluestore_bdev_access_mode": "blk",
>  "bluestore_bdev_block_size": "4096",
>  "bluestore_bdev_dev": "8:176",
>  "bluestore_bdev_dev_node": "sdl",
>  "bluestore_bdev_driver": "KernelDevice",
>  "bluestore_bdev_model": "TOSHIBA HDWE160 ",
>  "bluestore_bdev_partition_path": "/dev/sdl2",
>  "bluestore_bdev_rotational": "1",
>  "bluestore_bdev_size": "6001069199360",
>  "bluestore_bdev_type": "hdd",
>  "ceph_version": "ceph version 12.2.10
> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)",
>  "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
>  "default_device_class": "hdd",
>  "distro": "centos",
>  "distro_description": "CentOS Linux 7 (Core)",
>  "distro_version": "7",
>  "front_addr": "172.16.16.16:6803/175338",
>  "front_iface": "vlan4",
>  "hb_back_addr": "10.10.10.6:6805/175338",
>  "hb_front_addr": "172.16.16.16:6805/175338",
>  "hostname": "ceph-osd5",
>  "journal_rotational": "0",
>  "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
>  "kernel_version": "3.10.0-862.11.6.el7.x86_64",
>  "mem_swap_kb": "0",
>  "mem_total_kb": "65724256",
>  "os": "Linux",
>  "osd_data": "/var/lib/ceph/osd/ceph-73",
>  "osd_objectstore": "bluestore",
>  "rotational": "1"
> }
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-14 Thread Konstantin Shalygin

Wrong metadata paste of osd.73 in previous message.


{

    "id": 73,
    "arch": "x86_64",
    "back_addr": "10.10.10.6:6804/175338",
    "back_iface": "vlan3",
    "bluefs": "1",
    "bluefs_db_access_mode": "blk",
    "bluefs_db_block_size": "4096",
    "bluefs_db_dev": "259:22",
    "bluefs_db_dev_node": "nvme2n1",
    "bluefs_db_driver": "KernelDevice",
    "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
    "bluefs_db_partition_path": "/dev/nvme2n1p11",
    "bluefs_db_rotational": "0",
    "bluefs_db_serial": "CVFT4324002Q400BGN  ",
    "bluefs_db_size": "30064771072",
    "bluefs_db_type": "nvme",
    "bluefs_single_shared_device": "0",
    "bluefs_slow_access_mode": "blk",
    "bluefs_slow_block_size": "4096",
    "bluefs_slow_dev": "8:176",
    "bluefs_slow_dev_node": "sdl",
    "bluefs_slow_driver": "KernelDevice",
    "bluefs_slow_model": "TOSHIBA HDWE160 ",
    "bluefs_slow_partition_path": "/dev/sdl2",
    "bluefs_slow_rotational": "1",
    "bluefs_slow_size": "6001069199360",
    "bluefs_slow_type": "hdd",
    "bluefs_wal_access_mode": "blk",
    "bluefs_wal_block_size": "4096",
    "bluefs_wal_dev": "259:22",
    "bluefs_wal_dev_node": "nvme2n1",
    "bluefs_wal_driver": "KernelDevice",
    "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
    "bluefs_wal_partition_path": "/dev/nvme2n1p12",
    "bluefs_wal_rotational": "0",
    "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
    "bluefs_wal_size": "1073741824",
    "bluefs_wal_type": "nvme",
    "bluestore_bdev_access_mode": "blk",
    "bluestore_bdev_block_size": "4096",
    "bluestore_bdev_dev": "8:176",
    "bluestore_bdev_dev_node": "sdl",
    "bluestore_bdev_driver": "KernelDevice",
    "bluestore_bdev_model": "TOSHIBA HDWE160 ",
    "bluestore_bdev_partition_path": "/dev/sdl2",
    "bluestore_bdev_rotational": "1",
    "bluestore_bdev_size": "6001069199360",
    "bluestore_bdev_type": "hdd",
    "ceph_version": "ceph version 12.2.10 
(177915764b752804194937482a39e95e0ca3de94) luminous (stable)",

    "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
    "default_device_class": "hdd",
    "distro": "centos",
    "distro_description": "CentOS Linux 7 (Core)",
    "distro_version": "7",
    "front_addr": "172.16.16.16:6803/175338",
    "front_iface": "vlan4",
    "hb_back_addr": "10.10.10.6:6805/175338",
    "hb_front_addr": "172.16.16.16:6805/175338",
    "hostname": "ceph-osd5",
    "journal_rotational": "0",
    "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
    "kernel_version": "3.10.0-862.11.6.el7.x86_64",
    "mem_swap_kb": "0",
    "mem_total_kb": "65724256",
    "os": "Linux",
    "osd_data": "/var/lib/ceph/osd/ceph-73",
    "osd_objectstore": "bluestore",
    "rotational": "1"
}
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-14 Thread Konstantin Shalygin

Hi,

Most of my osd's use part of the slow storage for RocksDB, but some do 
not. I investigated this and think it is because they are the oldest 
Bluestore osd's in this cluster.


I figured this out from the /var/lib/osd/ creation date; I don't know 
whether it is possible to determine the real osd creation date from the osd 
metadata.


In my case that is Dec 2017, so those came from Luminous 12.2.2 - these osd's 
keep all of RocksDB on NVMe storage. The other osd's were created on 12.2.4 or 12.2.5.


I tried compacting the object store omap (`ceph tell  compact`), which 
helps, but after some period of time the slow usage grows 
again. Graph: [1]
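
For reference, the compact-and-recheck cycle looks roughly like this (just a
sketch, with osd.33 as the example id):

  ceph daemon osd.33 perf dump | jq '.bluefs | {db_used_bytes, slow_used_bytes}'
  ceph tell osd.33 compact
  # wait for compaction to finish, then compare:
  ceph daemon osd.33 perf dump | jq '.bluefs | {db_used_bytes, slow_used_bytes}'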


BlueFS stats of two osd's (33 is the old osd, where slow storage is not used): [2], [3]


[root@ceph-osd5 /]# ceph daemon osd.33 perf dump | jq .bluefs
{
  "gift_bytes": 0,
  "reclaim_bytes": 0,
  "db_total_bytes": 32212246528,
  "db_used_bytes": 6128926720,
  "wal_total_bytes": 1073737728,
  "wal_used_bytes": 528482304,
  "slow_total_bytes": 240043163648,
  "slow_used_bytes": 0, <--
  "num_files": 109,
  "log_bytes": 12476416,
  "log_compactions": 23,
  "logged_bytes": 518922240,
  "files_written_wal": 2,
  "files_written_sst": 26235,
  "bytes_written_wal": 1480717702902,
  "bytes_written_sst": 1382491287991
}
[root@ceph-osd5 /]# ceph daemon osd.73 perf dump | jq .bluefs
{
  "gift_bytes": 0,
  "reclaim_bytes": 0,
  "db_total_bytes": 30064762880,
  "db_used_bytes": 1874853888,
  "wal_total_bytes": 1073737728,
  "wal_used_bytes": 529530880,
  "slow_total_bytes": 240043163648,
  "slow_used_bytes": 4678746112, <--
  "num_files": 116,
  "log_bytes": 9302016,
  "log_compactions": 23,
  "logged_bytes": 573952000,
  "files_written_wal": 2,
  "files_written_sst": 19755,
  "bytes_written_wal": 1401924883737,
  "bytes_written_sst": 980145086778
}

And osd metadata for this osd's:


{
    "id": 33,
    "arch": "x86_64",
    "back_addr": "10.10.10.6:6812/175022",
    "back_iface": "vlan3",
    "bluefs": "1",
    "bluefs_db_access_mode": "blk",
    "bluefs_db_block_size": "4096",
    "bluefs_db_dev": "259:22",
    "bluefs_db_dev_node": "nvme2n1",
    "bluefs_db_driver": "KernelDevice",
    "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
    "bluefs_db_partition_path": "/dev/nvme2n1p7",
    "bluefs_db_rotational": "0",
    "bluefs_db_serial": "CVFT4324002Q400BGN  ",
    "bluefs_db_size": "32212254720",
    "bluefs_db_type": "nvme",
    "bluefs_single_shared_device": "0",
    "bluefs_slow_access_mode": "blk",
    "bluefs_slow_block_size": "4096",
    "bluefs_slow_dev": "8:112",
    "bluefs_slow_dev_node": "sdh",
    "bluefs_slow_driver": "KernelDevice",
    "bluefs_slow_model": "TOSHIBA HDWE160 ",
    "bluefs_slow_partition_path": "/dev/sdh2",
    "bluefs_slow_rotational": "1",
    "bluefs_slow_size": "6001069199360",
    "bluefs_slow_type": "hdd",
    "bluefs_wal_access_mode": "blk",
    "bluefs_wal_block_size": "4096",
    "bluefs_wal_dev": "259:22",
    "bluefs_wal_dev_node": "nvme2n1",
    "bluefs_wal_driver": "KernelDevice",
    "bluefs_wal_model": "INTEL SSDPEDMD400G4 ",
    "bluefs_wal_partition_path": "/dev/nvme2n1p8",
    "bluefs_wal_rotational": "0",
    "bluefs_wal_serial": "CVFT4324002Q400BGN  ",
    "bluefs_wal_size": "1073741824",
    "bluefs_wal_type": "nvme",
    "bluestore_bdev_access_mode": "blk",
    "bluestore_bdev_block_size": "4096",
    "bluestore_bdev_dev": "8:112",
    "bluestore_bdev_dev_node": "sdh",
    "bluestore_bdev_driver": "KernelDevice",
    "bluestore_bdev_model": "TOSHIBA HDWE160 ",
    "bluestore_bdev_partition_path": "/dev/sdh2",
    "bluestore_bdev_rotational": "1",
    "bluestore_bdev_size": "6001069199360",
    "bluestore_bdev_type": "hdd",
    "ceph_version": "ceph version 12.2.10 
(177915764b752804194937482a39e95e0ca3de94) luminous (stable)",

    "cpu": "Intel(R) Xeon(R) CPU E5-2609 v4 @ 1.70GHz",
    "default_device_class": "hdd",
    "distro": "centos",
    "distro_description": "CentOS Linux 7 (Core)",
    "distro_version": "7",
    "front_addr": "172.16.16.16:6812/175022",
    "front_iface": "vlan4",
    "hb_back_addr": "10.10.10.6:6813/175022",
    "hb_front_addr": "172.16.16.16:6813/175022",
    "hostname": "ceph-osd5",
    "journal_rotational": "0",
    "kernel_description": "#1 SMP Tue Aug 14 21:49:04 UTC 2018",
    "kernel_version": "3.10.0-862.11.6.el7.x86_64",
    "mem_swap_kb": "0",
    "mem_total_kb": "65724256",
    "os": "Linux",
    "osd_data": "/var/lib/ceph/osd/ceph-33",
    "osd_objectstore": "bluestore",
    "rotational": "1"
}
[root@ceph-mon0 /]# ceph osd metadata osd.37
{
    "id": 37,
    "arch": "x86_64",
    "back_addr": "10.10.10.4:6808/164908",
    "back_iface": "vlan3",
    "bluefs": "1",
    "bluefs_db_access_mode": "blk",
    "bluefs_db_block_size": "4096",
    "bluefs_db_dev": "259:22",
    "bluefs_db_dev_node": "nvme2n1",
    "bluefs_db_driver": "KernelDevice",
    "bluefs_db_model": "INTEL SSDPEDMD400G4 ",
    "bluefs_db_partition_path": "/dev/nvme2n1p7",