Re: [ceph-users] NVMe disk - size

2019-11-18 Thread Kristof Coucke
Hi Jesper & Lars,

Thanks for your answers.

c) is indeed the option where the NVMe devices are used for block.db.
Adding an extra NVMe doesn't really seem necessary, since there is no space
issue... (On the contrary... the only advantage would be limiting the impact
of a failed NVMe disk.) The performance of the NVMe's shouldn't be a problem
either, as they should be more than fast enough...

What Lars is saying might be an option too:
going for option c) and using the remaining space for the RGW metadata pool.
This would result in +/- 500GB/host being used as fast storage for the OSDs
and 2.7TB/host being used for the RGW metadata pool...
I haven't found any calculations for the size of the RGW metadata pool yet,
but we're storing very limited metadata information, so I can't imagine
that 27TB of raw NVMe disk space (2.7TB x 10 hosts) wouldn't be sufficient
for that pool...

So, the final pick will be a) combined with c). I'm just going to look
further into the metadata size calculation to be sure that there is no issue
related to the sizing.
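
A quick back-of-the-envelope check of that split (a sketch with my own
numbers; 3x replication assumed for the metadata pool):

  # Per-host and cluster-wide split of the 2x 1.6TB NVMe devices
  osds_per_host=13
  db_gb_per_osd=35                                   # 30GB block.db + spare/WAL
  nvme_gb_per_host=$((2 * 1600))
  db_gb_per_host=$((osds_per_host * db_gb_per_osd))  # ~455GB for block.db
  meta_gb_per_host=$((nvme_gb_per_host - db_gb_per_host))  # ~2.7TB for RGW metadata
  raw_meta_gb=$((meta_gb_per_host * 10))             # 10 hosts -> ~27TB raw
  echo "per host: ${db_gb_per_host}GB DB, ${meta_gb_per_host}GB metadata;" \
       "cluster: ${raw_meta_gb}GB raw, ~$((raw_meta_gb / 3))GB usable at 3x"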

Regards,

Kristof

On Mon, 18 Nov 2019 at 08:14,  wrote:

> Is c) the bcache solution?
>
> Real-life experience: unless you are really beating an enterprise SSD
> with writes, they last very, very long, and even when a failure happens you
> can typically see it coming in the SMART wear levels months before.
>
> I would go for c), but if possible add one more NVMe to each host - we have
> a 9-HDD + 3-SSD scenario here.
>
> Jesper
>
>
>
> Sent from myMail for iOS
>
>
> Monday, 18 November 2019, 07.49 +0100 from kristof.cou...@gmail.com <
> kristof.cou...@gmail.com>:
>
> Hi all,
>
> Thanks for the feedback.
> Though, just to be sure:
>
> 1. If I understand correctly, there is no 30GB limit for the RocksDB size.
> If the metadata crosses that barrier, will the L4 part spill over to the
> primary device? Or will it move the RocksDB completely? Or will it just stop
> and indicate it's full?
> 2. Since the WAL will also be written to that device, I assume a few
> additional GBs are still useful...
>
> With my setup (13x 14TB + 2 NVMe of 1.6TB per host, 10 hosts) I have
> multiple possible scenarios:
> - Assigning 35GB of NVMe space per OSD (30GB for the DB, 5GB spare) would
> result in only 455GB being used (13 x 35GB). This is a pity, since I have
> 3.2TB of NVMe disk space...
>
> Options line-up:
>
> *Option a*: Not using the NVMe for block.db storage, but as an RGW metadata
> pool.
> Advantages:
> - Impact of a failed NVMe is limited.
> - Fast storage for the metadata pool.
> Disadvantage:
> - RocksDB for each OSD is on the primary disk, resulting in slower
> performance of each OSD.
>
> *Option b:* Hardware mirror of the NVMe drives
> Advantages:
> - Impact of a failed NVMe is limited
> - Fast KV lookups for each OSD
> Disadvantages:
> - I/O to the NVMe is serialized for all OSDs on a host. Though the NVMe are
> fast, I imagine there is still an impact.
> - 1TB of NVMe is not used per host
>
> *Option c:* Split the NVMe's across the OSDs
> Advantage:
> - Fast RocksDB access - up to L3 (assuming spillover does its job)
> Disadvantages:
> - A failed NVMe impacts at most 7 OSDs (1 NVMe assigned to 6 or 7 OSD
> daemons per host)
> - 2.7TB of NVMe space not used per host
>
> *Option d:* 1 NVMe disk for the OSDs, 1 for the RGW metadata pool
> Advantages:
> - Fast RocksDB access - up to L3
> - Fast RGW metadata pool (though limited to 5.3TB: the raw pool size will be
> 16TB, divided by 3 due to replication). I assume this already gives some
> possibilities.
> Disadvantages:
> - A failed NVMe might impact a complete host (all OSDs might be using it
> for the RocksDB storage)
> - 1TB of NVMe is not used
>
> A tough menu to choose from, each option with its own possibilities... The
> initial idea was to assign 200GB of NVMe space per OSD, but this would
> result in a lot of unused space. I don't know if there is anything on the
> roadmap to adapt the RocksDB sizing to make better use of the available
> NVMe disk space.
> With all this information, I would assume that the best option is *option
> a*. Since we will be using erasure coding for the RGW data pool (k=6, m=3),
> the impact of a failed NVMe would be too significant. The other alternative
> would be option b, but then we would be dealing with HW RAID, which goes
> against all Ceph design rules.
>
> Any other options or (dis)advantages I missed? Or any other opinions on
> choosing a different option?
>
> Regards,
>
> Kristof
>
On Fri, 15 Nov 2019 at 18:22,  wrote:
>
> Use 30 GB for all OSDs. Other values are pointless, because
> https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>
> You can use the rest of free NVMe space for bcache - it's much better
> than just allocating it for block.db.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-17 Thread Lars Täuber
Hi Kristof,

may I add another choice?
I configured my SSDs this way.

Every OSD host has two fast and durable SSDs.
Both SSDs are in one RAID1, which is then split up into LVs.

I took 58GB for DB & WAL for each OSD (plus some room for a special action by
the DB - compaction, I believe).
Then there were some hundreds of GB left on this RAID1, which I took to form a
faster SSD-OSD.
This is put into its own class of OSDs.

So I have (slower) pools put onto OSDs of class "hdd" and (faster) pools put 
onto OSDs of class "ssd".
The faster pools are used for metadata of CephFS.
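
A rough sketch of that layout (device names, sizes and pool names are just
examples; software RAID shown here):

  # Mirror the two SSDs and carve LVs out of the array
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  pvcreate /dev/md0
  vgcreate ssdvg /dev/md0
  lvcreate -L 58G -n db-0 ssdvg            # one DB+WAL LV per HDD-OSD
  lvcreate -l 100%FREE -n fast-osd ssdvg   # leftover becomes a small SSD-OSD

  # HDD-OSD with its DB/WAL on the mirrored SSD LV
  ceph-volume lvm create --data /dev/sdc --block.db ssdvg/db-0
  # Extra OSD on the leftover space, placed in its own device class
  ceph-volume lvm create --data ssdvg/fast-osd --crush-device-class ssd

  # CRUSH rule that only uses the "ssd" class, e.g. for the CephFS metadata pool
  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set cephfs_metadata crush_rule ssd-only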

Good luck,
Lars


Mon, 18 Nov 2019 07:46:23 +0100
Kristof Coucke  ==> vita...@yourcmc.ru :
> Hi all,
> 
> Thanks for the feedback.
> Though, just to be sure:
> 
> 1. If I understand correctly, there is no 30GB limit for the RocksDB size.
> If the metadata crosses that barrier, will the L4 part spill over to the
> primary device? Or will it move the RocksDB completely? Or will it just stop
> and indicate it's full?
> 2. Since the WAL will also be written to that device, I assume a few
> additional GBs are still useful...
>
> With my setup (13x 14TB + 2 NVMe of 1.6TB per host, 10 hosts) I have
> multiple possible scenarios:
> - Assigning 35GB of NVMe space per OSD (30GB for the DB, 5GB spare) would
> result in only 455GB being used (13 x 35GB). This is a pity, since I have
> 3.2TB of NVMe disk space...
> 
> Options line-up:
> 
> *Option a*: Not using the NVMe for block.db storage, but as an RGW metadata
> pool.
> Advantages:
> - Impact of a failed NVMe is limited.
> - Fast storage for the metadata pool.
> Disadvantage:
> - RocksDB for each OSD is on the primary disk, resulting in slower
> performance of each OSD.
>
> *Option b:* Hardware mirror of the NVMe drives
> Advantages:
> - Impact of a failed NVMe is limited
> - Fast KV lookups for each OSD
> Disadvantages:
> - I/O to the NVMe is serialized for all OSDs on a host. Though the NVMe are
> fast, I imagine there is still an impact.
> - 1TB of NVMe is not used per host
>
> *Option c:* Split the NVMe's across the OSDs
> Advantage:
> - Fast RocksDB access - up to L3 (assuming spillover does its job)
> Disadvantages:
> - A failed NVMe impacts at most 7 OSDs (1 NVMe assigned to 6 or 7 OSD
> daemons per host)
> - 2.7TB of NVMe space not used per host
>
> *Option d:* 1 NVMe disk for the OSDs, 1 for the RGW metadata pool
> Advantages:
> - Fast RocksDB access - up to L3
> - Fast RGW metadata pool (though limited to 5.3TB: the raw pool size will be
> 16TB, divided by 3 due to replication). I assume this already gives some
> possibilities.
> Disadvantages:
> - A failed NVMe might impact a complete host (all OSDs might be using it
> for the RocksDB storage)
> - 1TB of NVMe is not used
>
> A tough menu to choose from, each option with its own possibilities... The
> initial idea was to assign 200GB of NVMe space per OSD, but this would
> result in a lot of unused space. I don't know if there is anything on the
> roadmap to adapt the RocksDB sizing to make better use of the available
> NVMe disk space.
> With all this information, I would assume that the best option is *option
> a*. Since we will be using erasure coding for the RGW data pool (k=6, m=3),
> the impact of a failed NVMe would be too significant. The other alternative
> would be option b, but then we would be dealing with HW RAID, which goes
> against all Ceph design rules.
>
> Any other options or (dis)advantages I missed? Or any other opinions on
> choosing a different option?
> 
> Regards,
> 
> Kristof
> 
> On Fri, 15 Nov 2019 at 18:22,  wrote:
> 
> > Use 30 GB for all OSDs. Other values are pointless, because
> > https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
> >
> > You can use the rest of free NVMe space for bcache - it's much better
> > than just allocating it for block.db.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-17 Thread jesper

Is c) the bcache solution?

Real-life experience: unless you are really beating an enterprise SSD with
writes, they last very, very long, and even when a failure happens you can
typically see it coming in the SMART wear levels months before.
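
For reference, the wear indicators can be read like this (nvme-cli /
smartmontools; device names are examples):

  # Wear and spare-capacity indicators of an NVMe drive
  nvme smart-log /dev/nvme0 | grep -Ei 'percentage_used|available_spare'
  # Same idea via smartmontools
  smartctl -a /dev/nvme0n1 | grep -Ei 'percentage used|available spare'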

I would go for c), but if possible add one more NVMe to each host - we have a
9-HDD + 3-SSD scenario here.

Jesper



Sent from myMail for iOS


Monday, 18 November 2019, 07.49 +0100 from kristof.cou...@gmail.com  
:
>Hi all,
>
>Thanks for the feedback.
>Though, just to be sure:
>
>1. If I understand correctly, there is no 30GB limit for the RocksDB size. If
>the metadata crosses that barrier, will the L4 part spill over to the primary
>device? Or will it move the RocksDB completely? Or will it just stop and
>indicate it's full?
>2. Since the WAL will also be written to that device, I assume a few
>additional GBs are still useful...
>
>With my setup (13x 14TB + 2 NVMe of 1.6TB per host, 10 hosts) I have multiple
>possible scenarios:
>- Assigning 35GB of NVMe space per OSD (30GB for the DB, 5GB spare) would
>result in only 455GB being used (13 x 35GB). This is a pity, since I have
>3.2TB of NVMe disk space...
>
>Options line-up:
>
>Option a: Not using the NVMe for block.db storage, but as an RGW metadata pool.
>Advantages:
>- Impact of a failed NVMe is limited.
>- Fast storage for the metadata pool.
>Disadvantage:
>- RocksDB for each OSD is on the primary disk, resulting in slower performance
>of each OSD.
>
>Option b: Hardware mirror of the NVMe drives
>Advantages:
>- Impact of a failed NVMe is limited
>- Fast KV lookups for each OSD
>Disadvantages:
>- I/O to the NVMe is serialized for all OSDs on a host. Though the NVMe are
>fast, I imagine there is still an impact.
>- 1TB of NVMe is not used per host
>
>Option c: Split the NVMe's across the OSDs
>Advantage:
>- Fast RocksDB access - up to L3 (assuming spillover does its job)
>Disadvantages:
>- A failed NVMe impacts at most 7 OSDs (1 NVMe assigned to 6 or 7 OSD daemons
>per host)
>- 2.7TB of NVMe space not used per host
>
>Option d: 1 NVMe disk for the OSDs, 1 for the RGW metadata pool
>Advantages:
>- Fast RocksDB access - up to L3
>- Fast RGW metadata pool (though limited to 5.3TB: the raw pool size will be
>16TB, divided by 3 due to replication). I assume this already gives some
>possibilities.
>Disadvantages:
>- A failed NVMe might impact a complete host (all OSDs might be using it for
>the RocksDB storage)
>- 1TB of NVMe is not used
>
>A tough menu to choose from, each option with its own possibilities... The
>initial idea was to assign 200GB of NVMe space per OSD, but this would result
>in a lot of unused space. I don't know if there is anything on the roadmap to
>adapt the RocksDB sizing to make better use of the available NVMe disk space.
>With all this information, I would assume that the best option is option a.
>Since we will be using erasure coding for the RGW data pool (k=6, m=3), the
>impact of a failed NVMe would be too significant. The other alternative would
>be option b, but then we would be dealing with HW RAID, which goes against
>all Ceph design rules.
>
>Any other options or (dis)advantages I missed? Or any other opinions on
>choosing a different option?
>
>Regards,
>
>Kristof
>On Fri, 15 Nov 2019 at 18:22, < vita...@yourcmc.ru > wrote:
>>Use 30 GB for all OSDs. Other values are pointless, because 
>>https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>>
>>You can use the rest of free NVMe space for bcache - it's much better 
>>than just allocating it for block.db.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-17 Thread Kristof Coucke
Hi all,

Thanks for the feedback.
Though, just to be sure:

1. If I understand correctly, there is no 30GB limit for the RocksDB size.
If the metadata crosses that barrier, will the L4 part spill over to the
primary device? Or will it move the RocksDB completely? Or will it just stop
and indicate it's full?
2. Since the WAL will also be written to that device, I assume a few
additional GBs are still useful...

With my setup (13x 14TB + 2 NVMe of 1.6TB per host, 10 hosts) I have multiple
possible scenarios:
- Assigning 35GB of NVMe space per OSD (30GB for the DB, 5GB spare) would
result in only 455GB being used (13 x 35GB). This is a pity, since I have
3.2TB of NVMe disk space...

Options line-up:

*Option a*: Not using the NVMe for block.db storage, but as an RGW metadata
pool.
Advantages:
- Impact of a failed NVMe is limited.
- Fast storage for the metadata pool.
Disadvantage:
- RocksDB for each OSD is on the primary disk, resulting in slower
performance of each OSD.

*Option b:* Hardware mirror of the NVMe drives
Advantages:
- Impact of a failed NVMe is limited
- Fast KV lookups for each OSD
Disadvantages:
- I/O to the NVMe is serialized for all OSDs on a host. Though the NVMe are
fast, I imagine there is still an impact.
- 1TB of NVMe is not used per host

*Option c:* Split the NVMe's across the OSDs
Advantage:
- Fast RocksDB access - up to L3 (assuming spillover does its job)
Disadvantages:
- A failed NVMe impacts at most 7 OSDs (1 NVMe assigned to 6 or 7 OSD daemons
per host)
- 2.7TB of NVMe space not used per host

*Option d:* 1 NVMe disk for the OSDs, 1 for the RGW metadata pool
Advantages:
- Fast RocksDB access - up to L3
- Fast RGW metadata pool (though limited to 5.3TB: the raw pool size will be
16TB, divided by 3 due to replication). I assume this already gives some
possibilities.
Disadvantages:
- A failed NVMe might impact a complete host (all OSDs might be using it
for the RocksDB storage)
- 1TB of NVMe is not used

A tough menu to choose from, each option with its own possibilities... The
initial idea was to assign 200GB of NVMe space per OSD, but this would
result in a lot of unused space. I don't know if there is anything on the
roadmap to adapt the RocksDB sizing to make better use of the available
NVMe disk space.
With all this information, I would assume that the best option is *option
a*. Since we will be using erasure coding for the RGW data pool (k=6, m=3),
the impact of a failed NVMe would be too significant. The other alternative
would be option b, but then we would be dealing with HW RAID, which goes
against all Ceph design rules.
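
For completeness, the profile and data pool we have in mind would be created
roughly like this (a sketch; profile name and PG counts are just examples):

  # 6+3 erasure-code profile for the RGW data pool, failure domain = host
  ceph osd erasure-code-profile set rgw-ec63 k=6 m=3 \
      crush-failure-domain=host crush-device-class=hdd
  ceph osd pool create default.rgw.buckets.data 1024 1024 erasure rgw-ec63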

Any other options or (dis)advantages I missed? Or any other opinions on
choosing a different option?

Regards,

Kristof

On Fri, 15 Nov 2019 at 18:22,  wrote:

> Use 30 GB for all OSDs. Other values are pointless, because
> https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>
> You can use the rest of free NVMe space for bcache - it's much better
> than just allocating it for block.db.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread vitalif
Use 30 GB for all OSDs. Other values are pointless, because 
https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing


You can use the rest of free NVMe space for bcache - it's much better 
than just allocating it for block.db.
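
A minimal bcache sketch for one HDD behind an NVMe partition (assumes
bcache-tools; device names are examples, and the OSD would then be created
on top of /dev/bcache0):

  # Cache device on an NVMe partition, backing device on the HDD
  make-bcache -C /dev/nvme0n1p2
  make-bcache -B /dev/sdc
  # Attach the backing device to the cache set and enable writeback
  bcache-super-show /dev/nvme0n1p2 | grep cset.uuid
  echo "<cset-uuid-from-above>" > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode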

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:39 PM Wido den Hollander  wrote:
>
>
>
> On 11/15/19 4:25 PM, Paul Emmerich wrote:
> > On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
> >>
> >>  I normally use LVM on top
> >> of each device and create 2 LVs per OSD:
> >>
> >> - WAL: 1GB
> >> - DB: xx GB
> >
> > Why? I've seen this a few times and I can't figure out what the
> > advantage of doing this explicitly on the LVM level instead of relying
> > on BlueStore to handle this.
> >
>
> If the WAL+DB are on an external device you want the WAL to be there as
> well. That's why I specify the WAL separately.
>
> This might be an 'old habit' as well.

But the WAL will be placed onto the DB device if it isn't explicitly
specified, so there's no advantage to having a separate partition.


Paul

>
> Wido
>
> >
> > Paul
> >
> >>
> >>>
> >>>
> >>> The initial cluster is +1PB and we’re planning to expand it again with
> >>> 1PB in the near future to migrate our data.
> >>>
> >>> We’ll only use the system thru the RGW (No CephFS, nor block device),
> >>> and we’ll store “a lot” of small files on it… (Millions of files a day)
> >>>
> >>>
> >>>
> >>> The reason I’m asking it, is that I’ve been able to break the test
> >>> system (long story), causing OSDs to fail as they ran out of space…
> >>> Expanding the disks (the block DB device as well as the main block
> >>> device) failed with the ceph-bluestore-tool…
> >>>
> >>>
> >>>
> >>> Thanks for your answer!
> >>>
> >>>
> >>>
> >>> Kristof
> >>>
> >>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Wido den Hollander


On 11/15/19 4:25 PM, Paul Emmerich wrote:
> On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
>>
>>  I normally use LVM on top
>> of each device and create 2 LVs per OSD:
>>
>> - WAL: 1GB
>> - DB: xx GB
> 
> Why? I've seen this a few times and I can't figure out what the
> advantage of doing this explicitly on the LVM level instead of relying
> on BlueStore to handle this.
> 

If the WAL+DB are on an external device you want the WAL to be there as
well. That's why I specify the WAL separately.

This might be an 'old habit' as well.

Wido

> 
> Paul
> 
>>
>>>
>>>
>>> The initial cluster is +1PB and we’re planning to expand it again with
>>> 1PB in the near future to migrate our data.
>>>
>>> We’ll only use the system thru the RGW (No CephFS, nor block device),
>>> and we’ll store “a lot” of small files on it… (Millions of files a day)
>>>
>>>
>>>
>>> The reason I’m asking it, is that I’ve been able to break the test
>>> system (long story), causing OSDs to fail as they ran out of space…
>>> Expanding the disks (the block DB device as well as the main block
>>> device) failed with the ceph-bluestore-tool…
>>>
>>>
>>>
>>> Thanks for your answer!
>>>
>>>
>>>
>>> Kristof
>>>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:02 PM Wido den Hollander  wrote:
>
>  I normally use LVM on top
> of each device and create 2 LVs per OSD:
>
> - WAL: 1GB
> - DB: xx GB

Why? I've seen this a few times and I can't figure out what the
advantage of doing this explicitly on the LVM level instead of relying
on BlueStore to handle this.


Paul

>
> >
> >
> > The initial cluster is +1PB and we’re planning to expand it again with
> > 1PB in the near future to migrate our data.
> >
> > We’ll only use the system thru the RGW (No CephFS, nor block device),
> > and we’ll store “a lot” of small files on it… (Millions of files a day)
> >
> >
> >
> > The reason I’m asking it, is that I’ve been able to break the test
> > system (long story), causing OSDs to fail as they ran out of space…
> > Expanding the disks (the block DB device as well as the main block
> > device) failed with the ceph-bluestore-tool…
> >
> >
> >
> > Thanks for your answer!
> >
> >
> >
> > Kristof
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 4:04 PM Kristof Coucke  wrote:
>
> Hi Paul,
>
> Thank you for the answer.
> I hadn't thought of that approach... (using the NVMe for the RGW metadata
> pool).
>
> From where do you get the limitation of 1.3TB?

13 OSDs/server * 10 servers * 30 GB/OSD usable DB space / 3 (replicas) ≈ 1.3 TB


>
> I don't get that one...
>
> Br,
>
> Kristof
>
> On Fri, 15 Nov 2019 at 15:26, Paul Emmerich  wrote:
>>
>> On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke  
>> wrote:
>> > We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks 
>> > (14TB) and 2 NVMe disks (1,6TB).
>> > The recommendations I’ve read in the online documentation, state that the 
>> > db block device should be around 4%~5% of the slow device. So, the 
>> > block.db should be somewhere between 600GB and 700GB as a best practice.
>>
>> That recommendation is unfortunately not based on any facts :(
>> How much you really need depends on your actual usage.
>>
>> > However… I was thinking to only reserve 200GB per OSD as fast device… 
>> > Which is 1/3 of the recommendation…
>>
>> For various weird internal reason it'll only use ~30 GB in the steady
>> state during operation before spilling over at the moment, 300 GB
>> would be the next magical number
>> (search mailing list for details)
>>
>>
>> > Is it recommended to still use it as a block.db
>>
>> yes
>>
>> > or is it recommended to only use it as a WAL device?
>>
>> no, there is no advantage to that if it's that large
>>
>>
>> > Should I just split the NVMe in three and only configure 3 OSDs to use the 
> >> > system? (This would mean that the performance would be degraded to the
>> > speed of the slowest device…)
>>
>> no
>>
>> > We’ll only use the system thru the RGW (No CephFS, nor block device), and 
>> > we’ll store “a lot” of small files on it… (Millions of files a day)
>>
>> the current setup gives you around ~1.3 TB of usable metadata space
>> which may or may not be enough, really depends on how much "a lot" is
>> and how small "small" is.
>>
>> It might be better to use the NVMe disks as dedicated OSDs and map all
>> metadata pools onto them directly, that allows you to fully utilize
>> the space for RGW metadata (but not Ceph metadata in the data pools)
>> without running into weird db size restrictions.
>> There are advantages and disadvantages to both approaches
>>
>> Paul
>>
>> >
>> >
>> >
>> > The reason I’m asking it, is that I’ve been able to break the test system 
>> > (long story), causing OSDs to fail as they ran out of space… Expanding the 
>> > disks (the block DB device as well as the main block device) failed with 
>> > the ceph-bluestore-tool…
>> >
>> >
>> >
>> > Thanks for your answer!
>> >
>> >
>> >
>> > Kristof
>> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Kristof Coucke
Hi Paul,

Thank you for the answer.
I hadn't thought of that approach... (using the NVMe for the RGW metadata
pool).

From where do you get the limitation of 1.3TB?

I don't get that one...

Br,

Kristof

On Fri, 15 Nov 2019 at 15:26, Paul Emmerich  wrote:

> On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke 
> wrote:
> > We’ve configured a Ceph cluster with 10 nodes, each having 13 large
> disks (14TB) and 2 NVMe disks (1,6TB).
> > The recommendations I’ve read in the online documentation, state that
> the db block device should be around 4%~5% of the slow device. So, the
> block.db should be somewhere between 600GB and 700GB as a best practice.
>
> That recommendation is unfortunately not based on any facts :(
> How much you really need depends on your actual usage.
>
> > However… I was thinking to only reserve 200GB per OSD as fast device…
> Which is 1/3 of the recommendation…
>
> For various weird internal reason it'll only use ~30 GB in the steady
> state during operation before spilling over at the moment, 300 GB
> would be the next magical number
> (search mailing list for details)
>
>
> > Is it recommended to still use it as a block.db
>
> yes
>
> > or is it recommended to only use it as a WAL device?
>
> no, there is no advantage to that if it's that large
>
>
> > Should I just split the NVMe in three and only configure 3 OSDs to use
> the system? (This would mean that the performance would be degraded to the
> speed of the slowest device…)
>
> no
>
> > We’ll only use the system thru the RGW (No CephFS, nor block device),
> and we’ll store “a lot” of small files on it… (Millions of files a day)
>
> the current setup gives you around ~1.3 TB of usable metadata space
> which may or may not be enough, really depends on how much "a lot" is
> and how small "small" is.
>
> It might be better to use the NVMe disks as dedicated OSDs and map all
> metadata pools onto them directly, that allows you to fully utilize
> the space for RGW metadata (but not Ceph metadata in the data pools)
> without running into weird db size restrictions.
> There are advantages and disadvantages to both approaches
>
> Paul
>
> >
> >
> >
> > The reason I’m asking it, is that I’ve been able to break the test
> system (long story), causing OSDs to fail as they ran out of space…
> Expanding the disks (the block DB device as well as the main block device)
> failed with the ceph-bluestore-tool…
> >
> >
> >
> > Thanks for your answer!
> >
> >
> >
> > Kristof
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Wido den Hollander


On 11/15/19 3:19 PM, Kristof Coucke wrote:
> Hi all,
> 
>  
> 
> We’ve configured a Ceph cluster with 10 nodes, each having 13 large
> disks (14TB) and 2 NVMe disks (1,6TB).
> 
> The idea was to use the NVMe as “fast device”…
> 
> The recommendations I’ve read in the online documentation, state that
> the db block device should be around 4%~5% of the slow device. So, the
> block.db should be somewhere between 600GB and 700GB as a best practice.
> 
> However… I was thinking to only reserve 200GB per OSD as fast device…
> Which is 1/3 of the recommendation…
> 
>  
The 4% rule is way too much. Usually 10GB per 1TB of storage is
sufficient, so 1%. You should be safe with 200GB per OSD.

> 
> I’ve tested in the labs, and it does work fine with even very small
> devices (the spillover does its job).
> 
> Though, before taking the system into production, I would like to
> verify that no issues arise.
> 
>  
> 
>   * Is it recommended to still use it as a block.db, or is it
> recommended to only use it as a WAL device?

Use it as DB.

>   * Should I just split the NVMe in three and only configure 3 OSDs to
> use the system? (This would mean that the performance would be
> degraded to the speed of the slowest device…)
> 

I would split them. So 6 and 7 OSDs per NVMe. I normally use LVM on top
of each device and create 2 LVs per OSD:

- WAL: 1GB
- DB: xx GB
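
Something along these lines (a sketch; VG/LV names, sizes and device paths
are examples):

  # One VG per NVMe, two LVs per OSD (WAL + DB)
  vgcreate nvme0 /dev/nvme0n1
  lvcreate -L 1G  -n osd0-wal nvme0
  lvcreate -L 60G -n osd0-db  nvme0
  ceph-volume lvm create --data /dev/sdc \
      --block.wal nvme0/osd0-wal --block.db nvme0/osd0-db
  # If --block.wal is omitted, the WAL simply lives on the block.db device.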

>  
> 
> The initial cluster is +1PB and we’re planning to expand it again with
> 1PB in the near future to migrate our data.
> 
> We’ll only use the system thru the RGW (No CephFS, nor block device),
> and we’ll store “a lot” of small files on it… (Millions of files a day)
> 
>  
> 
> The reason I’m asking it, is that I’ve been able to break the test
> system (long story), causing OSDs to fail as they ran out of space…
> Expanding the disks (the block DB device as well as the main block
> device) failed with the ceph-bluestore-tool…
> 
>  
> 
> Thanks for your answer!
> 
>  
> 
> Kristof
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-15 Thread Paul Emmerich
On Fri, Nov 15, 2019 at 3:16 PM Kristof Coucke  wrote:
> We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks 
> (14TB) and 2 NVMe disks (1,6TB).
> The recommendations I’ve read in the online documentation, state that the db 
> block device should be around 4%~5% of the slow device. So, the block.db 
> should be somewhere between 600GB and 700GB as a best practice.

That recommendation is unfortunately not based on any facts :(
How much you really need depends on your actual usage.

> However… I was thinking to only reserve 200GB per OSD as fast device… Which 
> is 1/3 of the recommendation…

For various weird internal reasons it'll only use ~30 GB in the steady
state during operation before spilling over at the moment; 300 GB
would be the next magical number
(search the mailing list for details).
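
If you want to check whether an OSD has already spilled over, something like
this works (osd.0 is just an example; run it on the OSD's host):

  # Cluster-wide warning once BlueFS spills onto the slow device
  ceph health detail | grep -i spillover
  # Per-OSD view: how much of the DB lives on the fast vs. the slow device
  ceph daemon osd.0 perf dump bluefs | grep -E 'db_(total|used)_bytes|slow_used_bytes'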


> Is it recommended to still use it as a block.db

yes

> or is it recommended to only use it as a WAL device?

no, there is no advantage to that if it's that large


> Should I just split the NVMe in three and only configure 3 OSDs to use the 
> system? (This would mean that the performance would be degraded to the speed
> of the slowest device…)

no

> We’ll only use the system thru the RGW (No CephFS, nor block device), and 
> we’ll store “a lot” of small files on it… (Millions of files a day)

the current setup gives you around ~1.3 TB of usable metadata space
which may or may not be enough, really depends on how much "a lot" is
and how small "small" is.

It might be better to use the NVMe disks as dedicated OSDs and map all
metadata pools onto them directly, that allows you to fully utilize
the space for RGW metadata (but not Ceph metadata in the data pools)
without running into weird db size restrictions.
There are advantages and disadvantages to both approaches
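
For the dedicated-OSD variant, the mapping would look roughly like this (a
sketch; the pool names are the usual default-zone names, adjust to your
setup):

  # Check which device class the NVMe OSDs got (usually "ssd" or "nvme")
  ceph osd tree
  # Replicated CRUSH rule restricted to that class
  ceph osd crush rule create-replicated rgw-meta default host ssd
  # Pin the RGW metadata/index pools to it (the data pool stays on the HDDs)
  for pool in .rgw.root default.rgw.control default.rgw.meta \
              default.rgw.log default.rgw.buckets.index; do
      ceph osd pool set "$pool" crush_rule rgw-meta
  done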

Paul

>
>
>
> The reason I’m asking it, is that I’ve been able to break the test system 
> (long story), causing OSDs to fail as they ran out of space… Expanding the 
> disks (the block DB device as well as the main block device) failed with the 
> ceph-bluestore-tool…
>
>
>
> Thanks for your answer!
>
>
>
> Kristof
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NVMe disk - size

2019-11-15 Thread Kristof Coucke
Hi all,



We’ve configured a Ceph cluster with 10 nodes, each having 13 large disks
(14TB) and 2 NVMe disks (1.6TB).

The idea was to use the NVMe as “fast device”…

The recommendations I’ve read in the online documentation state that the
DB block device should be around 4%~5% of the slow device. So, the block.db
should be somewhere between 600GB and 700GB as a best practice.

However… I was thinking of only reserving 200GB per OSD as a fast device…
which is 1/3 of the recommendation…



I’ve tested this in the lab, and it does work fine even with very small
devices (the spillover does its job).

Though, before taking the system into production, I would like to verify
that no issues arise.



   - Is it recommended to still use it as a block.db, or is it recommended
   to only use it as a WAL device?
   - Should I just split the NVMe in three and only configure 3 OSDs to use
   the system? (This would mean that the performance would be degraded to the
   speed of the slowest device…)



The initial cluster is +1PB and we’re planning to expand it with another 1PB
in the near future to migrate our data.

We’ll only use the system through RGW (no CephFS, nor block devices), and
we’ll store “a lot” of small files on it… (millions of files a day)



The reason I’m asking is that I’ve been able to break the test system
(long story), causing OSDs to fail as they ran out of space… Expanding the
devices (the block.db device as well as the main block device) with
ceph-bluestore-tool failed…
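
For reference, this is roughly the kind of sequence I attempted (a sketch;
the OSD id and LV names are examples, and the OSD has to be stopped first):

  # Grow the underlying LV, then let BlueFS/BlueStore pick up the new size
  systemctl stop ceph-osd@12
  lvextend -L +30G /dev/nvme0/osd12-db
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12
  systemctl start ceph-osd@12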



Thanks for your answer!



Kristof
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

