Re: [ceph-users] WAL/DB size

2019-08-16 Thread Anthony D'Atri
Thanks — interesting reading.

Distilling the discussion there, below are my takeaways.  Am I interpreting 
correctly?

1) The spillover phenomenon, and thus the small number of discrete sizes that 
are effective without being wasteful, are recognized.

2) "I don't think we should plan the block.db size based on the rocksdb 
stairstep pattern. A better solution would be to tweak the rocksdb level sizes 
at mkfs time based on the block.db size!"

3) Neither 1) nor 2) was actually acted upon, so we got arbitrary guidance 
based on a calculation of the number of metadata objects, with no input from or 
action upon how the DB actually behaves?


Am I interpreting correctly?


> Btw, the original discussion leading to the 4% recommendation is here:
> https://github.com/ceph/ceph/pull/23210
> 
> 
> -- 
> Paul Emmerich
> 
> 
>> 30gb already includes WAL, see 
>> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>> 
>> On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>>> 
>>> Good points in both posts, but I think there’s still some unclarity.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-16 Thread Paul Emmerich
Btw, the original discussion leading to the 4% recommendation is here:
https://github.com/ceph/ceph/pull/23210


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Aug 15, 2019 at 11:23 AM Виталий Филиппов  wrote:
>
> 30gb already includes WAL, see 
> http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
>
> On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>>
>> Good points in both posts, but I think there’s still some unclarity.
>>
>> Absolutely let’s talk about DB and WAL together.  By “bluestore goes on 
>> flash” I assume you mean WAL+DB?
>>
>> “Simply allocate DB and WAL will appear there automatically”
>>
>> Forgive me please if this is obvious, but I’d like to see a holistic 
>> explanation of WAL and DB sizing *together*, which I think would help folks 
>> put these concepts together and plan deployments with some sense of 
>> confidence.
>>
>> We’ve seen good explanations on the list of why only specific DB sizes, say 
>> 30GB, are actually used _for the DB_.
>> If the WAL goes along with the DB, shouldn’t we also explicitly determine an 
>> appropriate size N for the WAL, and make the partition (30+N) GB?
>> If so, how do we derive N?  Or is it a constant?
>>
>> Filestore was so much simpler, 10GB set+forget for the journal.  Not that I 
>> miss XFS, mind you.
>>
>>
>>>> Actually standalone WAL is required when you have either very small fast
>>>> device (and don't want db to use it) or three devices (different in
>>>> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located
>>>> at the fastest one.
>>>>
>>>> For the given use case you just have HDD and NVMe and DB and WAL can
>>>> safely collocate. Which means you don't need to allocate specific volume
>>>> for WAL. Hence no need to answer the question how many space is needed
>>>> for WAL. Simply allocate DB and WAL will appear there automatically.
>>>>
>>> Yes, i'm surprised how often people talk about the DB and WAL separately
>>> for no good reason.  In common setups bluestore goes on flash and the
>>> storage goes on the HDDs, simple.
>>>
>>> In the event flash is 100s of GB and would be wasted, is there anything
>>> that needs to be done to set rocksdb to use the highest level?  600 I
>>> believe
>>
>
>
> --
> With best regards,
> Vitaliy Filippov


Re: [ceph-users] WAL/DB size

2019-08-15 Thread Виталий Филиппов
30gb already includes WAL, see 
http://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing

On 15 August 2019 at 1:15:58 GMT+03:00, Anthony D'Atri wrote:
>Good points in both posts, but I think there’s still some unclarity.
>
>Absolutely let’s talk about DB and WAL together.  By “bluestore goes on
>flash” I assume you mean WAL+DB?
>
>“Simply allocate DB and WAL will appear there automatically”
>
>Forgive me please if this is obvious, but I’d like to see a holistic
>explanation of WAL and DB sizing *together*, which I think would help
>folks put these concepts together and plan deployments with some sense
>of confidence.
>
>We’ve seen good explanations on the list of why only specific DB sizes,
>say 30GB, are actually used _for the DB_.
>If the WAL goes along with the DB, shouldn’t we also explicitly
>determine an appropriate size N for the WAL, and make the partition
>(30+N) GB?
>If so, how do we derive N?  Or is it a constant?
>
>Filestore was so much simpler, 10GB set+forget for the journal.  Not
>that I miss XFS, mind you.
>
>
>>> Actually standalone WAL is required when you have either very small fast
>>> device (and don't want db to use it) or three devices (different in
>>> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located
>>> at the fastest one.
>>>
>>> For the given use case you just have HDD and NVMe and DB and WAL can
>>> safely collocate. Which means you don't need to allocate specific volume
>>> for WAL. Hence no need to answer the question how many space is needed
>>> for WAL. Simply allocate DB and WAL will appear there automatically.
>>>
>> Yes, i'm surprised how often people talk about the DB and WAL separately
>> for no good reason.  In common setups bluestore goes on flash and the
>> storage goes on the HDDs, simple.
>>
>> In the event flash is 100s of GB and would be wasted, is there anything
>> that needs to be done to set rocksdb to use the highest level?  600 I
>> believe
>
>
>

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] WAL/DB size

2019-08-15 Thread Mark Nelson

Hi Folks,


The basic idea behind the WAL is that for every DB write transaction you 
first write it into an in-memory buffer and to a region on disk.  
RocksDB is typically set up to have multiple WAL buffers, and when one or 
more fills up, it will start flushing the data to L0 while new writes 
are written to the next buffer.  If rocksdb can't flush data fast 
enough, it will throttle write throughput down so that hopefully you 
don't fill all of the buffers up and stall before a flush completes.  
The combined total size/number of buffers governs both how much disk 
space you need for the WAL and how much RAM is needed to store incoming 
IO that hasn't finished flushing into the DB.  There are various 
tradeoffs when adjusting the size, number, and behavior of the WAL.  On 
one hand there's an advantage to having small buffers to favor frequent 
swift flush events and hopefully keep overall memory usage low and CPU 
overhead of key comparisons low.  On the other hand, having large WAL 
buffers means you have more runway both in terms of being able to absorb 
longer L0 compaction events but also potentially in terms of being able 
to avoid writing pglog entries to L0 entirely if a tombstone lands in 
the same WAL buffer as the initial write.  We've seen evidence that 
write amplification is (sometimes much) lower with bigger WAL buffers 
and we think this is a big part of the reason why.



Right now our default WAL settings for rocksdb are:


max_write_buffer_number=4

min_write_buffer_number_to_merge=1

write_buffer_size=268435456


which means we will store up to 4 256MB buffers and start flushing as 
soon as 1 fills up.  An alternate strategy could be to use something like 
16 64MB buffers, and set min_write_buffer_number_to_merge to something 
like 4.  Potentially that might provide slightly more fine grained control 
and also may be advantageous with a larger number of column families, 
but we haven't seen evidence yet that splitting the buffers into more 
smaller segments definitely improves things.  Probably the bigger 
take-away is that you can't simply make the WAL huge to give yourself 
extra runway for writes unless you are also willing to eat the RAM cost 
of storing all of that data in-memory as well. That's one of the reasons 
why we tell people regularly that 1-2GB is enough for the WAL.  With a 
target OSD memory of 4GB, (up to) 1GB for the WAL is already pushing 
it.  Luckily in most cases it doesn't actually use the full 1GB though.  
RocksDB will throttle before you get to that point so in reality it's 
more likely the WAL is using something like 0-512MB of Disk/RAM with 
2-3 extra buffers of capacity in case things get hairy.
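
To make the arithmetic above concrete, here is a minimal Python sketch that
derives the worst-case WAL footprint from the three settings quoted in this
message. The helper name wal_budget is made up for illustration; only the
three option values come from the message itself.

```python
# Settings quoted in the message above.
max_write_buffer_number = 4
min_write_buffer_number_to_merge = 1
write_buffer_size = 268435456  # bytes

MIB = 1024 ** 2


def wal_budget(buffer_bytes, buffer_count, merge_threshold):
    """Return (worst-case bytes held in WAL buffers, bytes buffered before a flush starts).

    Hypothetical helper: it only multiplies the settings together, it does
    not model RocksDB throttling.
    """
    worst_case = buffer_bytes * buffer_count
    flush_trigger = buffer_bytes * merge_threshold
    return worst_case, flush_trigger


# Default layout: 4 buffers of 256 MiB, flush as soon as 1 is full.
worst, trigger = wal_budget(
    write_buffer_size, max_write_buffer_number, min_write_buffer_number_to_merge
)
print(worst // MIB, trigger // MIB)  # 1024 256

# Alternate layout mentioned above: 16 buffers of 64 MiB, merge 4 at a time.
alt_worst, alt_trigger = wal_budget(64 * MIB, 16, 4)
print(alt_worst // MIB, alt_trigger // MIB)  # 1024 256
```

Both layouts cap out at the same 1 GiB of disk and RAM; they differ only in
how finely the space is sliced, which matches the "slightly more fine
grained control" remark above.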



Mark


On 8/15/19 1:59 AM, Janne Johansson wrote:
> On Thu, 15 Aug 2019 at 00:16, Anthony D'Atri wrote:
>
>> Good points in both posts, but I think there’s still some unclarity.
>>
>> ...
>>
>> We’ve seen good explanations on the list of why only specific DB
>> sizes, say 30GB, are actually used _for the DB_.
>> If the WAL goes along with the DB, shouldn’t we also explicitly
>> determine an appropriate size N for the WAL, and make the
>> partition (30+N) GB?
>> If so, how do we derive N?  Or is it a constant?
>>
>> Filestore was so much simpler, 10GB set+forget for the journal.
>> Not that I miss XFS, mind you.
>
> But we got a simple handwaving-best-effort-guesstimate that went "WAL
> 1GB is fine, yes." so there you have an N you can use for the
> 30+N or 60+N sizings.
> Can't see how that N needs more science than the filestore N=10G you
> showed. Not that I think journal=10G was wrong or anything.
>
> --
> May the most significant bit of your life be positive.



Re: [ceph-users] WAL/DB size

2019-08-15 Thread Janne Johansson
On Thu, 15 Aug 2019 at 00:16, Anthony D'Atri wrote:

> Good points in both posts, but I think there’s still some unclarity.
>

...


> We’ve seen good explanations on the list of why only specific DB sizes,
> say 30GB, are actually used _for the DB_.
> If the WAL goes along with the DB, shouldn’t we also explicitly determine
> an appropriate size N for the WAL, and make the partition (30+N) GB?
> If so, how do we derive N?  Or is it a constant?
>
> Filestore was so much simpler, 10GB set+forget for the journal.  Not that
> I miss XFS, mind you.
>

But we got a simple handwaving-best-effort-guesstimate that went "WAL 1GB
is fine, yes." so there you have an N you can use for the
30+N or 60+N sizings.
Can't see how that N needs more science than the filestore N=10G you
showed. Not that I think journal=10G was wrong or anything.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] WAL/DB size

2019-08-14 Thread Anthony D'Atri
Good points in both posts, but I think there’s still some unclarity.

Absolutely let’s talk about DB and WAL together.  By “bluestore goes on flash” 
I assume you mean WAL+DB?

“Simply allocate DB and WAL will appear there automatically”

Forgive me please if this is obvious, but I’d like to see a holistic 
explanation of WAL and DB sizing *together*, which I think would help folks put 
these concepts together and plan deployments with some sense of confidence.

We’ve seen good explanations on the list of why only specific DB sizes, say 
30GB, are actually used _for the DB_.
If the WAL goes along with the DB, shouldn’t we also explicitly determine an 
appropriate size N for the WAL, and make the partition (30+N) GB?
If so, how do we derive N?  Or is it a constant?

Filestore was so much simpler, 10GB set+forget for the journal.  Not that I 
miss XFS, mind you.


>> Actually standalone WAL is required when you have either very small fast
>> device (and don't want db to use it) or three devices (different in
>> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located
>> at the fastest one.
>> 
>> For the given use case you just have HDD and NVMe and DB and WAL can
>> safely collocate. Which means you don't need to allocate specific volume
>> for WAL. Hence no need to answer the question how many space is needed
>> for WAL. Simply allocate DB and WAL will appear there automatically.
>> 
>> 
> Yes, i'm surprised how often people talk about the DB and WAL separately
> for no good reason.  In common setups bluestore goes on flash and the
> storage goes on the HDDs, simple.
> 
> In the event flash is 100s of GB and would be wasted, is there anything
> that needs to be done to set rocksdb to use the highest level?  600 I
> believe





Re: [ceph-users] WAL/DB size

2019-08-14 Thread Mark Nelson


On 8/14/19 1:06 PM, solarflow99 wrote:
>> Actually standalone WAL is required when you have either very small fast
>> device (and don't want db to use it) or three devices (different in
>> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located
>> at the fastest one.
>>
>> For the given use case you just have HDD and NVMe and DB and WAL can
>> safely collocate. Which means you don't need to allocate specific volume
>> for WAL. Hence no need to answer the question how many space is needed
>> for WAL. Simply allocate DB and WAL will appear there automatically.
>
> Yes, i'm surprised how often people talk about the DB and WAL
> separately for no good reason.  In common setups bluestore goes on
> flash and the storage goes on the HDDs, simple.
>
> In the event flash is 100s of GB and would be wasted, is there
> anything that needs to be done to set rocksdb to use the highest
> level?  600 I believe

When you first set up the OSD you could manually tweak the level 
sizes/multipliers so that one of the level boundaries + WAL falls 
somewhat under the total allocated size of the DB device.  Keep in mind 
that there can be temporary space usage increases due to compaction.  
Ultimately though I think this is a bad approach. The better bet is the 
work that Igor and Adam are doing:



https://github.com/ceph/ceph/pull/28960

https://github.com/ceph/ceph/pull/29047
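
As a rough illustration of the manual tweak described above (not of the
linked PRs), the Python sketch below picks a level base size so that the sum
of all levels plus the WAL lands somewhat under the DB device size. The
function name, the 10% compaction headroom, the 1GB WAL, the multiplier of
10, and the 4 levels are all assumptions chosen for the example, not values
taken from Ceph.

```python
def level_base_for_device(device_gb, wal_gb=1.0, multiplier=10, levels=4,
                          headroom=0.10):
    """Return a level base size (GB) so that all levels + WAL fit the device.

    Hypothetical helper. `headroom` is the fraction of the device kept free
    for temporary compaction growth; 10% is a guess, not a recommendation.
    """
    budget_gb = device_gb * (1 - headroom) - wal_gb
    if budget_gb <= 0:
        raise ValueError("device too small for the WAL plus headroom")
    # Levels are sized base, base*m, base*m^2, ...: solve for the base.
    weight = sum(multiplier ** i for i in range(levels))
    return budget_gb / weight


# Example: the 220GB partition from the original question.
base_gb = level_base_for_device(220)
total_gb = base_gb * sum(10 ** i for i in range(4)) + 1.0
print(f"level base ~{base_gb * 1024:.0f} MB, levels + WAL ~{total_gb:.0f} GB")
```

Under these assumptions the whole level hierarchy ends at about 198GB, so
the last level boundary sits below the 220GB device instead of far beyond it.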


Mark



Re: [ceph-users] WAL/DB size

2019-08-14 Thread solarflow99
> Actually standalone WAL is required when you have either very small fast
> device (and don't want db to use it) or three devices (different in
> performance) behind OSD (e.g. hdd, ssd, nvme). So WAL is to be located
> at the fastest one.
>
> For the given use case you just have HDD and NVMe and DB and WAL can
> safely collocate. Which means you don't need to allocate specific volume
> for WAL. Hence no need to answer the question how many space is needed
> for WAL. Simply allocate DB and WAL will appear there automatically.
>
>
Yes, i'm surprised how often people talk about the DB and WAL separately
for no good reason.  In common setups bluestore goes on flash and the
storage goes on the HDDs, simple.

In the event flash is 100s of GB and would be wasted, is there anything
that needs to be done to set rocksdb to use the highest level?  600 I
believe


Re: [ceph-users] WAL/DB size

2019-08-14 Thread Igor Fedotov

Hi Wido & Hemant.

On 8/14/2019 11:36 AM, Wido den Hollander wrote:
>
> On 8/14/19 9:33 AM, Hemant Sonawane wrote:
>> Hello guys,
>>
>> Thank you so much for your responses really appreciate it. But I would
>> like to mention one more thing which I forgot in my last email is that I
>> am going to use this storage for openstack VM's. So still the answer
>> will be the same that I should use 1GB for wal?
>
> WAL 1GB is fine, yes.


I'd like to argue against this for a bit.

Actually, a standalone WAL is required when you have either a very small 
fast device (and don't want the DB to use it) or three devices of 
differing performance behind the OSD (e.g. HDD, SSD, NVMe). In that case 
the WAL should be located on the fastest one.


For the given use case you just have HDD and NVMe, so DB and WAL can 
safely be collocated. That means you don't need to allocate a specific 
volume for the WAL, and hence there is no need to answer the question of 
how much space the WAL needs. Simply allocate the DB and the WAL will 
appear there automatically.




> As this is an OpenStack/RBD only use-case I would say that 10GB of DB
> per 1TB of disk storage is sufficient.


Given the RocksDB granularity already mentioned in this thread, we tend to 
prefer some fixed allocation sizes, with 30-60GB being close to optimal.


Anyway, I suggest using LVM for the DB/WAL volume and maybe starting with a 
smaller size (e.g. 32GB per OSD), which leaves some spare space on your 
NVMes and allows adding more space if needed. (Just to note: removing 
already allocated but still unused space from an existing OSD and giving 
it to another/new OSD is a more troublesome task than adding some space 
from the spare volume.)
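
A small Python sketch of the planning arithmetic behind this suggestion,
using the hardware from the original question (4 HDDs and 2 SSDs of 450GB,
so 2 OSDs share each SSD) and the 32GB starting size mentioned above. The
function name and the returned dictionary are invented for illustration;
this only does the bookkeeping and does not call LVM.

```python
def plan_db_volumes(ssd_gb, osds_per_ssd, db_gb_per_osd):
    """Return how much of one SSD is allocated to DB/WAL volumes and how much stays spare.

    Hypothetical helper for capacity planning only.
    """
    allocated_gb = osds_per_ssd * db_gb_per_osd
    if allocated_gb > ssd_gb:
        raise ValueError("DB volumes do not fit on the SSD")
    return {"allocated_gb": allocated_gb, "spare_gb": ssd_gb - allocated_gb}


plan = plan_db_volumes(ssd_gb=450, osds_per_ssd=2, db_gb_per_osd=32)
print(plan)  # {'allocated_gb': 64, 'spare_gb': 386}
```

Starting small leaves most of each SSD unallocated, which is the point of
the advice: growing a volume from spare space later is easier than
reclaiming space from an existing OSD.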



On Wed, 14 Aug 2019 at 05:54, Mark Nelson <mnel...@redhat.com> wrote:

 On 8/13/19 3:51 PM, Paul Emmerich wrote:

 > On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander <w...@42on.com> wrote:
 >> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of
 DB in
 >> use. No slow db in use.
 > random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
 > 10GB omap for index and whatever.
 >
 > That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
 > coding and small-ish objects.
 >
 >
 >> I've talked with many people from the community and I don't see an
 >> agreement for the 4% rule.
 > agreed, 4% isn't a reasonable default.
 > I've seen setups with even 10% metadata usage, but these are weird
 > edge cases with very small objects on NVMe-only setups (obviously
 > without a separate DB device).
 >
 > Paul


 I agree, and I did quite a bit of the early space usage analysis.  I
 have a feeling that someone was trying to be well-meaning and make a
 simple ratio for users to target that was big enough to handle the
 majority of use cases.  The problem is that reality isn't that simple
 and one-size-fits all doesn't really work here.


 For RBD you can usually get away with far less than 4%.  A small
 fraction of that is often sufficient.  For tiny (say 4K) RGW objects
 (especially objects with very long names!) you potentially can end up
 using significantly more than 4%. Unfortunately there's no really good
 way for us to normalize this so long as RGW is using OMAP to store
 bucket indexes.  I think the best we can do long run is make it much
 clearer how space is being used on the block/db/wal devices and easier
 for users to shrink/grow the amount of "fast" disk they have on an OSD.
 Alternately we could put bucket indexes into rados objects instead of
 OMAP, but that would be a pretty big project (with it's own challenges
 but potentially also with rewards).


 Mark

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com 
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Thanks and Regards,

Hemant Sonawane




Re: [ceph-users] WAL/DB size

2019-08-14 Thread Burkhard Linke

Hi,


please keep in mind that due to the rocksdb level concept, only certain 
db partition sizes are useful. Larger partitions are a waste of 
capacity, since rocksdb will only use whole level sizes.
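
The stairstep can be sketched in a few lines of Python. The numbers below
are assumptions, not values stated in this message: a level base of 256MB
and a multiplier of 10 (which I believe are the stock RocksDB defaults) and
4 levels. WAL space and compaction overhead are ignored, so the results only
approximate the sizes people quote on the list.

```python
def level_sizes_gb(base_gb=0.25, multiplier=10, levels=4):
    """Size of each level in GB: base, base*m, base*m^2, ..."""
    return [base_gb * multiplier ** i for i in range(levels)]


def useful_db_sizes_gb(base_gb=0.25, multiplier=10, levels=4):
    """Cumulative sizes: a partition is only fully used at these boundaries."""
    sizes, total = [], 0.0
    for level in level_sizes_gb(base_gb, multiplier, levels):
        total += level
        sizes.append(total)
    return sizes


def usable_gb(partition_gb, **kwargs):
    """Largest whole-level boundary that fits in the partition (0.0 if none)."""
    fitting = [s for s in useful_db_sizes_gb(**kwargs) if s <= partition_gb]
    return fitting[-1] if fitting else 0.0


print(useful_db_sizes_gb())  # [0.25, 2.75, 27.75, 277.75]
print(usable_gb(220))        # 27.75
```

Under these assumptions a 220GB partition holds no more whole levels than a
partition of about 30GB, which is the waste described above.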



There has been a lot of discussion about this on the mailing list in 
recent months. A plain XY% of OSD size is just wrong and misleading.



Regards,

Burkhard




Re: [ceph-users] WAL/DB size

2019-08-14 Thread Wido den Hollander


On 8/14/19 9:33 AM, Hemant Sonawane wrote:
> Hello guys,
> 
> Thank you so much for your responses really appreciate it. But I would
> like to mention one more thing which I forgot in my last email is that I
> am going to use this storage for openstack VM's. So still the answer
> will be the same that I should use 1GB for wal?
> 

WAL 1GB is fine, yes.

As this is an OpenStack/RBD only use-case I would say that 10GB of DB
per 1TB of disk storage is sufficient.

> 
> On Wed, 14 Aug 2019 at 05:54, Mark Nelson  > wrote:
> 
> On 8/13/19 3:51 PM, Paul Emmerich wrote:
> 
> > On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander  > wrote:
> >> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of
> DB in
> >> use. No slow db in use.
> > random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
> > 10GB omap for index and whatever.
> >
> > That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
> > coding and small-ish objects.
> >
> >
> >> I've talked with many people from the community and I don't see an
> >> agreement for the 4% rule.
> > agreed, 4% isn't a reasonable default.
> > I've seen setups with even 10% metadata usage, but these are weird
> > edge cases with very small objects on NVMe-only setups (obviously
> > without a separate DB device).
> >
> > Paul
> 
> 
> I agree, and I did quite a bit of the early space usage analysis.  I
> have a feeling that someone was trying to be well-meaning and make a
> simple ratio for users to target that was big enough to handle the
> majority of use cases.  The problem is that reality isn't that simple
> and one-size-fits all doesn't really work here.
> 
> 
> For RBD you can usually get away with far less than 4%.  A small
> fraction of that is often sufficient.  For tiny (say 4K) RGW objects 
> (especially objects with very long names!) you potentially can end up
> using significantly more than 4%. Unfortunately there's no really good
> way for us to normalize this so long as RGW is using OMAP to store
> bucket indexes.  I think the best we can do long run is make it much
> clearer how space is being used on the block/db/wal devices and easier
> for users to shrink/grow the amount of "fast" disk they have on an OSD.
> Alternately we could put bucket indexes into rados objects instead of
> OMAP, but that would be a pretty big project (with it's own challenges
> but potentially also with rewards).
> 
> 
> Mark
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Thanks and Regards,
> 
> Hemant Sonawane
> 


Re: [ceph-users] WAL/DB size

2019-08-14 Thread Hemant Sonawane
Hello guys,

Thank you so much for your responses really appreciate it. But I would like
to mention one more thing which I forgot in my last email is that I am
going to use this storage for openstack VM's. So still the answer will be
the same that I should use 1GB for wal?


On Wed, 14 Aug 2019 at 05:54, Mark Nelson  wrote:

> On 8/13/19 3:51 PM, Paul Emmerich wrote:
>
> > On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander 
> wrote:
> >> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
> >> use. No slow db in use.
> > random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
> > 10GB omap for index and whatever.
> >
> > That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
> > coding and small-ish objects.
> >
> >
> >> I've talked with many people from the community and I don't see an
> >> agreement for the 4% rule.
> > agreed, 4% isn't a reasonable default.
> > I've seen setups with even 10% metadata usage, but these are weird
> > edge cases with very small objects on NVMe-only setups (obviously
> > without a separate DB device).
> >
> > Paul
>
>
> I agree, and I did quite a bit of the early space usage analysis.  I
> have a feeling that someone was trying to be well-meaning and make a
> simple ratio for users to target that was big enough to handle the
> majority of use cases.  The problem is that reality isn't that simple
> and one-size-fits all doesn't really work here.
>
>
> For RBD you can usually get away with far less than 4%.  A small
> fraction of that is often sufficient.  For tiny (say 4K) RGW objects
> (especially objects with very long names!) you potentially can end up
> using significantly more than 4%. Unfortunately there's no really good
> way for us to normalize this so long as RGW is using OMAP to store
> bucket indexes.  I think the best we can do long run is make it much
> clearer how space is being used on the block/db/wal devices and easier
> for users to shrink/grow the amount of "fast" disk they have on an OSD.
> Alternately we could put bucket indexes into rados objects instead of
> OMAP, but that would be a pretty big project (with it's own challenges
> but potentially also with rewards).
>
>
> Mark
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Thanks and Regards,

Hemant Sonawane


Re: [ceph-users] WAL/DB size

2019-08-13 Thread Mark Nelson

On 8/13/19 3:51 PM, Paul Emmerich wrote:
> On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander wrote:
>> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
>> use. No slow db in use.
>
> random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
> 10GB omap for index and whatever.
>
> That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
> coding and small-ish objects.
>
>> I've talked with many people from the community and I don't see an
>> agreement for the 4% rule.
>
> agreed, 4% isn't a reasonable default.
> I've seen setups with even 10% metadata usage, but these are weird
> edge cases with very small objects on NVMe-only setups (obviously
> without a separate DB device).
>
> Paul



I agree, and I did quite a bit of the early space usage analysis.  I 
have a feeling that someone was trying to be well-meaning and make a 
simple ratio for users to target that was big enough to handle the 
majority of use cases.  The problem is that reality isn't that simple 
and one-size-fits-all doesn't really work here.



For RBD you can usually get away with far less than 4%.  A small 
fraction of that is often sufficient.  For tiny (say 4K) RGW objects 
(especially objects with very long names!) you potentially can end up 
using significantly more than 4%. Unfortunately there's no really good 
way for us to normalize this so long as RGW is using OMAP to store 
bucket indexes.  I think the best we can do in the long run is make it 
much clearer how space is being used on the block/db/wal devices and 
easier for users to shrink/grow the amount of "fast" disk they have on 
an OSD. Alternately we could put bucket indexes into rados objects 
instead of OMAP, but that would be a pretty big project (with its own 
challenges but potentially also with rewards).



Mark



Re: [ceph-users] WAL/DB size

2019-08-13 Thread Paul Emmerich
On Tue, Aug 13, 2019 at 10:04 PM Wido den Hollander  wrote:
> I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
> use. No slow db in use.

random rgw-only setup here: 12TB drive, 77% full, 48GB metadata and
10GB omap for index and whatever.

That's 0.5% + 0.1%. And that's a setup that's using mostly erasure
coding and small-ish objects.
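
One reading that reproduces the 0.5% + 0.1% figures, sketched in Python: the
percentages are taken relative to the data actually stored (77% of 12TB),
not the raw drive size, and decimal units (1TB = 1000GB) are used. Both of
those are my assumptions; the message does not say which base was used.

```python
GB_PER_TB = 1000  # assumption: decimal units

drive_gb = 12 * GB_PER_TB
stored_gb = drive_gb * 0.77  # drive is 77% full
metadata_gb = 48
omap_gb = 10

# Share of stored data taken by metadata and by omap, in percent.
metadata_pct = metadata_gb / stored_gb * 100
omap_pct = omap_gb / stored_gb * 100

print(f"metadata {metadata_pct:.1f}%, omap {omap_pct:.1f}%")
```

Measured against the raw 12TB instead, the metadata share would come out
nearer 0.4%, still an order of magnitude below the 4% guideline.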


> I've talked with many people from the community and I don't see an
> agreement for the 4% rule.

agreed, 4% isn't a reasonable default.
I've seen setups with even 10% metadata usage, but these are weird
edge cases with very small objects on NVMe-only setups (obviously
without a separate DB device).

Paul

>
> Wido
>
> >
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> > Wido den Hollander
> > Sent: Tuesday, August 13, 2019 12:51 PM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] WAL/DB size
> >
> >
> >
> > On 8/13/19 5:54 PM, Hemant Sonawane wrote:
> >> Hi All,
> >> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
> >> disk to 220GB for rock.db. So my question is does it make sense to use
> >> wal for my configuration? if yes then what could be the size of it? help
> >> will be really appreciated.
> >
> > Yes, the WAL needs to be about 1GB in size. That should work in allmost
> > all configurations.
> >
> > 220GB is more then you need for the DB as well. It's doesn't hurt, but
> > it's not needed. For each 6TB drive you need about ~60GB of space for
> > the DB.
> >
> > Wido
> >
> >> --
> >> Thanks and Regards,
> >>
> >> Hemant Sonawane
> >>
> >>


Re: [ceph-users] WAL/DB size

2019-08-13 Thread Wido den Hollander


On 8/13/19 10:00 PM, dhils...@performair.com wrote:
> Wildo / Hemant;
> 
> Current recommendations (since at least luminous) say that a block.db device 
> should be at least 4% of the block device.  For a 6 TB drive, this would be 
> 240 GB, not 60 GB.

I know and I don't agree with that. I'm not sure where that number came
from either.

There could be various configurations, but none of the configs I have
seen require that amount of DB space.

I just checked an RGW-only setup. 6TB drive, 58% full, 11.2GB of DB in
use. No slow db in use.

I've talked with many people from the community and I don't see an
agreement for the 4% rule.

Wido

> 
> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director – Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> 
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
> den Hollander
> Sent: Tuesday, August 13, 2019 12:51 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] WAL/DB size
> 
> 
> 
> On 8/13/19 5:54 PM, Hemant Sonawane wrote:
>> Hi All,
>> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
>> disk to 220GB for rock.db. So my question is does it make sense to use
>> wal for my configuration? if yes then what could be the size of it? help
>> will be really appreciated.
> 
> Yes, the WAL needs to be about 1GB in size. That should work in allmost
> all configurations.
> 
> 220GB is more then you need for the DB as well. It's doesn't hurt, but
> it's not needed. For each 6TB drive you need about ~60GB of space for
> the DB.
> 
> Wido
> 
>> -- 
>> Thanks and Regards,
>>
>> Hemant Sonawane
>>
>>
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] WAL/DB size

2019-08-13 Thread DHilsbos
Wido / Hemant;

Current recommendations (since at least luminous) say that a block.db device 
should be at least 4% of the block device.  For a 6 TB drive, this would be 240 
GB, not 60 GB.
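
For reference, the arithmetic behind the two figures can be sketched as follows. This assumes decimal units (1 TB = 1000 GB); block_db_size_gb is a hypothetical helper written for this note, not part of Ceph.

```python
def block_db_size_gb(data_device_gb, percent=4):
    """Size of block.db as a percentage of the data (block) device."""
    return data_device_gb * percent / 100

six_tb_in_gb = 6 * 1000  # decimal units assumed

# The documented "at least 4%" guideline for a 6 TB drive.
print(block_db_size_gb(six_tb_in_gb))

# The ~60 GB suggested earlier in the thread corresponds to about 1%.
print(block_db_size_gb(six_tb_in_gb, percent=1))
```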

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Tuesday, August 13, 2019 12:51 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] WAL/DB size



On 8/13/19 5:54 PM, Hemant Sonawane wrote:
> Hi All,
> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
> disk to 220GB for rock.db. So my question is does it make sense to use
> wal for my configuration? if yes then what could be the size of it? help
> will be really appreciated.

Yes, the WAL needs to be about 1GB in size. That should work in almost
all configurations.

220GB is more than you need for the DB as well. It doesn't hurt, but
it's not needed. For each 6TB drive you need about 60GB of space for
the DB.

Wido

> -- 
> Thanks and Regards,
> 
> Hemant Sonawane
> 
> 


Re: [ceph-users] WAL/DB size

2019-08-13 Thread Wido den Hollander


On 8/13/19 5:54 PM, Hemant Sonawane wrote:
> Hi All,
> I have 4 6TB of HDD and 2 450GB SSD and I am going to partition each
> disk to 220GB for rock.db. So my question is does it make sense to use
> wal for my configuration? if yes then what could be the size of it? help
> will be really appreciated.

Yes, the WAL needs to be about 1GB in size. That should work in almost
all configurations.

220GB is more than you need for the DB as well. It doesn't hurt, but
it's not needed. For each 6TB drive you need about 60GB of space for
the DB.
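
A quick way to compare the two layouts is to count how much of each SSD they consume. This is a sketch under one assumption that is my reading of the question, not something stated outright: the 4 HDDs are spread evenly over the 2 SSDs, so each 450GB SSD carries 2 DB partitions. The function names are illustrative only.

```python
import math

def db_partitions_per_ssd(num_hdds, num_ssds):
    """Number of DB partitions each SSD must hold if HDDs are spread evenly."""
    return math.ceil(num_hdds / num_ssds)

def ssd_space_left_gb(ssd_gb, partitions, partition_gb):
    """Unused SSD capacity after carving out the DB partitions."""
    return ssd_gb - partitions * partition_gb

per_ssd = db_partitions_per_ssd(num_hdds=4, num_ssds=2)
for size_gb in (220, 60):  # planned size vs. the ~60GB suggested above
    left = ssd_space_left_gb(450, per_ssd, size_gb)
    print(f"{per_ssd} x {size_gb}GB per SSD leaves {left}GB unused")
```

Both layouts fit on the SSDs; the difference is only how much space stays free.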

Wido

> -- 
> Thanks and Regards,
> 
> Hemant Sonawane
> 
> 


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Alfredo Deza
On Fri, Sep 7, 2018 at 3:31 PM, Maged Mokhtar  wrote:
> On 2018-09-07 14:36, Alfredo Deza wrote:
>
> > On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
> >
> >> Hi there
> >>
> >> Asking the questions as a newbie. May be asked a number of times before by
> >> many but sorry, it is not clear yet to me.
> >>
> >> 1. The WAL device is just like journaling device used before bluestore. And
> >> CEPH confirms Write to client after writing to it (Before actual write to
> >> primary device)?
> >>
> >> 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
> >> partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
> >> each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
> >> we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
> >
> > A WAL partition would only help if you have a device faster than the
> > SSD where the block.db would go.
> >
> > We recently updated our sizing recommendations for block.db at least
> > 4% of the size of block (also referenced as the data device):
> >
> > http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> >
> > In your case, what you want is to create 5 logical volumes from your
> > 200GB at 40GB each, without a need for a WAL device.
> >
> >> Thanks in advance. Regards.
> >>
> >> Muhammad Junaid
>
> should not the db size depend on the number of objects stored rather than
> their storage size ? or is the new recommendation assuming some average
> object size ?

The latter. You are correct that it should depend on the number of
objects, but objects vary in size depending on the type of
workload: RGW objects are different from RBD objects. So we are
taking a baseline/average object size and recommending based on
that, which comes to roughly 4% of the size of the data device.


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Maged Mokhtar
On 2018-09-07 14:36, Alfredo Deza wrote:

> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid  
> wrote: 
> 
>> Hi there
>> 
>> Asking the questions as a newbie. May be asked a number of times before by
>> many but sorry, it is not clear yet to me.
>> 
>> 1. The WAL device is just like journaling device used before bluestore. And
>> CEPH confirms Write to client after writing to it (Before actual write to
>> primary device)?
>> 
>> 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
>> partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
>> each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
>> we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
> 
> A WAL partition would only help if you have a device faster than the
> SSD where the block.db would go.
> 
> We recently updated our sizing recommendations for block.db at least
> 4% of the size of block (also referenced as the data device):
> 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> 
> In your case, what you want is to create 5 logical volumes from your
> 200GB at 40GB each, without a need for a WAL device.
> 
>> Thanks in advance. Regards.
>> 
>> Muhammad Junaid
>> 

Shouldn't the DB size depend on the number of objects stored rather
than their storage size? Or is the new recommendation assuming some
average object size?


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Brett Chancellor
I saw above that the recommended size for the db partition was 4% of data,
yet the recommendation is 40GB partitions for 4TB drives. Isn't that closer
to 1%?
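
The mismatch is easy to check numerically. The sketch below uses the "at least 4%" figure quoted elsewhere in the thread and assumes decimal units (1 TB = 1000 GB); both helper names are made up for this note.

```python
def db_share_percent(db_gb, data_gb):
    """What percentage of the data device a given block.db size represents."""
    return db_gb * 100 / data_gb

def db_needed_gb(data_gb, percent):
    """block.db size implied by a percentage guideline."""
    return data_gb * percent / 100

data_gb = 4 * 1000  # one 4TB OSD, decimal units assumed

print(db_share_percent(40, data_gb))   # share of a 40GB partition
print(db_needed_gb(data_gb, 4))        # what 4% would ask for per OSD
print(5 * db_needed_gb(data_gb, 4))    # what 4% would ask for across 5 OSDs
```

So 40GB is 1% of a 4TB drive, and following 4% strictly for 5 OSDs would need far more than the single 200GB SSD in the original question.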

On Fri, Sep 7, 2018 at 10:06 AM, Muhammad Junaid 
wrote:

> Thanks very much. It is much clearer now. Since we are still in the
> planning stage, could you tell me whether write speed will be fine in this
> new scenario if we use 7200rpm SAS 3-4TB drives for the OSDs? It will
> apparently be writing to the slower disks before the actual confirmation. (I
> understand there must be advantages to bluestore using direct partitions.)
> Regards.
>
> Muhammad Junaid
>
> On Fri, Sep 7, 2018 at 6:39 PM Richard Hesketh <
> richard.hesk...@rd.bbc.co.uk> wrote:
>
>> It can get confusing.
>>
>> There will always be a WAL, and there will always be a metadata DB, for
>> a bluestore OSD. However, if a separate device is not specified for the
>> WAL, it is kept in the same device/partition as the DB; in the same way,
>> if a separate device is not specified for the DB, it is kept on the same
>> device as the actual data (an "all-in-one" OSD). Unless you have a
>> separate, even faster device for the WAL to go on, you shouldn't specify
>> it separately from the DB; just make one partition on your SSD per OSD,
>> and make them as large as will fit together on the SSD.
>>
>> Also, just to be clear, the WAL is not exactly a journal in the same way
>> that Filestore required a journal. Because Bluestore can provide write
>> atomicity without requiring a separate journal, data is *usually*
>> written directly to the longterm storage; writes are only journalled in
>> the WAL to be flushed/synced later if they're below a certain size (IIRC
>> 32kb by default), to avoid latency by excessive seeking on HDDs.
>>
>> Rich
>>
>> On 07/09/18 14:23, Muhammad Junaid wrote:
>> > Thanks again, but sorry again too. I couldn't understand the following.
>> >
>> > 1. As per docs, blocks.db is used only for bluestore (file system meta
>> > data info etc). It has nothing to do with actual data (for journaling)
>> > which will ultimately written to slower disks.
>> > 2. How will actual journaling will work if there is no WAL (As you
>> > suggested)?
>> >
>> > Regards.
>> >
>> > On Fri, Sep 7, 2018 at 6:09 PM Alfredo Deza wrote:
>> >
>> > On Fri, Sep 7, 2018 at 9:02 AM, Muhammad Junaid wrote:
>> > > Thanks Alfredo. Just to clear that My configuration has 5 OSD's (7200 rpm
>> > > SAS HDDS) which are slower than the 200G SSD. Thats why I asked for a 10G
>> > > WAL partition for each OSD on the SSD.
>> > >
>> > > Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?
>> >
>> > Yes.
>> >
>> > You don't need a separate WAL defined. It only makes sense when you
>> > have something *faster* than where block.db will live.
>> >
>> > In your case 'data' will go in the slower spinning devices, 'block.db'
>> > will go in the SSD, and there is no need for WAL. You would only benefit
>> > from WAL if you had another device, like an NVMe, where 2GB partitions
>> > (or LVs) could be created for block.wal
>> >
>> > > On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza wrote:
>> > >>
>> > >> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
>> > >> > Hi there
>> > >> >
>> > >> > Asking the questions as a newbie. May be asked a number of times before by
>> > >> > many but sorry, it is not clear yet to me.
>> > >> >
>> > >> > 1. The WAL device is just like journaling device used before bluestore. And
>> > >> > CEPH confirms Write to client after writing to it (Before actual write to
>> > >> > primary device)?
>> > >> >
>> > >> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
>> > >> > partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
>> > >> > each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
>> > >> > we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
>> > >>
>> > >> A WAL partition would only help if you have a device faster than the
>> > >> SSD where the block.db would go.
>> > >>
>> > >> We recently updated our sizing recommendations for block.db at least
>> > >> 4% of the size of block (also referenced as the data device):
>> > >>
>> > >> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>> > >>
>> > >> In your case, what you want is to create 5 logical volumes from your
>> > >> 200GB at 40GB each, without a need for a WAL device.

Re: [ceph-users] WAL/DB size

2018-09-07 Thread Muhammad Junaid
Thanks very much. It is much clearer now. Since we are still in the
planning stage, could you tell me whether write speed will be fine in this
new scenario if we use 7200rpm SAS 3-4TB drives for the OSDs? It will
apparently be writing to the slower disks before the actual confirmation. (I
understand there must be advantages to bluestore using direct partitions.)
Regards.

Muhammad Junaid

On Fri, Sep 7, 2018 at 6:39 PM Richard Hesketh 
wrote:

> It can get confusing.
>
> There will always be a WAL, and there will always be a metadata DB, for
> a bluestore OSD. However, if a separate device is not specified for the
> WAL, it is kept in the same device/partition as the DB; in the same way,
> if a separate device is not specified for the DB, it is kept on the same
> device as the actual data (an "all-in-one" OSD). Unless you have a
> separate, even faster device for the WAL to go on, you shouldn't specify
> it separately from the DB; just make one partition on your SSD per OSD,
> and make them as large as will fit together on the SSD.
>
> Also, just to be clear, the WAL is not exactly a journal in the same way
> that Filestore required a journal. Because Bluestore can provide write
> atomicity without requiring a separate journal, data is *usually*
> written directly to the longterm storage; writes are only journalled in
> the WAL to be flushed/synced later if they're below a certain size (IIRC
> 32kb by default), to avoid latency by excessive seeking on HDDs.
>
> Rich
>
> On 07/09/18 14:23, Muhammad Junaid wrote:
> > Thanks again, but sorry again too. I couldn't understand the following.
> >
> > 1. As per docs, blocks.db is used only for bluestore (file system meta
> > data info etc). It has nothing to do with actual data (for journaling)
> > which will ultimately written to slower disks.
> > 2. How will actual journaling will work if there is no WAL (As you
> > suggested)?
> >
> > Regards.
> >
> > On Fri, Sep 7, 2018 at 6:09 PM Alfredo Deza wrote:
> >
> > On Fri, Sep 7, 2018 at 9:02 AM, Muhammad Junaid wrote:
> > > Thanks Alfredo. Just to clear that My configuration has 5 OSD's (7200 rpm
> > > SAS HDDS) which are slower than the 200G SSD. Thats why I asked for a 10G
> > > WAL partition for each OSD on the SSD.
> > >
> > > Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?
> >
> > Yes.
> >
> > You don't need a separate WAL defined. It only makes sense when you
> > have something *faster* than where block.db will live.
> >
> > In your case 'data' will go in the slower spinning devices, 'block.db'
> > will go in the SSD, and there is no need for WAL. You would only benefit
> > from WAL if you had another device, like an NVMe, where 2GB partitions
> > (or LVs) could be created for block.wal
> >
> > > On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza wrote:
> > >>
> > >> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
> > >> > Hi there
> > >> >
> > >> > Asking the questions as a newbie. May be asked a number of times before by
> > >> > many but sorry, it is not clear yet to me.
> > >> >
> > >> > 1. The WAL device is just like journaling device used before bluestore. And
> > >> > CEPH confirms Write to client after writing to it (Before actual write to
> > >> > primary device)?
> > >> >
> > >> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
> > >> > partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
> > >> > each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
> > >> > we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
> > >>
> > >> A WAL partition would only help if you have a device faster than the
> > >> SSD where the block.db would go.
> > >>
> > >> We recently updated our sizing recommendations for block.db at least
> > >> 4% of the size of block (also referenced as the data device):
> > >>
> > >> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> > >>
> > >> In your case, what you want is to create 5 logical volumes from your
> > >> 200GB at 40GB each, without a need for a WAL device.
> > >>
> > >> >
> > >> > Thanks in advance. Regards.
> > >> >
> > >> > Muhammad Junaid
> > >> >

Re: [ceph-users] WAL/DB size

2018-09-07 Thread Richard Hesketh
It can get confusing.

There will always be a WAL, and there will always be a metadata DB, for
a bluestore OSD. However, if a separate device is not specified for the
WAL, it is kept in the same device/partition as the DB; in the same way,
if a separate device is not specified for the DB, it is kept on the same
device as the actual data (an "all-in-one" OSD). Unless you have a
separate, even faster device for the WAL to go on, you shouldn't specify
it separately from the DB; just make one partition on your SSD per OSD,
and make them as large as will fit together on the SSD.
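
The fallback rule in that paragraph can be written down as a tiny lookup. This is a conceptual sketch of the placement logic as described here, not Ceph code; resolve_placement and the device labels are hypothetical.

```python
def resolve_placement(data_dev, db_dev=None, wal_dev=None):
    """Return where data, DB and WAL end up for one bluestore OSD.

    The DB falls back to the data device, and the WAL falls back to
    wherever the DB lives, as described in the paragraph above.
    """
    db = db_dev if db_dev is not None else data_dev
    wal = wal_dev if wal_dev is not None else db
    return {"data": data_dev, "db": db, "wal": wal}

# "All-in-one" OSD: nothing specified besides the data device.
print(resolve_placement("hdd0"))

# One SSD partition per OSD for the DB; the WAL follows the DB.
print(resolve_placement("hdd0", db_dev="ssd0-part1"))

# Separate, even faster device for the WAL.
print(resolve_placement("hdd0", db_dev="ssd0-part1", wal_dev="nvme0-part1"))
```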

Also, just to be clear, the WAL is not exactly a journal in the same way
that Filestore required a journal. Because Bluestore can provide write
atomicity without requiring a separate journal, data is *usually*
written directly to the longterm storage; writes are only journalled in
the WAL to be flushed/synced later if they're below a certain size (IIRC
32kb by default), to avoid latency by excessive seeking on HDDs.
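
The size check can be pictured as a simple threshold test. This is a conceptual sketch only, not BlueStore's actual code: the 32 KiB default is the "IIRC" figure from above, and choose_write_path is a hypothetical name.

```python
# Threshold taken from the "IIRC 32kb by default" remark; treat it as an assumption.
DEFERRED_THRESHOLD_BYTES = 32 * 1024

def choose_write_path(size_bytes, threshold=DEFERRED_THRESHOLD_BYTES):
    """Small writes are journalled in the WAL and flushed later;
    everything else goes straight to long-term storage."""
    if size_bytes < threshold:
        return "wal-then-flush"
    return "direct"

for size in (4 * 1024, 32 * 1024, 4 * 1024 * 1024):
    print(size, choose_write_path(size))
```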

Rich

On 07/09/18 14:23, Muhammad Junaid wrote:
> Thanks again, but sorry again too. I couldn't understand the following.
> 
> 1. As per docs, blocks.db is used only for bluestore (file system meta
> data info etc). It has nothing to do with actual data (for journaling)
> which will ultimately written to slower disks. 
> 2. How will actual journaling will work if there is no WAL (As you
> suggested)?
> 
> Regards.
> 
> On Fri, Sep 7, 2018 at 6:09 PM Alfredo Deza wrote:
> 
> On Fri, Sep 7, 2018 at 9:02 AM, Muhammad Junaid wrote:
> > Thanks Alfredo. Just to clear that My configuration has 5 OSD's (7200 rpm
> > SAS HDDS) which are slower than the 200G SSD. Thats why I asked for a 10G
> > WAL partition for each OSD on the SSD.
> >
> > Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?
> 
> Yes.
> 
> You don't need a separate WAL defined. It only makes sense when you
> have something *faster* than where block.db will live.
> 
> In your case 'data' will go in the slower spinning devices, 'block.db'
> will go in the SSD, and there is no need for WAL. You would only benefit
> from WAL if you had another device, like an NVMe, where 2GB partitions
> (or LVs) could be created for block.wal
> 
> > On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza wrote:
> >>
> >> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
> >> > Hi there
> >> >
> >> > Asking the questions as a newbie. May be asked a number of times before by
> >> > many but sorry, it is not clear yet to me.
> >> >
> >> > 1. The WAL device is just like journaling device used before bluestore. And
> >> > CEPH confirms Write to client after writing to it (Before actual write to
> >> > primary device)?
> >> >
> >> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
> >> > partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
> >> > each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
> >> > we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
> >>
> >> A WAL partition would only help if you have a device faster than the
> >> SSD where the block.db would go.
> >>
> >> We recently updated our sizing recommendations for block.db at least
> >> 4% of the size of block (also referenced as the data device):
> >>
> >> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> >>
> >> In your case, what you want is to create 5 logical volumes from your
> >> 200GB at 40GB each, without a need for a WAL device.
> >>
> >> >
> >> > Thanks in advance. Regards.
> >> >
> >> > Muhammad Junaid
> >> >


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Muhammad Junaid
Thanks again, but sorry again too. I couldn't understand the following.

1. As per the docs, block.db is used only for bluestore metadata (file system
metadata info etc.). It has nothing to do with the actual data (for journaling),
which will ultimately be written to the slower disks.
2. How will the actual journaling work if there is no WAL (as you
suggested)?

Regards.

On Fri, Sep 7, 2018 at 6:09 PM Alfredo Deza  wrote:

> On Fri, Sep 7, 2018 at 9:02 AM, Muhammad Junaid 
> wrote:
> > Thanks Alfredo. Just to clear that My configuration has 5 OSD's (7200 rpm
> > SAS HDDS) which are slower than the 200G SSD. Thats why I asked for a 10G
> > WAL partition for each OSD on the SSD.
> >
> > Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?
>
> Yes.
>
> You don't need a separate WAL defined. It only makes sense when you
> have something *faster* than where block.db will live.
>
> In your case 'data' will go in the slower spinning devices, 'block.db'
> will go in the SSD, and there is no need for WAL. You would only benefit
> from WAL if you had another device, like an NVMe, where 2GB partitions
> (or LVs) could be created for block.wal
>
> > On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza wrote:
> >>
> >> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
> >> > Hi there
> >> >
> >> > Asking the questions as a newbie. May be asked a number of times before by
> >> > many but sorry, it is not clear yet to me.
> >> >
> >> > 1. The WAL device is just like journaling device used before bluestore. And
> >> > CEPH confirms Write to client after writing to it (Before actual write to
> >> > primary device)?
> >> >
> >> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
> >> > partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
> >> > each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
> >> > we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
> >>
> >> A WAL partition would only help if you have a device faster than the
> >> SSD where the block.db would go.
> >>
> >> We recently updated our sizing recommendations for block.db at least
> >> 4% of the size of block (also referenced as the data device):
> >>
> >> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> >>
> >> In your case, what you want is to create 5 logical volumes from your
> >> 200GB at 40GB each, without a need for a WAL device.
> >>
> >> >
> >> > Thanks in advance. Regards.
> >> >
> >> > Muhammad Junaid
> >> >


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Alfredo Deza
On Fri, Sep 7, 2018 at 9:02 AM, Muhammad Junaid  wrote:
> Thanks Alfredo. Just to clear that My configuration has 5 OSD's (7200 rpm
> SAS HDDS) which are slower than the 200G SSD. Thats why I asked for a 10G
> WAL partition for each OSD on the SSD.
>
> Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?

Yes.

You don't need a separate WAL defined. It only makes sense when you
have something *faster* than where block.db will live.

In your case 'data' will go in the slower spinning devices, 'block.db'
will go in the SSD, and there is no need for WAL. You would only benefit
from WAL if you had another device, like an NVMe, where 2GB partitions
(or LVs) could be created for block.wal.
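
The "only when it is faster" rule can be sketched as a comparison of device classes. The ranking below (hdd < ssd < nvme) is an assumption made for illustration, and separate_wal_helps is a hypothetical helper, not a Ceph setting.

```python
# Assumed relative speed ranking of device classes, for illustration only.
SPEED_RANK = {"hdd": 0, "ssd": 1, "nvme": 2}

def separate_wal_helps(db_device_class, wal_device_class):
    """A separate block.wal only pays off on a device faster than block.db's."""
    return SPEED_RANK[wal_device_class] > SPEED_RANK[db_device_class]

# WAL partition on the same SSD class as block.db: no benefit.
print(separate_wal_helps("ssd", "ssd"))

# WAL on NVMe while block.db sits on SSD: this is the case that helps.
print(separate_wal_helps("ssd", "nvme"))
```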


>
> On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza  wrote:
>>
>> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
>> > Hi there
>> >
>> > Asking the questions as a newbie. May be asked a number of times before by
>> > many but sorry, it is not clear yet to me.
>> >
>> > 1. The WAL device is just like journaling device used before bluestore. And
>> > CEPH confirms Write to client after writing to it (Before actual write to
>> > primary device)?
>> >
>> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
>> > partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
>> > each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
>> > we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
>>
>> A WAL partition would only help if you have a device faster than the
>> SSD where the block.db would go.
>>
>> We recently updated our sizing recommendations for block.db at least
>> 4% of the size of block (also referenced as the data device):
>>
>>
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>>
>> In your case, what you want is to create 5 logical volumes from your
>> 200GB at 40GB each, without a need for a WAL device.
>>
>>
>> >
>> > Thanks in advance. Regards.
>> >
>> > Muhammad Junaid
>> >


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Eugen Block

Hi,

> Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?

Yes. By default Ceph places the WAL (block.wal) on the same device as
block.db if no separate WAL device is specified.

Regards,
Eugen


Quoting Muhammad Junaid:


> Thanks Alfredo. Just to clear that My configuration has 5 OSD's (7200 rpm
> SAS HDDS) which are slower than the 200G SSD. Thats why I asked for a 10G
> WAL partition for each OSD on the SSD.
>
> Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?
>
> On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza wrote:
>
>> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
>> > Hi there
>> >
>> > Asking the questions as a newbie. May be asked a number of times before by
>> > many but sorry, it is not clear yet to me.
>> >
>> > 1. The WAL device is just like journaling device used before bluestore. And
>> > CEPH confirms Write to client after writing to it (Before actual write to
>> > primary device)?
>> >
>> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
>> > partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
>> > each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
>> > we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
>>
>> A WAL partition would only help if you have a device faster than the
>> SSD where the block.db would go.
>>
>> We recently updated our sizing recommendations for block.db at least
>> 4% of the size of block (also referenced as the data device):
>>
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>>
>> In your case, what you want is to create 5 logical volumes from your
>> 200GB at 40GB each, without a need for a WAL device.
>>
>> > Thanks in advance. Regards.
>> >
>> > Muhammad Junaid


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Muhammad Junaid
Thanks Alfredo. Just to clarify: my configuration has 5 OSDs (7200 rpm
SAS HDDs), which are slower than the 200G SSD. That's why I asked for a 10G
WAL partition for each OSD on the SSD.

Are you asking us to do 40GB * 5 partitions on the SSD just for block.db?

On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza  wrote:

> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid wrote:
> > Hi there
> >
> > Asking the questions as a newbie. May be asked a number of times before by
> > many but sorry, it is not clear yet to me.
> >
> > 1. The WAL device is just like journaling device used before bluestore. And
> > CEPH confirms Write to client after writing to it (Before actual write to
> > primary device)?
> >
> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
> > partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
> > each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
> > we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?
>
> A WAL partition would only help if you have a device faster than the
> SSD where the block.db would go.
>
> We recently updated our sizing recommendations for block.db at least
> 4% of the size of block (also referenced as the data device):
>
>
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>
> In your case, what you want is to create 5 logical volumes from your
> 200GB at 40GB each, without a need for a WAL device.
>
>
> >
> > Thanks in advance. Regards.
> >
> > Muhammad Junaid
> >


Re: [ceph-users] WAL/DB size

2018-09-07 Thread Alfredo Deza
On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid  wrote:
> Hi there
>
> Asking the questions as a newbie. May be asked a number of times before by
> many but sorry, it is not clear yet to me.
>
> 1. The WAL device is just like journaling device used before bluestore. And
> CEPH confirms Write to client after writing to it (Before actual write to
> primary device)?
>
> 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD. Should we
> partition SSD in 10 partitions? Shoud/Can we set WAL Partition Size against
> each OSD as 10GB? Or what min/max we should set for WAL Partition? And can
> we set remaining 150GB as (30GB * 5) for 5 db partitions for all OSD's?

A WAL partition would only help if you have a device faster than the
SSD where the block.db would go.

We recently updated our sizing recommendation for block.db to at least
4% of the size of block (also referred to as the data device):

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing

In your case, what you want is to create 5 logical volumes from your
200GB at 40GB each, without a need for a WAL device.
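
To see how the two layouts compare on the same 200GB SSD, here is a small sketch of the arithmetic. The helper names are made up for this note, and the figures are simply the ones from the question and the answer above.

```python
def split_layout_total_gb(num_osds, wal_gb, db_gb):
    """SSD space used when each OSD gets a separate WAL and DB partition."""
    return num_osds * (wal_gb + db_gb)

def even_db_lv_gb(ssd_gb, num_osds):
    """Size of each block.db logical volume when the SSD is split evenly."""
    return ssd_gb // num_osds

# Layout from the question: 5 x (10GB WAL + 30GB DB) in 10 partitions.
print(split_layout_total_gb(num_osds=5, wal_gb=10, db_gb=30))

# Layout suggested above: 5 logical volumes for block.db only.
print(even_db_lv_gb(ssd_gb=200, num_osds=5))
```

Both layouts consume the whole SSD; the suggested one just keeps the WAL inside a larger block.db volume instead of giving it its own partition.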


>
> Thanks in advance. Regards.
>
> Muhammad Junaid
>