Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2018-01-30 Thread Wido den Hollander



On 11/03/2017 02:43 PM, Mark Nelson wrote:



On 11/03/2017 08:25 AM, Wido den Hollander wrote:



On 3 November 2017 at 13:33, Mark Nelson wrote:




On 11/03/2017 02:44 AM, Wido den Hollander wrote:


On 3 November 2017 at 0:09, Nigel Williams wrote:



On 3 November 2017 at 07:45, Martin Overgaard Hansen wrote:
I want to bring this subject back in the light and hope someone can provide
insight regarding the issue, thanks.


Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? In other words, is there a point where, for a given set of OSDs
(number + size), the DB partition is sized too large and is wasting
resources? I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.



It depends on the size of your backing disk. The DB will grow with the
number of objects you have on your OSD.

A 4TB drive will hold more objects than a 1TB drive (usually); the same
goes for a 10TB vs a 6TB drive.


From what I've seen so far there is no such thing as a 'too big' DB.

The tests I've done so far suggest that filling up a 50GB DB is rather
hard to do, unless you have billions of objects, and thus tens of
millions of objects per OSD.


Are you doing RBD, RGW, or something else to test?  What size are the
objects, and are you fragmenting them?


Let's say the avg overhead is 16k; you would then need a 150GB DB for 10M
objects.


You could look into your current numbers and check how many objects 
you have per OSD.


I checked a couple of Ceph clusters I run and see about 1M objects
per OSD, but others only have around 250k objects per OSD.


In all those cases, even with 32k per object, you would need a 30GB DB for 1M
objects in that OSD.



The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change (update) intensity, size of OSD, etc.,
plus a rule of thumb.



I would check your running Ceph clusters and calculate the amount of 
objects per OSD.


total objects * 3 / num OSDs (for 3x replication)
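
For illustration, a back-of-envelope version of that calculation; a sketch
assuming replica-3 pools and the Luminous-era JSON layout of 'ceph df'
(adjust the factor for your replication or EC profile):

# rough object copies per OSD, assuming 3x replication
TOTAL_OBJECTS=$(ceph df --format json | jq '[.pools[].stats.objects] | add')
NUM_OSDS=$(ceph osd ls | wc -l)
echo "~$(( TOTAL_OBJECTS * 3 / NUM_OSDS )) object copies per OSD"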


One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (i.e.,
the number of objects).  The space used per object might be different at
10M objects and 50M objects.



True. But how many systems do we have out there with 10M objects in 
ONE OSD?


The systems I checked range from 250k to 1M objects per OSD. Of course,
statistics aren't a golden rule, but users will want some
guideline on how to size their DB.


That's actually something I would really like better insight into.  I 
don't feel like I have a sufficient understanding of how many 
objects/OSD people are really deploying in the field.  I figure 10M/OSD 
is probably a reasonable "typical" upper limit for HDDs, but I could see 
some use cases with flash backed SSDs pushing far more.




A few months later I've gathered some more data and wrote a script to 
quickly query it on OSDs: 
https://gist.github.com/wido/875d531692a922d608b9392e1766405d
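
The heart of such a query is presumably the two perf counters used
throughout this thread; a minimal sketch of the same calculation for a
single OSD (the actual gist may do more):

# bytes of RocksDB (DB) space consumed per onode on this OSD
ONODES=$(ceph daemon osd.0 perf dump | jq '.bluestore.bluestore_onodes')
DB_BYTES=$(ceph daemon osd.0 perf dump | jq '.bluefs.db_used_bytes')
echo $(( DB_BYTES / ONODES ))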


I fetched information from a few systems running with BlueStore.

So far the largest value I found on systems running with RBD is 24k per
onode.


This OSD reported 70k onodes in its database, with a total DB size of
about 1.5GB.


As most deployments I see out there are RBD, those are the ones I can get
the most information from.


The avg object size I saw was 2.8MB.

So let's say you would like to fill an OSD with 2TB of data. With an avg
object size of 2.8MB you would have 714k objects on that OSD.


714k objects * 24k per onode = 16GB DB

The rule of thumb I've been using now is 10GB DB per 1TB of OSD storage. 
For now this seems to work out for me in all the cases I have seen.
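
As a worked example of that rule of thumb (a hypothetical 4TB OSD; the
cross-check uses the 2.8MB/24k figures from above):

# 10GB of DB per 1TB of OSD storage
echo "$(( 4 * 10 ))GB DB for a 4TB OSD"   # -> 40GB
# cross-check via the onode math: 4TB / 2.8MB avg object size ~= 1.4M
# onodes, and 1.4M * 24k per onode ~= 34GB, so 40GB leaves some headroom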


I'm not saying it applies to every case, but the cases I've seen so far
seem to hold up.


If your average object size drops you will get more onodes per TB and 
thus have a larger DB.


I'm just trying to gather information so people designing their system 
have something to work with.


Wido



A WAL of 1GB-2GB should be sufficient, right?


Yep.  On the surface this appears to be a simple question, but a much 
deeper question is what are we actually doing with the WAL?  How should 
we be storing PG log and dup ops data?  How can we get away from the 
large WAL buffers and memtables we have now?  These are questions we are 
actively working on solving.  For the moment though, having multiple (4) 
256MB WAL buffers appears to give us the best performance despite 
resulting in large memtables, so 1-2GB for the WAL is right.
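
Those buffers come from the RocksDB tuning BlueStore ships with. If you
want to see what a running OSD is actually using, the options can be
inspected over the admin socket; a sketch assuming Luminous option names
(an inspection aid, not a tuning recommendation):

# dump the RocksDB options BlueStore passes down; look for
# max_write_buffer_number=4 and write_buffer_size=268435456 (256MB)
ceph daemon osd.0 config get bluestore_rocksdb_options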


Mark



Wido



Wido


An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size, say by 10%, each time.

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Maged Mokhtar
On 2017-11-03 15:59, Wido den Hollander wrote:

> On 3 November 2017 at 14:43, Mark Nelson wrote:
> 
> On 11/03/2017 08:25 AM, Wido den Hollander wrote: 
> On 3 November 2017 at 13:33, Mark Nelson wrote:
> 
> On 11/03/2017 02:44 AM, Wido den Hollander wrote: 
> On 3 November 2017 at 0:09, Nigel Williams wrote:
> 
> On 3 November 2017 at 07:45, Martin Overgaard Hansen wrote:
> I want to bring this subject back in the light and hope someone can provide
> insight regarding the issue, thanks.
> Thanks Martin, I was going to do the same.
> 
> Is it possible to make the DB partition (on the fastest device) too
> big? In other words, is there a point where, for a given set of OSDs
> (number + size), the DB partition is sized too large and is wasting
> resources? I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.

It depends on the size of your backing disk. The DB will grow with the
number of objects you have on your OSD.

A 4TB drive will hold more objects than a 1TB drive (usually); the same goes
for a 10TB vs a 6TB drive.

From what I've seen so far there is no such thing as a 'too big' DB.

The tests I've done so far suggest that filling up a 50GB DB is
rather hard to do, unless you have billions of objects, and thus tens of
millions of objects per OSD.
Are you doing RBD, RGW, or something else to test?  What size are the
objects, and are you fragmenting them?

> Let's say the avg overhead is 16k; you would then need a 150GB DB for 10M objects.
> 
> You could look into your current numbers and check how many objects you have 
> per OSD.
> 
> I checked a couple of Ceph clusters I run and see about 1M objects per OSD, 
> but others only have around 250k objects per OSD.
> 
> In all those cases, even with 32k per object, you would need a 30GB DB for 1M objects in 
> that OSD.
> 
>> The answer could be couched as some intersection of pool type (RBD /
>> RADOS / CephFS), object change (update) intensity, size of OSD, etc.,
>> plus a rule of thumb.
> 
> I would check your running Ceph clusters and calculate the amount of objects 
> per OSD.
> 
> total objects * 3 / num OSDs (for 3x replication)

One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (i.e.,
the number of objects).  The space used per object might be different at
10M objects and 50M objects.

True. But how many systems do we have out there with 10M objects in ONE
OSD?

The systems I checked range from 250k to 1M objects per OSD. Of course,
statistics aren't a golden rule, but users will want some
guideline on how to size their DB. 
That's actually something I would really like better insight into.  I 
don't feel like I have a sufficient understanding of how many 
objects/OSD people are really deploying in the field.  I figure 10M/OSD 
is probably a reasonable "typical" upper limit for HDDs, but I could see
some use cases with flash-backed SSDs pushing far more. 
Would a poll on the ceph-users list work? I understand that you require
such feedback to make a proper judgement.

I know of one cluster which has 10M objects (heavy, heavy, heavy RGW
user) in about 400TB of data.

All other clusters I've seen aren't that high on the number of objects.
They are usually high on data, since they have an RBD use case, which means
a lot of 4M objects.

You could also ask users to use this tool:
https://github.com/42on/ceph-collect

That tarball would give you a lot of information about the cluster and
the amount of objects per OSD and PG.

Wido

>> A WAL of 1GB-2GB should be sufficient, right?
> 
> Yep.  On the surface this appears to be a simple question, but a much 
> deeper question is what are we actually doing with the WAL?  How should 
> we be storing PG log and dup ops data?  How can we get away from the 
> large WAL buffers and memtables we have now?  These are questions we are 
> actively working on solving.  For the moment though, having multiple (4) 
> 256MB WAL buffers appears to give us the best performance despite 
> resulting in large memtables, so 1-2GB for the WAL is right.
> 
> Mark
> 
> Wido
> 
> Wido
> 
> An idea occurred to me that by monitoring for the logged spill message
> (the event when the DB partition spills/overflows to the OSD), OSDs
> could be (lazily) destroyed and recreated with a new DB partition
> increased in size, say by 10%, each time.

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Willem Jan Withagen
On 3-11-2017 00:09, Nigel Williams wrote:
> On 3 November 2017 at 07:45, Martin Overgaard Hansen wrote:
>> I want to bring this subject back in the light and hope someone can provide
>> insight regarding the issue, thanks.

> Is it possible to make the DB partition (on the fastest device) too
> big? In other words, is there a point where, for a given set of OSDs
> (number + size), the DB partition is sized too large and is wasting
> resources? I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.

Wasting resources is probably relative.

SSDs have a limited lifetime, and Ceph is a seriously hard (ab)user of
SSD wear.

Now if you overdimension the allocated space, it looks like it is not
used. But underneath, the SSD firmware spreads writes out over all
cells of the SSD, so the wear is evenly distributed over all components
of the SSD.

And by overcommitting you have thus prolonged the life of your SSD.

So it is either buy more now and replace less,
or allocate strictly and replace sooner.

--WjW



Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson



On 11/03/2017 08:25 AM, Wido den Hollander wrote:



On 3 November 2017 at 13:33, Mark Nelson wrote:




On 11/03/2017 02:44 AM, Wido den Hollander wrote:



On 3 November 2017 at 0:09, Nigel Williams wrote:


On 3 November 2017 at 07:45, Martin Overgaard Hansen  wrote:

I want to bring this subject back in the light and hope someone can provide
insight regarding the issue, thanks.


Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? In other words, is there a point where, for a given set of OSDs
(number + size), the DB partition is sized too large and is wasting
resources? I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.



It depends on the size of your backing disk. The DB will grow with the number of 
objects you have on your OSD.

A 4TB drive will hold more objects than a 1TB drive (usually); the same goes for a 
10TB vs a 6TB drive.

From what I've seen so far there is no such thing as a 'too big' DB.

The tests I've done so far suggest that filling up a 50GB DB is rather 
hard to do, unless you have billions of objects, and thus tens of millions of 
objects per OSD.


Are you doing RBD, RGW, or something else to test?  What size are the
objects, and are you fragmenting them?


Let's say the avg overhead is 16k; you would then need a 150GB DB for 10M objects.

You could look into your current numbers and check how many objects you have 
per OSD.

I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but 
others only have around 250k objects per OSD.

In all those cases, even with 32k per object, you would need a 30GB DB for 1M objects in 
that OSD.


The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change (update) intensity, size of OSD, etc.,
plus a rule of thumb.



I would check your running Ceph clusters and calculate the amount of objects 
per OSD.

total objects * 3 / num OSDs (for 3x replication)


One nagging concern I have in the back of my mind is that the amount of
space amplification in rocksdb might grow with the number of levels (i.e.,
the number of objects).  The space used per object might be different at
10M objects and 50M objects.



True. But how many systems do we have out there with 10M objects in ONE OSD?

The systems I checked range from 250k to 1M objects per OSD. Of course, 
statistics aren't a golden rule, but users will want some guideline on how to 
size their DB.


That's actually something I would really like better insight into.  I 
don't feel like I have a sufficient understanding of how many 
objects/OSD people are really deploying in the field.  I figure 10M/OSD 
is probably a reasonable "typical" upper limit for HDDs, but I could see 
some use cases with flash backed SSDs pushing far more.




A WAL of 1GB-2GB should be sufficient, right?


Yep.  On the surface this appears to be a simple question, but a much 
deeper question is what are we actually doing with the WAL?  How should 
we be storing PG log and dup ops data?  How can we get away from the 
large WAL buffers and memtables we have now?  These are questions we are 
actively working on solving.  For the moment though, having multiple (4) 
256MB WAL buffers appears to give us the best performance despite 
resulting in large memtables, so 1-2GB for the WAL is right.


Mark



Wido



Wido


An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size, say by 10%, each time.


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Wido den Hollander

> On 3 November 2017 at 13:33, Mark Nelson wrote:
> 
> 
> 
> 
> On 11/03/2017 02:44 AM, Wido den Hollander wrote:
> >
> >> On 3 November 2017 at 0:09, Nigel Williams wrote:
> >>
> >>
> >> On 3 November 2017 at 07:45, Martin Overgaard Hansen wrote:
> >>> I want to bring this subject back in the light and hope someone can 
> >>> provide
> >>> insight regarding the issue, thanks.
> >>
> >> Thanks Martin, I was going to do the same.
> >>
> >> Is it possible to make the DB partition (on the fastest device) too
> >> big? In other words, is there a point where, for a given set of OSDs
> >> (number + size), the DB partition is sized too large and is wasting
> >> resources? I recall a comment by someone proposing to split up a
> >> single large (fast) SSD into 100GB partitions for each OSD.
> >>
> >
> > It depends on the size of your backing disk. The DB will grow with the 
> > number of objects you have on your OSD.
> >
> > A 4TB drive will hold more objects than a 1TB drive (usually); the same goes 
> > for a 10TB vs a 6TB drive.
> >
> > From what I've seen so far there is no such thing as a 'too big' DB.
> >
> > The tests I've done so far suggest that filling up a 50GB DB is 
> > rather hard to do, unless you have billions of objects, and thus tens of 
> > millions of objects per OSD.
> 
> Are you doing RBD, RGW, or something else to test?  What size are the 
> objects, and are you fragmenting them?
> >
> > Let's say the avg overhead is 16k; you would then need a 150GB DB for 10M objects.
> >
> > You could look into your current numbers and check how many objects you 
> > have per OSD.
> >
> > I checked a couple of Ceph clusters I run and see about 1M objects per OSD, 
> > but others only have around 250k objects per OSD.
> >
> > In all those cases, even with 32k per object, you would need a 30GB DB for 1M objects 
> > in that OSD.
> >
> >> The answer could be couched as some intersection of pool type (RBD /
> >> RADOS / CephFS), object change (update) intensity, size of OSD, etc.,
> >> plus a rule of thumb.
> >>
> >
> > I would check your running Ceph clusters and calculate the amount of 
> > objects per OSD.
> >
> > total objects * 3 / num OSDs (for 3x replication)
> 
> One nagging concern I have in the back of my mind is that the amount of 
> space amplification in rocksdb might grow with the number of levels (i.e., 
> the number of objects).  The space used per object might be different at 
> 10M objects and 50M objects.
> 

True. But how many systems do we have out there with 10M objects in ONE OSD?

The systems I checked range from 250k to 1M objects per OSD. Of course, 
statistics aren't a golden rule, but users will want some guideline on how to 
size their DB.

A WAL of 1GB-2GB should be sufficient, right?

Wido

> >
> > Wido
> >
> > An idea occurred to me that by monitoring for the logged spill message
> > (the event when the DB partition spills/overflows to the OSD), OSDs
> > could be (lazily) destroyed and recreated with a new DB partition
> > increased in size, say by 10%, each time.


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson



On 11/03/2017 04:08 AM, Jorge Pinilla López wrote:

Well, I haven't found any recommendation either, but I think that
sometimes the SSD space is being wasted.


If someone wanted to write it, you could have bluefs share some of the 
space on the drive for hot object data and release space as needed for 
the DB.  I'd very much recommend keeping the promotion rate incredibly low.




I was thinking about making an OSD from the rest of my SSD space, but it
wouldn't scale in case more speed is needed.


I think there's a temptation to try to shove more stuff on the SSD, but 
honestly I'm not sure it's a great idea.  These drives are already 
handling WAL and DB traffic, potentially for multiple OSDs.  If you have 
a very read-centric workload or are using drives with high write 
endurance, that's one thing.  From a monetary perspective, think 
carefully about how much drive endurance and MTTF matter to you.




Another option I asked about was using bcache, or a mix of bcache and small
DB partitions, but the only replies mentioned corruption problems, so I decided
not to do it.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021535.html

I think a good idea would be to use the space needed to store the hot DB
and use the rest as a cache (at least a read cache).


Given that bluestore is already storing all of the metadata in rocksdb, 
putting the DB partition on flash is already going to buy you a lot. 
Having said that, something that could let the DB and a cache 
share/reclaim space on the SSD could be interesting.  It won't be a cure 
all, but at least could provide a small improvement so long as the 
promotion overhead is kept very low.




I don't really know a lot about this topic, but I think that maybe giving
50GB of a really expensive SSD is pointless if it's only using 10GB.


Think of it less as "space" and more of it as cells of write endurance. 
That's really what you are buying.  Whether that's a small drive with 
high write endurance or a big drive with low write endurance.  Some may 
have better properties for reads, some may have power-loss-protection 
that allows O_DSYNC writes to go much faster.  As far as the WAL and DB 
goes, it's all about how many writes you can get out of the drive before 
it goes kaput.
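
To make the "cells of write endurance" framing concrete, a bit of
illustrative arithmetic (hypothetical drive, numbers not from this thread):

# a 400GB SSD rated for 1 DWPD (drive write per day) over 5 years
echo "$(( 400 * 365 * 5 )) GB of rated writes"   # -> 730000 GB, ~730TB
# divide that budget by the combined WAL+DB write rate of all OSDs
# sharing the drive to estimate its service life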




On 02/11/2017 at 21:45, Martin Overgaard Hansen wrote:


Hi, it seems like I’m in the same boat as everyone else in
this particular thread.

I’m also unable to find any guidelines or recommendations regarding
sizing of the wal and / or db.

I want to bring this subject back in the light and hope someone can
provide insight regarding the issue, thanks.

Best Regards,
Martin Overgaard Hansen

MultiHouse IT Partner A/S





--

*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Systems area intern (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A








Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Mark Nelson



On 11/03/2017 02:44 AM, Wido den Hollander wrote:



On 3 November 2017 at 0:09, Nigel Williams wrote:


On 3 November 2017 at 07:45, Martin Overgaard Hansen  wrote:

I want to bring this subject back in the light and hope someone can provide
insight regarding the issue, thanks.


Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? In other words, is there a point where, for a given set of OSDs
(number + size), the DB partition is sized too large and is wasting
resources? I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.



It depends on the size of your backing disk. The DB will grow with the number of 
objects you have on your OSD.

A 4TB drive will hold more objects than a 1TB drive (usually); the same goes for a 
10TB vs a 6TB drive.

From what I've seen so far there is no such thing as a 'too big' DB.

The tests I've done so far suggest that filling up a 50GB DB is rather 
hard to do, unless you have billions of objects, and thus tens of millions of 
objects per OSD.


Are you doing RBD, RGW, or something else to test?  What size are the 
objects, and are you fragmenting them?


Let's say the avg overhead is 16k; you would then need a 150GB DB for 10M objects.

You could look into your current numbers and check how many objects you have 
per OSD.

I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but 
others only have around 250k objects per OSD.

In all those cases, even with 32k per object, you would need a 30GB DB for 1M objects in 
that OSD.


The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change (update) intensity, size of OSD, etc.,
plus a rule of thumb.



I would check your running Ceph clusters and calculate the amount of objects 
per OSD.

total objects * 3 / num OSDs (for 3x replication)


One nagging concern I have in the back of my mind is that the amount of 
space amplification in rocksdb might grow with the number of levels (i.e., 
the number of objects).  The space used per object might be different at 
10M objects and 50M objects.




Wido


An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size, say by 10%, each time.


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Jorge Pinilla López
Well, I haven't found any recommendation either, but I think that
sometimes the SSD space is being wasted.

I was thinking about making an OSD from the rest of my SSD space, but it
wouldn't scale in case more speed is needed.

Another option I asked about was using bcache, or a mix of bcache and small
DB partitions, but the only replies mentioned corruption problems, so I decided
not to do it.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021535.html

I think a good idea would be to use the space needed to store the hot DB
and use the rest as a cache (at least a read cache).

I don't really know a lot about this topic, but I think that maybe giving
50GB of a really expensive SSD is pointless if it's only using 10GB.

On 02/11/2017 at 21:45, Martin Overgaard Hansen wrote:

> Hi, it seems like I’m in the same boat as everyone else in
> this particular thread.
>
> I’m also unable to find any guidelines or recommendations regarding
> sizing of the wal and / or db.
>
> I want to bring this subject back in the light and hope someone can
> provide insight regarding the issue, thanks.  
>
> Best Regards,
> Martin Overgaard Hansen
>
> MultiHouse IT Partner A/S
>
>
>

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Systems area intern (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A




Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-03 Thread Wido den Hollander

> On 3 November 2017 at 0:09, Nigel Williams wrote:
> 
> 
> On 3 November 2017 at 07:45, Martin Overgaard Hansen wrote:
> > I want to bring this subject back in the light and hope someone can provide
> > insight regarding the issue, thanks.
> 
> Thanks Martin, I was going to do the same.
> 
> Is it possible to make the DB partition (on the fastest device) too
> big? In other words, is there a point where, for a given set of OSDs
> (number + size), the DB partition is sized too large and is wasting
> resources? I recall a comment by someone proposing to split up a
> single large (fast) SSD into 100GB partitions for each OSD.
> 

It depends on the size of your backing disk. The DB will grow with the number of 
objects you have on your OSD.

A 4TB drive will hold more objects than a 1TB drive (usually); the same goes for a 
10TB vs a 6TB drive.

From what I've seen so far there is no such thing as a 'too big' DB.

The tests I've done so far suggest that filling up a 50GB DB is rather 
hard to do, unless you have billions of objects, and thus tens of millions of 
objects per OSD.

Let's say the avg overhead is 16k; you would then need a 150GB DB for 10M objects.

You could look into your current numbers and check how many objects you have 
per OSD.

I checked a couple of Ceph clusters I run and see about 1M objects per OSD, but 
others only have around 250k objects per OSD.

In all those cases, even with 32k per object, you would need a 30GB DB for 1M objects in 
that OSD.
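
The arithmetic behind those two estimates, for anyone who wants to plug in
their own overhead numbers:

echo $(( 10000000 * 16 / 1024 / 1024 ))  # 10M objects * 16k ~= 152GB -> "150GB DB"
echo $((  1000000 * 32 / 1024 / 1024 ))  # 1M objects * 32k ~= 30GB -> "30GB DB"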

> The answer could be couched as some intersection of pool type (RBD /
> RADOS / CephFS), object change (update) intensity, size of OSD, etc.,
> plus a rule of thumb.
> 

I would check your running Ceph clusters and calculate the amount of objects 
per OSD.

total objects * 3 / num OSDs (for 3x replication)

Wido

> An idea occurred to me that by monitoring for the logged spill message
> (the event when the DB partition spills/overflows to the OSD), OSDs
> could be (lazily) destroyed and recreated with a new DB partition
> increased in size, say by 10%, each time.


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-02 Thread Nigel Williams
On 3 November 2017 at 07:45, Martin Overgaard Hansen  wrote:
> I want to bring this subject back in the light and hope someone can provide
> insight regarding the issue, thanks.

Thanks Martin, I was going to do the same.

Is it possible to make the DB partition (on the fastest device) too
big? In other words, is there a point where, for a given set of OSDs
(number + size), the DB partition is sized too large and is wasting
resources? I recall a comment by someone proposing to split up a
single large (fast) SSD into 100GB partitions for each OSD.

The answer could be couched as some intersection of pool type (RBD /
RADOS / CephFS), object change (update) intensity, size of OSD, etc.,
plus a rule of thumb.

An idea occurred to me that by monitoring for the logged spill message
(the event when the DB partition spills/overflows to the OSD), OSDs
could be (lazily) destroyed and recreated with a new DB partition
increased in size, say by 10%, each time.
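
There is also a counter-based way to catch spillover without scraping logs;
a sketch, assuming the Luminous BlueFS perf counters:

# a nonzero value means the DB has spilled onto the slow (block) device
ceph daemon osd.0 perf dump | jq '.bluefs.slow_used_bytes'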


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-11-02 Thread Martin Overgaard Hansen
Hi, it seems like I’m in the same boat as everyone else in this particular 
thread.

I’m also unable to find any guidelines or recommendations regarding sizing of 
the wal and / or db.

I want to bring this subject back in the light and hope someone can provide 
insight regarding the issue, thanks.

Best Regards,
Martin Overgaard Hansen
MultiHouse IT Partner A/S


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-18 Thread Marco Baldini - H.S. Amiata

Hi

I'm about to change some SATA SSD disks to NVMe disks, and for Ceph I too 
would like to know how to assign space. I have three 1TB SATA OSDs, so I'll 
split the NVMe disks into 3 partitions of equal size. I'm not going to 
assign a separate WAL partition because, if the docs are right, the WAL 
is automatically put on the fastest device.


What I can't find is some indication of how much space WAL and blocks.db 
are using, so I could tune them better.
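
One way to get at those numbers is via the BlueFS counters on the admin
socket; a sketch assuming the Luminous counter names (wal_used_bytes is
only meaningful when a separate block.wal exists):

ceph daemon osd.1 perf dump | jq '{db_used: .bluefs.db_used_bytes,
  wal_used: .bluefs.wal_used_bytes, slow_used: .bluefs.slow_used_bytes}'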



On 18/10/2017 at 08:29, Wido den Hollander wrote:

Thanks for the feedback. Indeed, we have to be cautious in this case. So 
6kB/object feels low to you, so it probably is.

I'm testing with a 1GB WAL/50GB DB on an SSD with a 4TB disk, which seems to hold up fine. 
It's not that space is a true issue, but "use as much as available" doesn't say 
much to people.

If I have a 1TB NVMe for 10 disks, should I give 100GB of DB to each OSD? It's 
those things people want to know. So we need numbers to figure these things out.

Wido


--
*Marco Baldini*
*H.S. Amiata Srl*
Office: 0577-779396
Mobile: 335-8765169
Web: www.hsamiata.it
Email: mbald...@hsamiata.it



Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-18 Thread Wido den Hollander

> On 17 October 2017 at 14:21, Mark Nelson wrote:
> 
> 
> 
> 
> On 10/17/2017 01:54 AM, Wido den Hollander wrote:
> >
> >> On 16 October 2017 at 18:14, Richard Hesketh wrote:
> >>
> >>
> >> On 16/10/17 13:45, Wido den Hollander wrote:
>  On 26 September 2017 at 16:39, Mark Nelson wrote:
>  On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> > thanks David,
> >
> > that's confirming what I was assuming. Too bad that there is no
> > estimate/method to calculate the db partition size.
> 
>  It's possible that we might be able to get ranges for certain kinds of
>  scenarios.  Maybe if you do lots of small random writes on RBD, you can
>  expect a typical metadata size of X per object.  Or maybe if you do lots
>  of large sequential object writes in RGW, it's more like Y.  I think
>  it's probably going to be tough to make it accurate for everyone though.
> >>>
> >>> So I did a quick test. I wrote 75,000 objects to a BlueStore device:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> >>> 75085
> >>> root@alpha:~#
> >>>
> >>> I then saw the RocksDB database was 450MB in size:
> >>>
> >>> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> >>> 459276288
> >>> root@alpha:~#
> >>>
> >>> 459276288 / 75085 = 6116
> >>>
> >>> So about 6kb of RocksDB data per object.
> >>>
> >>> Let's say I want to store 1M objects in a single OSD I would need ~6GB of 
> >>> DB space.
> >>>
> >>> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> >>>
> >>> There aren't many of these numbers out there for BlueStore right now so 
> >>> I'm trying to gather some numbers.
> >>>
> >>> Wido
> >>
> >> If I check for the same stats on OSDs in my production cluster I see 
> >> similar but variable values:
> >>
> >> root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per 
> >> object: " ; expr `ceph daemon osd.$i perf dump | jq 
> >> '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf dump | jq 
> >> '.bluestore.bluestore_onodes'` ; done
> >> osd.0 db per object: 7490
> >> osd.1 db per object: 7523
> >> osd.2 db per object: 7378
> >> osd.3 db per object: 7447
> >> osd.4 db per object: 7233
> >> osd.5 db per object: 7393
> >> osd.6 db per object: 7074
> >> osd.7 db per object: 7967
> >> osd.8 db per object: 7253
> >> osd.9 db per object: 7680
> >>
> >> root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.10 db per object: 5168
> >> osd.11 db per object: 5291
> >> osd.12 db per object: 5476
> >> osd.13 db per object: 4978
> >> osd.14 db per object: 5252
> >> osd.15 db per object: 5461
> >> osd.16 db per object: 5135
> >> osd.17 db per object: 5126
> >> osd.18 db per object: 9336
> >> osd.19 db per object: 4986
> >>
> >> root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.20 db per object: 5115
> >> osd.21 db per object: 4844
> >> osd.22 db per object: 5063
> >> osd.23 db per object: 5486
> >> osd.24 db per object: 5228
> >> osd.25 db per object: 4966
> >> osd.26 db per object: 5047
> >> osd.27 db per object: 5021
> >> osd.28 db per object: 5321
> >> osd.29 db per object: 5150
> >>
> >> root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.30 db per object: 6658
> >> osd.31 db per object: 6445
> >> osd.32 db per object: 6259
> >> osd.33 db per object: 6691
> >> osd.34 db per object: 6513
> >> osd.35 db per object: 6628
> >> osd.36 db per object: 6779
> >> osd.37 db per object: 6819
> >> osd.38 db per object: 6677
> >> osd.39 db per object: 6689
> >>
> >> root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; 
> >> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> >> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> >> osd.40 db per object: 5335
> >> osd.41 db per object: 5203
> >> osd.42 db per object: 5552
> >> osd.43 db per object: 5188
> >> osd.44 db per object: 5218
> >> osd.45 db per object: 5157
> >> osd.46 db per object: 4956
> >> osd.47 db per object: 5370
> >> osd.48 db per object: 5117
> >> osd.49 db per object: 5313
> >>
> >> I'm not sure why so much variance (these nodes are basically identical) 
> >> and I think that the db_used_bytes includes the WAL at least in my case, 
> >> as I don't have a separate WAL device. I'm not sure how big the WAL is 
> >> relative to metadata and hence how much this might be thrown 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Mark Nelson



On 10/17/2017 01:54 AM, Wido den Hollander wrote:



On 16 October 2017 at 18:14, Richard Hesketh wrote:


On 16/10/17 13:45, Wido den Hollander wrote:

On 26 September 2017 at 16:39, Mark Nelson wrote:
On 09/26/2017 01:10 AM, Dietmar Rieder wrote:

thanks David,

that's confirming what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.


It's possible that we might be able to get ranges for certain kinds of
scenarios.  Maybe if you do lots of small random writes on RBD, you can
expect a typical metadata size of X per object.  Or maybe if you do lots
of large sequential object writes in RGW, it's more like Y.  I think
it's probably going to be tough to make it accurate for everyone though.


So I did a quick test. I wrote 75,000 objects to a BlueStore device:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~#

I then saw the RocksDB database was 450MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 = 6116

So about 6kb of RocksDB data per object.

Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
space.

Is this a safe assumption? Do you think that 6kb is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now so I'm 
trying to gather some numbers.

Wido


If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 db per object: 9336
osd.19 db per object: 4986

root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.20 db per object: 5115
osd.21 db per object: 4844
osd.22 db per object: 5063
osd.23 db per object: 5486
osd.24 db per object: 5228
osd.25 db per object: 4966
osd.26 db per object: 5047
osd.27 db per object: 5021
osd.28 db per object: 5321
osd.29 db per object: 5150

root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.30 db per object: 6658
osd.31 db per object: 6445
osd.32 db per object: 6259
osd.33 db per object: 6691
osd.34 db per object: 6513
osd.35 db per object: 6628
osd.36 db per object: 6779
osd.37 db per object: 6819
osd.38 db per object: 6677
osd.39 db per object: 6689

root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.40 db per object: 5335
osd.41 db per object: 5203
osd.42 db per object: 5552
osd.43 db per object: 5188
osd.44 db per object: 5218
osd.45 db per object: 5157
osd.46 db per object: 4956
osd.47 db per object: 5370
osd.48 db per object: 5117
osd.49 db per object: 5313

I'm not sure why so much variance (these nodes are basically identical) and I 
think that the db_used_bytes includes the WAL at least in my case, as I don't 
have a separate WAL device. I'm not sure how big the WAL is relative to 
metadata and hence how much this might be thrown off, but ~6kb/object seems 
like a reasonable value to take for back-of-envelope calculating.



Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are 
welcome in this case.

Some input from a BlueStore dev might be helpful as well to see we are not 
drawing the wrong conclusions here.

Wido


I would be very careful about drawing too many conclusions given a 
single snapshot in time, especially if there haven't been a lot of 
partial object rewrites yet.  Just on the surface, 6KB/object feels low 
(especially if you they are moderately large objects), but perhaps if 
they've never been rewritten this is a reasonable lower bound.  This is 
important because things like 4MB RBD objects that 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Marco Baldini - H.S. Amiata

Hello

Here my results

On this node, I have 3 OSDs (1TB HDD); osd.1 and osd.2 have blocks.db in 
SSD partitions of 90GB each, and osd.8 has no separate blocks.db.


pve-hs-main[0]:~$ for i in {1,2,8} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.1 db per object: 20872
osd.2 db per object: 20416
osd.8 db per object: 16888


On this node, I have 3 OSDs (1TB HDD), each with a 60GB blocks.db on a 
separate SSD.


pve-hs-2[0]:/$ for i in {3..5} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.3 db per object: 19053
osd.4 db per object: 18742
osd.5 db per object: 14979


On this node I have 3 OSDs (1TB HDD) with no separate SSD.

pve-hs-3[0]:~$ for i in {0,6,7} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 27392
osd.6 db per object: 54065
osd.7 db per object: 69986


My ceph df and rados df, if they can be useful

pve-hs-3[0]:~$ ceph df detail
GLOBAL:
    SIZE   AVAIL  RAW USED  %RAW USED  OBJECTS
    8742G  6628G  2114G     24.19      187k
POOLS:
    NAME        ID  QUOTA OBJECTS  QUOTA BYTES  USED    %USED  MAX AVAIL  OBJECTS  DIRTY  READ   WRITE  RAW USED
    cephbackup  9   N/A            N/A          469G    7.38   2945G      120794   117k   759k   2899k  938G
    cephwin     13  N/A            N/A          73788M  1.21   1963G      18711    18711  1337k  1637k  216G
    cephnix     14  N/A            N/A          201G    3.31   1963G      52407    52407  791k   1781k  605G
pve-hs-3[0]:~$ rados df detail
POOL_NAME   USED    OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS   RD      WR_OPS   WR
cephbackup  469G    120794   0       241588  0                   0        0         777872   7286M   2968926  718G
cephnix     201G    52407    0       157221  0                   0        0         810317   67057M  1824184  242G
cephwin     73788M  18711    0       56133   0                   0        0         1369792  155G    1677060  136G

total_objects    191912
total_used       2114G
total_avail      6628G
total_space      8742G


Can someone see a pattern?



On 17/10/2017 at 08:54, Wido den Hollander wrote:

On 16 October 2017 at 18:14, Richard Hesketh wrote:


On 16/10/17 13:45, Wido den Hollander wrote:

On 26 September 2017 at 16:39, Mark Nelson wrote:
On 09/26/2017 01:10 AM, Dietmar Rieder wrote:

thanks David,

that's confirming what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.

It's possible that we might be able to get ranges for certain kinds of
scenarios.  Maybe if you do lots of small random writes on RBD, you can
expect a typical metadata size of X per object.  Or maybe if you do lots
of large sequential object writes in RGW, it's more like Y.  I think
it's probably going to be tough to make it accurate for everyone though.

So I did a quick test. I wrote 75,000 objects to a BlueStore device:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~#

I then saw the RocksDB database was 450MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 = 6116

So about 6kb of RocksDB data per object.

Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
space.

Is this a safe assumption? Do you think that 6kb is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now so I'm 
trying to gather some numbers.

Wido

If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon osd.$i perf 
dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-17 Thread Wido den Hollander

> On 16 October 2017 at 18:14, Richard Hesketh wrote:
> 
> 
> On 16/10/17 13:45, Wido den Hollander wrote:
> >> On 26 September 2017 at 16:39, Mark Nelson wrote:
> >> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> >>> thanks David,
> >>>
> >>> that's confirming what I was assuming. Too bad that there is no
> >>> estimate/method to calculate the db partition size.
> >>
> >> It's possible that we might be able to get ranges for certain kinds of 
> >> scenarios.  Maybe if you do lots of small random writes on RBD, you can 
> >> expect a typical metadata size of X per object.  Or maybe if you do lots 
> >> of large sequential object writes in RGW, it's more like Y.  I think 
> >> it's probably going to be tough to make it accurate for everyone though.
> > 
> > So I did a quick test. I wrote 75,000 objects to a BlueStore device:
> > 
> > root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> > 75085
> > root@alpha:~# 
> > 
> > I then saw the RocksDB database was 450MB in size:
> > 
> > root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> > 459276288
> > root@alpha:~#
> > 
> > 459276288 / 75085 = 6116
> > 
> > So about 6kb of RocksDB data per object.
> > 
> > Let's say I want to store 1M objects in a single OSD I would need ~6GB of 
> > DB space.
> > 
> > Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> > 
> > There aren't many of these numbers out there for BlueStore right now so I'm 
> > trying to gather some numbers.
> > 
> > Wido
> 
> If I check for the same stats on OSDs in my production cluster I see similar 
> but variable values:
> 
> root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per 
> object: " ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` 
> / `ceph daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.0 db per object: 7490
> osd.1 db per object: 7523
> osd.2 db per object: 7378
> osd.3 db per object: 7447
> osd.4 db per object: 7233
> osd.5 db per object: 7393
> osd.6 db per object: 7074
> osd.7 db per object: 7967
> osd.8 db per object: 7253
> osd.9 db per object: 7680
> 
> root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; 
> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.10 db per object: 5168
> osd.11 db per object: 5291
> osd.12 db per object: 5476
> osd.13 db per object: 4978
> osd.14 db per object: 5252
> osd.15 db per object: 5461
> osd.16 db per object: 5135
> osd.17 db per object: 5126
> osd.18 db per object: 9336
> osd.19 db per object: 4986
> 
> root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; 
> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.20 db per object: 5115
> osd.21 db per object: 4844
> osd.22 db per object: 5063
> osd.23 db per object: 5486
> osd.24 db per object: 5228
> osd.25 db per object: 4966
> osd.26 db per object: 5047
> osd.27 db per object: 5021
> osd.28 db per object: 5321
> osd.29 db per object: 5150
> 
> root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; 
> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.30 db per object: 6658
> osd.31 db per object: 6445
> osd.32 db per object: 6259
> osd.33 db per object: 6691
> osd.34 db per object: 6513
> osd.35 db per object: 6628
> osd.36 db per object: 6779
> osd.37 db per object: 6819
> osd.38 db per object: 6677
> osd.39 db per object: 6689
> 
> root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; 
> expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
> daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
> osd.40 db per object: 5335
> osd.41 db per object: 5203
> osd.42 db per object: 5552
> osd.43 db per object: 5188
> osd.44 db per object: 5218
> osd.45 db per object: 5157
> osd.46 db per object: 4956
> osd.47 db per object: 5370
> osd.48 db per object: 5117
> osd.49 db per object: 5313
> 
> I'm not sure why so much variance (these nodes are basically identical) and I 
> think that the db_used_bytes includes the WAL at least in my case, as I don't 
> have a separate WAL device. I'm not sure how big the WAL is relative to 
> metadata and hence how much this might be thrown off, but ~6kb/object seems 
> like a reasonable value to take for back-of-envelope calculating.
> 

Yes, judging from your numbers 6kb/object seems reasonable. More datapoints are 
welcome in this case.

Some input from a BlueStore dev might be helpful as well to see we are not 
drawing the wrong conclusions here.

Wido

> [bonus hilarity]
> On my all-in-one-SSD OSDs, because bluestore reports them entirely as db 
> space, I get results like:
> 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-16 Thread Richard Hesketh
On 16/10/17 13:45, Wido den Hollander wrote:
>> On 26 September 2017 at 16:39, Mark Nelson wrote:
>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
>>> thanks David,
>>>
>>> that's confirming what I was assuming. Too bad that there is no
>>> estimate/method to calculate the db partition size.
>>
>> It's possible that we might be able to get ranges for certain kinds of 
>> scenarios.  Maybe if you do lots of small random writes on RBD, you can 
>> expect a typical metadata size of X per object.  Or maybe if you do lots 
>> of large sequential object writes in RGW, it's more like Y.  I think 
>> it's probably going to be tough to make it accurate for everyone though.
> 
> So I did a quick test. I wrote 75,000 objects to a BlueStore device:
> 
> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> 75085
> root@alpha:~# 
> 
> I then saw the RocksDB database was 450MB in size:
> 
> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> 459276288
> root@alpha:~#
> 
> 459276288 / 75085 = 6116
> 
> So about 6kb of RocksDB data per object.
> 
> Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
> space.
> 
> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> 
> There aren't many of these numbers out there for BlueStore right now so I'm 
> trying to gather some numbers.
> 
> Wido

If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 db per object: 9336
osd.19 db per object: 4986

root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.20 db per object: 5115
osd.21 db per object: 4844
osd.22 db per object: 5063
osd.23 db per object: 5486
osd.24 db per object: 5228
osd.25 db per object: 4966
osd.26 db per object: 5047
osd.27 db per object: 5021
osd.28 db per object: 5321
osd.29 db per object: 5150

root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.30 db per object: 6658
osd.31 db per object: 6445
osd.32 db per object: 6259
osd.33 db per object: 6691
osd.34 db per object: 6513
osd.35 db per object: 6628
osd.36 db per object: 6779
osd.37 db per object: 6819
osd.38 db per object: 6677
osd.39 db per object: 6689

root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.40 db per object: 5335
osd.41 db per object: 5203
osd.42 db per object: 5552
osd.43 db per object: 5188
osd.44 db per object: 5218
osd.45 db per object: 5157
osd.46 db per object: 4956
osd.47 db per object: 5370
osd.48 db per object: 5117
osd.49 db per object: 5313

I'm not sure why so much variance (these nodes are basically identical) and I 
think that the db_used_bytes includes the WAL at least in my case, as I don't 
have a separate WAL device. I'm not sure how big the WAL is relative to 
metadata and hence how much this might be thrown off, but ~6kb/object seems 
like a reasonable value to take for back-of-envelope calculating.

[bonus hilarity]
On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, 
I get results like:

root@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.60 db per object: 80273
osd.61 db per object: 68859
osd.62 db per object: 45560
osd.63 db per object: 38209
osd.64 db per object: 48258
osd.65 db per object: 50525

Rich




Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-16 Thread Wido den Hollander

> On 26 September 2017 at 16:39, Mark Nelson wrote:
> 
> 
> 
> 
> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> > thanks David,
> >
> > that's confirming what I was assuming. Too bad that there is no
> > estimate/method to calculate the db partition size.
> 
> It's possible that we might be able to get ranges for certain kinds of 
> scenarios.  Maybe if you do lots of small random writes on RBD, you can 
> expect a typical metadata size of X per object.  Or maybe if you do lots 
> of large sequential object writes in RGW, it's more like Y.  I think 
> it's probably going to be tough to make it accurate for everyone though.
> 

So I did a quick test. I wrote 75,000 objects to a BlueStore device:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~# 

I then saw the RocksDB database was 450MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 = 6116

So that's about 6 KB of RocksDB data per object.

Say I want to store 1M objects on a single OSD: I would then need ~6GB of DB
space.
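As a back-of-envelope sketch of that arithmetic (6116 bytes per object is just
my measurement above, not a general constant):

# DB space needed for n objects at b bytes of RocksDB data per object
awk -v n=1000000 -v b=6116 'BEGIN { printf "%.1f GB\n", n * b / 1e9 }'  # 6.1 GB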

Is this a safe assumption? Do you think 6 KB per object is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now, so I'm
trying to gather some.

Wido

> Mark
> 
> >
> > Dietmar
> >
> > On 09/25/2017 05:10 PM, David Turner wrote:
> >> db/wal partitions are per OSD.  DB partitions need to be made as big as
> >> you need them.  If they run out of space, they will fall back to the
> >> block device.  If the DB and block are on the same device, then there's
> >> no reason to partition them and figure out the best size.  If they are
> >> on separate devices, then you need to make it as big as you need to to
> >> ensure that it won't spill over (or if it does that you're ok with the
> >> degraded performance while the db partition is full).  I haven't come
> >> across an equation to judge what size should be used for either
> >> partition yet.
> >>
> >> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> >> > wrote:
> >>
> >> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> >> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> >> Hi,
> >> >>
> >> >> To my understand, the bluestore write workflow is
> >> >>
> >> >> For normal big write
> >> >> 1. Write data to block
> >> >> 2. Update metadata to rocksdb
> >> >> 3. Rocksdb write to memory and block.wal
> >> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >> >>
> >> >> For overwrite and small write
> >> >> 1. Write data and metadata to rocksdb
> >> >> 2. Apply the data to block
> >> >>
> >> >> Seems we don’t have a formula or suggestion to the size of block.db.
> >> >> It depends on the object size and number of objects in your pool. 
> >> You
> >> >> can just give big partition to block.db to ensure all the database
> >> >> files are on that fast partition. If block.db full, it will use 
> >> block
> >> >> to put db files, however, this will slow down the db performance. So
> >> >> give db size as much as you can.
> >> >
> >> > This is basically correct.  What's more, it's not just the object
> >> size,
> >> > but the number of extents, checksums, RGW bucket indices, and
> >> > potentially other random stuff.  I'm skeptical how well we can
> >> estimate
> >> > all of this in the long run.  I wonder if we would be better served 
> >> by
> >> > just focusing on making it easy to understand how the DB device is
> >> being
> >> > used, how much is spilling over to the block device, and make it
> >> easy to
> >> > upgrade to a new device once it gets full.
> >> >
> >> >>
> >> >> If you want to put wal and db on same ssd, you don’t need to create
> >> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> >> you need block.wal is that you want to separate wal to another disk.
> >> >
> >> > I always make explicit partitions, but only because I (potentially
> >> > illogically) like it that way.  There may actually be some benefits 
> >> to
> >> > using a single partition for both if sharing a single device.
> >>
> >> is this "Single db/wal partition" then to be used for all OSDs on a 
> >> node
> >> or do you need to create a seperate "Single  db/wal partition" for each
> >> OSD  on the node?
> >>
> >> >
> >> >>
> >> >> I’m also studying bluestore, this is what I know so far. Any
> >> >> correction is welcomed.
> >> >>
> >> >> Thanks
> >> >>
> >> >>
> >> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >> >>>  >> > wrote:
> >> >>>
> >> >>> I asked the same question a couple of weeks ago. No response I got
> >> >>> contradicted the 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-26 Thread Mark Nelson



On 09/26/2017 01:10 AM, Dietmar Rieder wrote:

thanks David,

that's confirming what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.


It's possible that we might be able to get ranges for certain kinds of 
scenarios.  Maybe if you do lots of small random writes on RBD, you can 
expect a typical metadata size of X per object.  Or maybe if you do lots 
of large sequential object writes in RGW, it's more like Y.  I think 
it's probably going to be tough to make it accurate for everyone though.


Mark



Dietmar

On 09/25/2017 05:10 PM, David Turner wrote:

db/wal partitions are per OSD.  DB partitions need to be made as big as
you need them.  If they run out of space, they will fall back to the
block device.  If the DB and block are on the same device, then there's
no reason to partition them and figure out the best size.  If they are
on separate devices, then you need to make it as big as you need to
ensure that it won't spill over (or if it does that you're ok with the
degraded performance while the db partition is full).  I haven't come
across an equation to judge what size should be used for either
partition yet.

On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> wrote:

On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understand, the bluestore write workflow is
>>
>> For normal big write
>> 1. Write data to block
>> 2. Update metadata to rocksdb
>> 3. Rocksdb write to memory and block.wal
>> 4. Once reach threshold, flush entries in block.wal to block.db
>>
>> For overwrite and small write
>> 1. Write data and metadata to rocksdb
>> 2. Apply the data to block
>>
>> Seems we don’t have a formula or suggestion to the size of block.db.
>> It depends on the object size and number of objects in your pool. You
>> can just give big partition to block.db to ensure all the database
>> files are on that fast partition. If block.db full, it will use block
>> to put db files, however, this will slow down the db performance. So
>> give db size as much as you can.
>
> This is basically correct.  What's more, it's not just the object
size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff.  I'm skeptical how well we can
estimate
> all of this in the long run.  I wonder if we would be better served by
> just focusing on making it easy to understand how the DB device is
being
> used, how much is spilling over to the block device, and make it
easy to
> upgrade to a new device once it gets full.
>
>>
>> If you want to put wal and db on same ssd, you don’t need to create
>> block.wal. It will implicitly use block.db to put wal. The only case
>> you need block.wal is that you want to separate wal to another disk.
>
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way.  There may actually be some benefits to
> using a single partition for both if sharing a single device.

is this "Single db/wal partition" then to be used for all OSDs on a node
or do you need to create a separate "Single db/wal partition" for each
OSD on the node?

>
>>
>> I’m also studying bluestore, this is what I know so far. Any
>> correction is welcomed.
>>
>> Thanks
>>
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>> > wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation but nobody actively confirmed the
>>> documentation was correct on this subject, either; my end state was
>>> that I was relatively confident I wasn't making some horrible
mistake
>>> by simply specifying a big DB partition and letting bluestore work
>>> itself out (in my case, I've just got HDDs and SSDs that were
>>> journals under filestore), but I could not be sure there wasn't some
>>> sort of performance tuning I was missing out on by not specifying
>>> them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
 Some of this thread seems to contradict the documentation and
confuses
 me.  Is the statement below correct?

 "The BlueStore journal will always be placed on the fastest device
 available, so using a DB device will provide the same benefit
that the
 WAL device would while also allowing additional metadata to be
stored
 there (if it will fix)."



http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices


  it seems to be saying that 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-26 Thread Dietmar Rieder
thanks David,

that's confirming what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.

Dietmar

On 09/25/2017 05:10 PM, David Turner wrote:
> db/wal partitions are per OSD.  DB partitions need to be made as big as
> you need them.  If they run out of space, they will fall back to the
> block device.  If the DB and block are on the same device, then there's
> no reason to partition them and figure out the best size.  If they are
> on separate devices, then you need to make it as big as you need to to
> ensure that it won't spill over (or if it does that you're ok with the
> degraded performance while the db partition is full).  I haven't come
> across an equation to judge what size should be used for either
> partition yet.
> 
> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> > wrote:
> 
> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> Hi,
> >>
> >> To my understand, the bluestore write workflow is
> >>
> >> For normal big write
> >> 1. Write data to block
> >> 2. Update metadata to rocksdb
> >> 3. Rocksdb write to memory and block.wal
> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >>
> >> For overwrite and small write
> >> 1. Write data and metadata to rocksdb
> >> 2. Apply the data to block
> >>
> >> Seems we don’t have a formula or suggestion to the size of block.db.
> >> It depends on the object size and number of objects in your pool. You
> >> can just give big partition to block.db to ensure all the database
> >> files are on that fast partition. If block.db full, it will use block
> >> to put db files, however, this will slow down the db performance. So
> >> give db size as much as you can.
> >
> > This is basically correct.  What's more, it's not just the object
> size,
> > but the number of extents, checksums, RGW bucket indices, and
> > potentially other random stuff.  I'm skeptical how well we can
> estimate
> > all of this in the long run.  I wonder if we would be better served by
> > just focusing on making it easy to understand how the DB device is
> being
> > used, how much is spilling over to the block device, and make it
> easy to
> > upgrade to a new device once it gets full.
> >
> >>
> >> If you want to put wal and db on same ssd, you don’t need to create
> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> you need block.wal is that you want to separate wal to another disk.
> >
> > I always make explicit partitions, but only because I (potentially
> > illogically) like it that way.  There may actually be some benefits to
> > using a single partition for both if sharing a single device.
> 
> is this "Single db/wal partition" then to be used for all OSDs on a node
> or do you need to create a seperate "Single  db/wal partition" for each
> OSD  on the node?
> 
> >
> >>
> >> I’m also studying bluestore, this is what I know so far. Any
> >> correction is welcomed.
> >>
> >> Thanks
> >>
> >>
> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>>  > wrote:
> >>>
> >>> I asked the same question a couple of weeks ago. No response I got
> >>> contradicted the documentation but nobody actively confirmed the
> >>> documentation was correct on this subject, either; my end state was
> >>> that I was relatively confident I wasn't making some horrible
> mistake
> >>> by simply specifying a big DB partition and letting bluestore work
> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>> journals under filestore), but I could not be sure there wasn't some
> >>> sort of performance tuning I was missing out on by not specifying
> >>> them separately.
> >>>
> >>> Rich
> >>>
> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>  Some of this thread seems to contradict the documentation and
> confuses
>  me.  Is the statement below correct?
> 
>  "The BlueStore journal will always be placed on the fastest device
>  available, so using a DB device will provide the same benefit
> that the
>  WAL device would while also allowing additional metadata to be
> stored
>  there (if it will fix)."
> 
> 
> 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> 
> 
>   it seems to be saying that there's no reason to create
> separate WAL
>  and DB partitions if they are on the same device.  Specifying one
>  large DB partition per OSD will cover both uses.
> 
>  thanks,
> 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Sage Weil
On Tue, 26 Sep 2017, Nigel Williams wrote:
> On 26 September 2017 at 08:11, Mark Nelson  wrote:
> > The WAL should never grow larger than the size of the buffers you've
> > specified.  It's the DB that can grow and is difficult to estimate both
> > because different workloads will cause different numbers of extents and
> > objects, but also because rocksdb itself causes a certain amount of
> > space-amplification due to a variety of factors.
> 
> Ok, I was confused whether both types could spill. within Bluestore it
> simply blocks if the WAL hits 100%?

It never blocks; it will always just spill over onto the next fastest 
device (wal -> db -> main).  Note that there is no value to a db partition 
if it is on the same device as the main partition.

> Would a drastic (quick) action to correct a too-small-DB-partition
> (impacting performance) is to destroy the OSD and rebuild it with a
> larger DB partition?

That's the easiest!
sage


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Nigel Williams
On 26 September 2017 at 08:11, Mark Nelson  wrote:
> The WAL should never grow larger than the size of the buffers you've
> specified.  It's the DB that can grow and is difficult to estimate both
> because different workloads will cause different numbers of extents and
> objects, but also because rocksdb itself causes a certain amount of
> space-amplification due to a variety of factors.

Ok, I was confused about whether both types could spill. Within BlueStore,
does it simply block if the WAL hits 100%?

Would a drastic (quick) way to correct a too-small DB partition
(impacting performance) be to destroy the OSD and rebuild it with a
larger DB partition?


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson



On 09/25/2017 05:02 PM, Nigel Williams wrote:

On 26 September 2017 at 01:10, David Turner  wrote:

If they are on separate
devices, then you need to make it as big as you need to ensure that it
won't spill over (or if it does that you're ok with the degraded performance
while the db partition is full).  I haven't come across an equation to judge
what size should be used for either partition yet.


Is it the case that only the WAL will spill if there is a backlog
clearing entries into the DB partition? So the WAL's fill-mark
oscillates, but the DB is going to grow steadily (depending on the
previously mentioned factors of "...extents, checksums, RGW bucket
indices, and potentially other random stuff").


The WAL should never grow larger than the size of the buffers you've 
specified.  It's the DB that can grow and is difficult to estimate both 
because different workloads will cause different numbers of extents and 
objects, but also because rocksdb itself causes a certain amount of 
space-amplification due to a variety of factors.




Is there an indicator that can be monitored to show that a spill is occurring?


I think there's a message in the logs, but beyond that I don't remember 
if we added any kind of indication in the user tools.  At one point I 
think I remember Sage mentioning he wanted to add something to ceph df.
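In the meantime, a crude check is possible from the admin socket, assuming the
bluefs slow-device counters are present in your build (I'm not certain every
release has them):

# nonzero slow_used_bytes suggests BlueFS has spilled DB data onto the
# slow/main device
ceph daemon osd.0 perf dump | jq '.bluefs.slow_used_bytes'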





Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Nigel Williams
On 26 September 2017 at 01:10, David Turner  wrote:
> If they are on separate
> devices, then you need to make it as big as you need to to ensure that it
> won't spill over (or if it does that you're ok with the degraded performance
> while the db partition is full).  I haven't come across an equation to judge
> what size should be used for either partition yet.

Is it the case that only the WAL will spill if there is a backlog
clearing entries into the DB partition? So the WAL's fill-mark
oscillates, but the DB is going to grow steadily (depending on the
previously mentioned factors of "...extents, checksums, RGW bucket
indices, and potentially other random stuff").

Is there an indicator that can be monitored to show that a spill is occurring?


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread David Turner
db/wal partitions are per OSD.  DB partitions need to be made as big as you
need them.  If they run out of space, they will fall back to the block
device.  If the DB and block are on the same device, then there's no reason
to partition them and figure out the best size.  If they are on separate
devices, then you need to make it as big as you need to ensure that it
won't spill over (or if it does that you're ok with the degraded
performance while the db partition is full).  I haven't come across an
equation to judge what size should be used for either partition yet.

On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder 
wrote:

> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> Hi,
> >>
> >> To my understand, the bluestore write workflow is
> >>
> >> For normal big write
> >> 1. Write data to block
> >> 2. Update metadata to rocksdb
> >> 3. Rocksdb write to memory and block.wal
> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >>
> >> For overwrite and small write
> >> 1. Write data and metadata to rocksdb
> >> 2. Apply the data to block
> >>
> >> Seems we don’t have a formula or suggestion to the size of block.db.
> >> It depends on the object size and number of objects in your pool. You
> >> can just give big partition to block.db to ensure all the database
> >> files are on that fast partition. If block.db full, it will use block
> >> to put db files, however, this will slow down the db performance. So
> >> give db size as much as you can.
> >
> > This is basically correct.  What's more, it's not just the object size,
> > but the number of extents, checksums, RGW bucket indices, and
> > potentially other random stuff.  I'm skeptical how well we can estimate
> > all of this in the long run.  I wonder if we would be better served by
> > just focusing on making it easy to understand how the DB device is being
> > used, how much is spilling over to the block device, and make it easy to
> > upgrade to a new device once it gets full.
> >
> >>
> >> If you want to put wal and db on same ssd, you don’t need to create
> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> you need block.wal is that you want to separate wal to another disk.
> >
> > I always make explicit partitions, but only because I (potentially
> > illogically) like it that way.  There may actually be some benefits to
> > using a single partition for both if sharing a single device.
>
> is this "Single db/wal partition" then to be used for all OSDs on a node
> or do you need to create a seperate "Single  db/wal partition" for each
> OSD  on the node?
>
> >
> >>
> >> I’m also studying bluestore, this is what I know so far. Any
> >> correction is welcomed.
> >>
> >> Thanks
> >>
> >>
> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>>  wrote:
> >>>
> >>> I asked the same question a couple of weeks ago. No response I got
> >>> contradicted the documentation but nobody actively confirmed the
> >>> documentation was correct on this subject, either; my end state was
> >>> that I was relatively confident I wasn't making some horrible mistake
> >>> by simply specifying a big DB partition and letting bluestore work
> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>> journals under filestore), but I could not be sure there wasn't some
> >>> sort of performance tuning I was missing out on by not specifying
> >>> them separately.
> >>>
> >>> Rich
> >>>
> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>  Some of this thread seems to contradict the documentation and confuses
>  me.  Is the statement below correct?
> 
>  "The BlueStore journal will always be placed on the fastest device
>  available, so using a DB device will provide the same benefit that the
>  WAL device would while also allowing additional metadata to be stored
>  there (if it will fix)."
> 
> 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> 
> 
>   it seems to be saying that there's no reason to create separate WAL
>  and DB partitions if they are on the same device.  Specifying one
>  large DB partition per OSD will cover both uses.
> 
>  thanks,
>  Ben
> 
>  On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>   wrote:
> > On 09/21/2017 05:03 PM, Mark Nelson wrote:
> >>
> >> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> >>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>  On 2017-09-21 07:56, Lazuardi Nasution wrote:
> 
> > Hi,
> >
> > I'm still looking for the answer of these questions. Maybe
> > someone can
> > share their thought on these. Any comment will be helpful too.
> >
> > Best regards,
> >
> > On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> > 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Dietmar Rieder
On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understand, the bluestore write workflow is
>>
>> For normal big write
>> 1. Write data to block
>> 2. Update metadata to rocksdb
>> 3. Rocksdb write to memory and block.wal
>> 4. Once reach threshold, flush entries in block.wal to block.db
>>
>> For overwrite and small write
>> 1. Write data and metadata to rocksdb
>> 2. Apply the data to block
>>
>> Seems we don’t have a formula or suggestion to the size of block.db.
>> It depends on the object size and number of objects in your pool. You
>> can just give big partition to block.db to ensure all the database
>> files are on that fast partition. If block.db full, it will use block
>> to put db files, however, this will slow down the db performance. So
>> give db size as much as you can.
> 
> This is basically correct.  What's more, it's not just the object size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff.  I'm skeptical how well we can estimate
> all of this in the long run.  I wonder if we would be better served by
> just focusing on making it easy to understand how the DB device is being
> used, how much is spilling over to the block device, and make it easy to
> upgrade to a new device once it gets full.
> 
>>
>> If you want to put wal and db on same ssd, you don’t need to create
>> block.wal. It will implicitly use block.db to put wal. The only case
>> you need block.wal is that you want to separate wal to another disk.
> 
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way.  There may actually be some benefits to
> using a single partition for both if sharing a single device.

is this "Single db/wal partition" then to be used for all OSDs on a node
or do you need to create a separate "Single db/wal partition" for each
OSD on the node?

> 
>>
>> I’m also studying bluestore, this is what I know so far. Any
>> correction is welcomed.
>>
>> Thanks
>>
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>>  wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation but nobody actively confirmed the
>>> documentation was correct on this subject, either; my end state was
>>> that I was relatively confident I wasn't making some horrible mistake
>>> by simply specifying a big DB partition and letting bluestore work
>>> itself out (in my case, I've just got HDDs and SSDs that were
>>> journals under filestore), but I could not be sure there wasn't some
>>> sort of performance tuning I was missing out on by not specifying
>>> them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
 Some of this thread seems to contradict the documentation and confuses
 me.  Is the statement below correct?

 "The BlueStore journal will always be placed on the fastest device
 available, so using a DB device will provide the same benefit that the
 WAL device would while also allowing additional metadata to be stored
 there (if it will fit)."

 http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices


  it seems to be saying that there's no reason to create separate WAL
 and DB partitions if they are on the same device.  Specifying one
 large DB partition per OSD will cover both uses.

 thanks,
 Ben

 On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
  wrote:
> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>
>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
 On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi,
>
> I'm still looking for the answer of these questions. Maybe
> someone can
> share their thought on these. Any comment will be helpful too.
>
> Best regards,
>
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> > wrote:
>
>     Hi,
>
>     1. Is it possible configure use osd_data not as small
> partition on
>     OSD but a folder (ex. on root disk)? If yes, how to do that
> with
>     ceph-disk and any pros/cons of doing that?
>     2. Is WAL & DB size calculated based on OSD size or expected
>     throughput like on journal device of filestore? If no, what
> is the
>     default value and pro/cons of adjusting that?
>     3. Is partition alignment matter on Bluestore, including
> WAL & DB
>     if using separate device for them?
>
>     Best regards,
>
>

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread Mark Nelson

On 09/25/2017 03:31 AM, TYLin wrote:

Hi,

To my understand, the bluestore write workflow is

For normal big write
1. Write data to block
2. Update metadata to rocksdb
3. Rocksdb write to memory and block.wal
4. Once reach threshold, flush entries in block.wal to block.db

For overwrite and small write
1. Write data and metadata to rocksdb
2. Apply the data to block

Seems we don’t have a formula or suggestion to the size of block.db. It depends 
on the object size and number of objects in your pool. You can just give big 
partition to block.db to ensure all the database files are on that fast 
partition. If block.db full, it will use block to put db files, however, this 
will slow down the db performance. So give db size as much as you can.


This is basically correct.  What's more, it's not just the object size, 
but the number of extents, checksums, RGW bucket indices, and 
potentially other random stuff.  I'm skeptical how well we can estimate 
all of this in the long run.  I wonder if we would be better served by 
just focusing on making it easy to understand how the DB device is being 
used, how much is spilling over to the block device, and make it easy to 
upgrade to a new device once it gets full.




If you want to put wal and db on same ssd, you don’t need to create block.wal. 
It will implicitly use block.db to put wal. The only case you need block.wal is 
that you want to separate wal to another disk.


I always make explicit partitions, but only because I (potentially 
illogically) like it that way.  There may actually be some benefits to 
using a single partition for both if sharing a single device.




I’m also studying bluestore, this is what I know so far. Any correction is 
welcomed.

Thanks



On Sep 22, 2017, at 5:27 PM, Richard Hesketh  
wrote:

I asked the same question a couple of weeks ago. No response I got contradicted 
the documentation but nobody actively confirmed the documentation was correct 
on this subject, either; my end state was that I was relatively confident I 
wasn't making some horrible mistake by simply specifying a big DB partition and 
letting bluestore work itself out (in my case, I've just got HDDs and SSDs that 
were journals under filestore), but I could not be sure there wasn't some sort 
of performance tuning I was missing out on by not specifying them separately.

Rich

On 21/09/17 20:37, Benjeman Meekhof wrote:

Some of this thread seems to contradict the documentation and confuses
me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fit)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

 it seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device.  Specifying one
large DB partition per OSD will cover both uses.

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
 wrote:

On 09/21/2017 05:03 PM, Mark Nelson wrote:


On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> wrote:

Hi,

1. Is it possible configure use osd_data not as small partition on
OSD but a folder (ex. on root disk)? If yes, how to do that with
ceph-disk and any pros/cons of doing that?
2. Is WAL & DB size calculated based on OSD size or expected
throughput like on journal device of filestore? If no, what is the
default value and pro/cons of adjusting that?
3. Is partition alignment matter on Bluestore, including WAL & DB
if using separate device for them?

Best regards,





I am also looking for recommendations on wal/db partition sizes. Some
hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-25 Thread TYLin
Hi, 

To my understanding, the BlueStore write workflow is:

For normal big write
1. Write data to block
2. Update metadata to rocksdb
3. Rocksdb write to memory and block.wal
4. Once a threshold is reached, flush entries from block.wal to block.db

For overwrite and small write
1. Write data and metadata to rocksdb
2. Apply the data to block 
 
It seems we don't have a formula or suggestion for the size of block.db. It
depends on the object size and the number of objects in your pool. You can
just give a big partition to block.db to ensure all the database files are on
that fast partition. If block.db fills up, it will use block to store db
files; however, this will slow down db performance. So give the db as much
space as you can.

If you want to put the wal and db on the same SSD, you don't need to create
block.wal; the wal will implicitly live in block.db. The only case where you
need block.wal is when you want to put the wal on another disk.
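For example, a sketch with Luminous-era ceph-disk (the device paths are made
up; adjust to your hardware):

# db on a fast NVMe partition, wal implicitly inside it; data on the HDD
ceph-disk prepare --bluestore /dev/sdb --block.db /dev/nvme0n1p1

# only if you want the wal on yet another device
ceph-disk prepare --bluestore /dev/sdc --block.db /dev/nvme0n1p2 --block.wal /dev/nvme1n1p1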

I'm also studying BlueStore; this is what I know so far. Any corrections are
welcome.

Thanks


> On Sep 22, 2017, at 5:27 PM, Richard Hesketh  
> wrote:
> 
> I asked the same question a couple of weeks ago. No response I got 
> contradicted the documentation but nobody actively confirmed the 
> documentation was correct on this subject, either; my end state was that I 
> was relatively confident I wasn't making some horrible mistake by simply 
> specifying a big DB partition and letting bluestore work itself out (in my 
> case, I've just got HDDs and SSDs that were journals under filestore), but I 
> could not be sure there wasn't some sort of performance tuning I was missing 
> out on by not specifying them separately.
> 
> Rich
> 
> On 21/09/17 20:37, Benjeman Meekhof wrote:
>> Some of this thread seems to contradict the documentation and confuses
>> me.  Is the statement below correct?
>> 
>> "The BlueStore journal will always be placed on the fastest device
>> available, so using a DB device will provide the same benefit that the
>> WAL device would while also allowing additional metadata to be stored
>> there (if it will fix)."
>> 
>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>> 
>>  it seems to be saying that there's no reason to create separate WAL
>> and DB partitions if they are on the same device.  Specifying one
>> large DB partition per OSD will cover both uses.
>> 
>> thanks,
>> Ben
>> 
>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>>  wrote:
>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
 
 On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>> 
>>> Hi,
>>> 
>>> I'm still looking for the answer of these questions. Maybe someone can
>>> share their thought on these. Any comment will be helpful too.
>>> 
>>> Best regards,
>>> 
>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>> > wrote:
>>> 
>>> Hi,
>>> 
>>> 1. Is it possible configure use osd_data not as small partition on
>>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>>> ceph-disk and any pros/cons of doing that?
>>> 2. Is WAL & DB size calculated based on OSD size or expected
>>> throughput like on journal device of filestore? If no, what is the
>>> default value and pro/cons of adjusting that?
>>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>>> if using separate device for them?
>>> 
>>> Best regards,
>>> 
>>> 
>> 
>> 
>> I am also looking for recommendations on wal/db partition sizes. Some
>> hints:
>> 
>> ceph-disk defaults used in case it does not find
>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>> 
>> wal =  512MB
>> 
>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>> of it else it uses 1G.
>> 
>> There is also a presentation by Sage back in March, see page 16:
>> 
>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>> 
>> 
>> wal: 512 MB
>> 
>> db: "a few" GB
>> 
>> the wal size is probably not debatable, it will be like a journal for
>> small block sizes which are constrained by iops hence 512 MB is more
>> than enough. Probably we will see more on the db size in the future.
> This is what I understood so far.
> I wonder if it makes sense to set the db size as big as possible and
> divide entire db device is  by the number of OSDs it will serve.
> 

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-22 Thread Richard Hesketh
I asked the same question a couple of weeks ago. No response I got 
contradicted the documentation but nobody actively confirmed the 
documentation was correct on this subject, either; my end state was that 
I was relatively confident I wasn't making some horrible mistake by 
simply specifying a big DB partition and letting bluestore work itself 
out (in my case, I've just got HDDs and SSDs that were journals under 
filestore), but I could not be sure there wasn't some sort of 
performance tuning I was missing out on by not specifying them separately.


Rich

On 21/09/17 20:37, Benjeman Meekhof wrote:

Some of this thread seems to contradict the documentation and confuses
me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fit)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

  it seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device.  Specifying one
large DB partition per OSD will cover both uses.

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
 wrote:

On 09/21/2017 05:03 PM, Mark Nelson wrote:


On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> wrote:

 Hi,

 1. Is it possible configure use osd_data not as small partition on
 OSD but a folder (ex. on root disk)? If yes, how to do that with
 ceph-disk and any pros/cons of doing that?
 2. Is WAL & DB size calculated based on OSD size or expected
 throughput like on journal device of filestore? If no, what is the
 default value and pro/cons of adjusting that?
 3. Is partition alignment matter on Bluestore, including WAL & DB
 if using separate device for them?

 Best regards,





I am also looking for recommendations on wal/db partition sizes. Some
hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in the future.

This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide the entire db device by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

  (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD

Is this smart/stupid?

Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
amp but mean larger memtables and potentially higher overhead scanning
through memtables).  4x256MB buffers works pretty well, but it means
memory overhead too.  Beyond that, I'd devote the entire rest of the
device to DB partitions.


thanks for your suggestion Mark!

So, just to make sure I understood this right:

You'd use a separate 512MB-2GB WAL partition for each OSD and the
entire rest for DB partitions.

In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
partitions with each 512MB-2GB and 10 equal sized DB partitions
consuming the rest of the NVME.


Thanks
   Dietmar
--
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics





Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Benjeman Meekhof
Some of this thread seems to contradict the documentation and confuses
me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fit)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

 it seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device.  Specifying one
large DB partition per OSD will cover both uses.

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
 wrote:
> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>
>>
>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
 On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi,
>
> I'm still looking for the answer of these questions. Maybe someone can
> share their thought on these. Any comment will be helpful too.
>
> Best regards,
>
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> > wrote:
>
> Hi,
>
> 1. Is it possible configure use osd_data not as small partition on
> OSD but a folder (ex. on root disk)? If yes, how to do that with
> ceph-disk and any pros/cons of doing that?
> 2. Is WAL & DB size calculated based on OSD size or expected
> throughput like on journal device of filestore? If no, what is the
> default value and pro/cons of adjusting that?
> 3. Is partition alignment matter on Bluestore, including WAL & DB
> if using separate device for them?
>
> Best regards,
>
>



 I am also looking for recommendations on wal/db partition sizes. Some
 hints:

 ceph-disk defaults used in case it does not find
 bluestore_block_wal_size or bluestore_block_db_size in config file:

 wal =  512MB

 db = if bluestore_block_size (data size) is in config file it uses 1/100
 of it else it uses 1G.

 There is also a presentation by Sage back in March, see page 16:

 https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


 wal: 512 MB

 db: "a few" GB

 the wal size is probably not debatable, it will be like a journal for
 small block sizes which are constrained by iops hence 512 MB is more
 than enough. Probably we will see more on the db size in the future.
>>>
>>> This is what I understood so far.
>>> I wonder if it makes sense to set the db size as big as possible and
>>> divide entire db device is  by the number of OSDs it will serve.
>>>
>>> E.g. 10 OSDs / 1 NVME (800GB)
>>>
>>>  (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>>
>>> Is this smart/stupid?
>>
>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>> amp but mean larger memtables and potentially higher overhead scanning
>> through memtables).  4x256MB buffers works pretty well, but it means
>> memory overhead too.  Beyond that, I'd devote the entire rest of the
>> device to DB partitions.
>>
>
> thanks for your suggestion Mark!
>
> So, just to make sure I understood this right:
>
> You'd  use a separeate 512MB-2GB WAL partition for each OSD and the
> entire rest for DB partitions.
>
> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
> partitions with each 512MB-2GB and 10 equal sized DB partitions
> consuming the rest of the NVME.
>
>
> Thanks
>   Dietmar
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
>
>
>


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Dietmar Rieder
On 09/21/2017 05:03 PM, Mark Nelson wrote:
> 
> 
> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>
 Hi,

 I'm still looking for the answer of these questions. Maybe someone can
 share their thought on these. Any comment will be helpful too.

 Best regards,

 On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
 > wrote:

     Hi,

     1. Is it possible configure use osd_data not as small partition on
     OSD but a folder (ex. on root disk)? If yes, how to do that with
     ceph-disk and any pros/cons of doing that?
     2. Is WAL & DB size calculated based on OSD size or expected
     throughput like on journal device of filestore? If no, what is the
     default value and pro/cons of adjusting that?
     3. Is partition alignment matter on Bluestore, including WAL & DB
     if using separate device for them?

     Best regards,


>>>
>>>
>>>
>>> I am also looking for recommendations on wal/db partition sizes. Some
>>> hints:
>>>
>>> ceph-disk defaults used in case it does not find
>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
>>>
>>> wal =  512MB
>>>
>>> db = if bluestore_block_size (data size) is in config file it uses 1/100
>>> of it else it uses 1G.
>>>
>>> There is also a presentation by Sage back in March, see page 16:
>>>
>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>
>>>
>>> wal: 512 MB
>>>
>>> db: "a few" GB
>>>
>>> the wal size is probably not debatable, it will be like a journal for
>>> small block sizes which are constrained by iops hence 512 MB is more
>>> than enough. Probably we will see more on the db size in the future.
>>
>> This is what I understood so far.
>> I wonder if it makes sense to set the db size as big as possible and
>> divide entire db device is  by the number of OSDs it will serve.
>>
>> E.g. 10 OSDs / 1 NVME (800GB)
>>
>>  (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD
>>
>> Is this smart/stupid?
> 
> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
> amp but mean larger memtables and potentially higher overhead scanning
> through memtables).  4x256MB buffers works pretty well, but it means
> memory overhead too.  Beyond that, I'd devote the entire rest of the
> device to DB partitions.
> 

thanks for your suggestion Mark!

So, just to make sure I understood this right:

You'd use a separate 512MB-2GB WAL partition for each OSD and the
entire rest for DB partitions.

In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
partitions with each 512MB-2GB and 10 equal sized DB partitions
consuming the rest of the NVME.


Thanks
  Dietmar
-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics






Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Mark Nelson



On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> wrote:

Hi,

1. Is it possible configure use osd_data not as small partition on
OSD but a folder (ex. on root disk)? If yes, how to do that with
ceph-disk and any pros/cons of doing that?
2. Is WAL & DB size calculated based on OSD size or expected
throughput like on journal device of filestore? If no, what is the
default value and pro/cons of adjusting that?
3. Is partition alignment matter on Bluestore, including WAL & DB
if using separate device for them?

Best regards,






I am also looking for recommendations on wal/db partition sizes. Some hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in

wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in the future.


This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide the entire db device by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

 (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD

Is this smart/stupid?


Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write 
amp but mean larger memtables and potentially higher overhead scanning 
through memtables).  4x256MB buffers works pretty well, but it means 
memory overhead too.  Beyond that, I'd devote the entire rest of the 
device to DB partitions.
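To make that concrete, a rough sketch of carving up a single NVMe device that
way (device name and partition sizes are assumptions, not a recommendation):

# 10 x 2GB WAL partitions, then 10 DB partitions filling the rest
for i in $(seq 0 9); do
  sgdisk --new=0:0:+2G  --change-name=0:"osd.$i wal" /dev/nvme0n1
  sgdisk --new=0:0:+78G --change-name=0:"osd.$i db"  /dev/nvme0n1
done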


Mark




Dietmar
 --
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Dietmar Rieder
On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> 
>> Hi,
>>  
>> I'm still looking for the answer of these questions. Maybe someone can
>> share their thought on these. Any comment will be helpful too.
>>  
>> Best regards,
>>
>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>> > wrote:
>>
>> Hi,
>>  
>> 1. Is it possible configure use osd_data not as small partition on
>> OSD but a folder (ex. on root disk)? If yes, how to do that with
>> ceph-disk and any pros/cons of doing that?
>> 2. Is WAL & DB size calculated based on OSD size or expected
>> throughput like on journal device of filestore? If no, what is the
>> default value and pro/cons of adjusting that?
>> 3. Is partition alignment matter on Bluestore, including WAL & DB
>> if using separate device for them?
>>  
>> Best regards,
>>
>>
> 
>  
> 
> I am also looking for recommendations on wal/db partition sizes. Some hints:
> 
> ceph-disk defaults used in case it does not find
> bluestore_block_wal_size or bluestore_block_db_size in config file:
> 
> wal =  512MB
> 
> db = if bluestore_block_size (data size) is in config file it uses 1/100
> of it else it uses 1G.
> 
> There is also a presentation by Sage back in March, see page 16:
> 
> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> 
> wal: 512 MB
> 
> db: "a few" GB 
> 
> the wal size is probably not debatable, it will be like a journal for
> small block sizes which are constrained by iops hence 512 MB is more
> than enough. Probably we will see more on the db size in the future.

This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide the entire db device by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

 (800GB - 10x1GB WAL) / 10 = ~79GB db size per OSD

Is this smart/stupid?

Dietmar
 --
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-21 Thread Maged Mokhtar
On 2017-09-21 07:56, Lazuardi Nasution wrote:

> Hi, 
> 
> I'm still looking for the answer of these questions. Maybe someone can share 
> their thought on these. Any comment will be helpful too. 
> 
> Best regards, 
> 
> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution  
> wrote:
> 
>> Hi, 
>> 
>> 1. Is it possible configure use osd_data not as small partition on OSD but a 
>> folder (ex. on root disk)? If yes, how to do that with ceph-disk and any 
>> pros/cons of doing that? 
>> 2. Is WAL & DB size calculated based on OSD size or expected throughput like 
>> on journal device of filestore? If no, what is the default value and 
>> pro/cons of adjusting that? 
>> 3. Is partition alignment matter on Bluestore, including WAL & DB if using 
>> separate device for them? 
>> 
>> Best regards,
> 

I am also looking for recommendations on wal/db partition sizes. Some
hints: 

ceph-disk defaults, used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in the config file: 

wal = 512MB 

db = if bluestore_block_size (the data size) is in the config file, it uses
1/100 of it; otherwise it uses 1G. 
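If you want something other than those defaults, the same options can be set
in ceph.conf before preparing the OSDs; a sketch (sizes in bytes, values
purely illustrative):

[osd]
bluestore_block_wal_size = 536870912     # 512 MB
bluestore_block_db_size  = 10737418240   # 10 GB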

There is also a presentation by Sage back in March, see page 16: 

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB 

db: "a few" GB  

The wal size is probably not debatable: it will act like a journal for
small block sizes, which are constrained by IOPS, hence 512 MB is more
than enough. Probably we will see more on the db size in the future. 

Maged


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-20 Thread Lazuardi Nasution
Hi,

I'm still looking for the answers to these questions. Maybe someone can
share their thoughts on them. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution 
wrote:

> Hi,
>
> 1. Is it possible to configure osd_data not as a small partition on the OSD
> but as a folder (e.g. on the root disk)? If yes, how can that be done with
> ceph-disk, and what are the pros/cons of doing that?
> 2. Is the WAL & DB size calculated based on OSD size or on expected
> throughput, like the journal device of filestore? If not, what is the
> default value, and what are the pros/cons of adjusting it?
> 3. Does partition alignment matter on BlueStore, including for the WAL & DB
> when using a separate device for them?
>
> Best regards,
>