Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread Mark Nelson

On 04/01/2018 07:59 PM, Christian Balzer wrote:


Hello,

firstly, Jack pretty much correctly correlated my issues to Mark's points,
more below.

On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:


On 03/29/2018 08:59 PM, Christian Balzer wrote:


Hello,

my crappy test cluster was rendered inoperational by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_-
(yes I could have recovered the cluster by setting up a local VLAN with
the old IPs, extract the monmap, etc, but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb).

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems.
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:

Bluestore's cache is not broken by design.


During further tests I verified something that caught my attention out of
the corner of my eye when glancing at the atop output of the OSDs during my
fio runs.

Consider this fio run, after having done the same with write to populate
the file and caches (1GB per OSD default on the test cluster, 20 OSDs
total on 5 nodes):
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randread --name=fiojob --blocksize=4M --iodepth=32
---

This is being run against a kernel mounted RBD image.
On the Luminous test cluster it will read the data from the disks,
completely ignoring the pagecache on the host (as expected and desired)
AND the bluestore cache.

On a Jewel based test cluster with filestore the reads will be served from
the pagecaches of the OSD nodes, not only massively improving speed but,
more importantly, reducing spindle contention.
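
(For anyone who wants to reproduce the comparison, a quick way to confirm
where reads are actually served from; just a sketch, assuming shell access
to the OSD nodes, and the exact commands/devices will vary:)
---
# on each OSD node, note pagecache usage before and after the run
free -m

# optionally start from a cold cache so both clusters are compared fairly
echo 3 | sudo tee /proc/sys/vm/drop_caches

# during the fio run: near-idle data disks on the OSD nodes while the client
# still sees full bandwidth means the reads are coming out of a cache
iostat -x 1
---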


Filestore absolutely will be able to do better than bluestore in the 
case where a single OSD benefits by utilizing all of the memory in a 
node even at the expense of other OSDs.  One situation where this could 
be the case is RGW bucket indexes, but even there the better solution 
imho is to shard the buckets.  I'd argue though that you need to be 
careful about how you approach this.  Let's say you have a single node 
with multiple OSDs and one of those OSDs has a big set of temporarily 
hot read data.  If you let that OSD use up most of the memory on the
node to cache the data set, all of the other OSDs have to give up
something: namely, cached onodes.  That means that once your hot data is
no longer hot, all of those other OSDs will need to perform future onode 
reads from disk.  Whether or not it's beneficial to cache the hot data 
set depends on how long it's going to stay hot and how likely those 
other OSDs are going to have a read/write operation at some point in the 
future.  I'd argue that if you assume a generally mixed workload that 
generally spans multiple OSDs, you are much better off ignoring the hot 
data and simply keeping the onodes cached.
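
(To be concrete about the bucket index sharding I mentioned above, a rough
sketch using the Luminous-era radosgw-admin reshard commands; the bucket
name and shard count are made-up examples:)
---
# queue the bucket for resharding with a larger number of index shards
radosgw-admin reshard add --bucket=mybucket --num-shards=64

# run the pending reshard jobs (Luminous can also do this automatically
# when rgw_dynamic_resharding is enabled)
radosgw-admin reshard process
---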


I suspect that the more common case where bluestore looks bad is when 
someone is benchmarking reads on a single filestore OSD vs a single 
bluestore OSD and doesn't bother giving bluestore a large portion of the 
memory on the node.  Filestore can look faster than bluestore in that 
case, especially if the data set is relatively small and can fit 
entirely in memory.  In the case where you've configured bluestore to 
use most of your available memory, bluestore should be pretty close.  
For some configurations/workloads potentially faster.




My guess is that bluestore treats "direct" differently than the kernel
accessing a filestore based OSD and I'm not sure what the "correct"
behavior here is.
But somebody migrating to bluestore with such a use case and plenty of RAM
on their OSD nodes is likely to notice this and is not going to be happy
about it.


Like I said earlier, it's all about trade-offs.  The pagecache gives you 
a lot of flexibility and on slower devices the price you pay isn't 
terribly high.  On faster devices it's a bigger issue.






I'm not totally convinced that some of the trade-offs we've made with
bluestore's cache implementation are optimal, but I think you should
consider cooling your rhetoric down.


1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.

What does "wasting" RAM mean in the context of a node running ceph? Are
you upset that other applications can't come in and evict bluestore
onode, OMAP, or object data from cache?


What Jack pointed out: unless you go around and start tuning things,
all available free RAM won't be used for caching.

This raises another point: it being per-process data, and from skimming
over some bluestore threads here, if you go and raise the cache to use
most RAM during normal ops you're likely to be visited by the evil OOM
witch during heavy recovery OPS.

Whereas the good ole pagecache would just get evicted in that scenario.
Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread Simon Leinen
Christian Balzer writes:
> On Mon, 2 Apr 2018 08:33:35 +0200 John Hearns wrote:
>> Christian, you mention single socket systems for storage servers.
>> I often thought that the Xeon-D would be ideal as a building block for
>> storage servers
>> https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
>> Low power, and a complete System-On-Chip with 10gig Ethernet.
>> 
> If you (re)search the ML archives you should be able to find discussions
> about this, and I seem to remember them coming up as well.

> If you're going to have a typical setup of HDDs for storage and 1-2 SSDs
> for journal/WAL/DB, they should do well enough.

We have such systems (QuantaGrid SD1Q-1ULH with Xeon D-1541) and are
generally happy with them.  They are certainly very power-efficient.

> But in that scenario you're likely not all that latency conscious
> about NUMA issues to begin with, given that current CPU interlinks are
> quite decent.

Right.

> They however do feel underpowered when mated with really fast (NVMe) or
> more than 4 SSDs per node if you have a lot of small writes.
[...]

The new Xeon-D 2100 series looks promising.  I haven't seen any
storage-optimized servers based on it yet, though.
-- 
Simon.


Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread Christian Balzer

Hello,

On Mon, 2 Apr 2018 08:33:35 +0200 John Hearns wrote:

> Christian, you mention single socket systems for storage servers.
> I often thought that the Xeon-D would be ideal as a building block for
> storage servers
> https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
> Low power, and a complete System-On-Chip with 10gig Ethernet.
>
If you (re)search the ML archives you should be able to find discussions
about this, and I seem to remember them coming up as well.

If you're going to have a typical setup of HDDs for storage and 1-2 SSDs for
journal/WAL/DB, they should do well enough. But in that scenario
you're likely not all that latency conscious about NUMA issues to begin
with, given that current CPU interlinks are quite decent.

They however do feel underpowered when mated with really fast (NVMe) or
more than 4 SSDs per node if you have a lot of small writes.

For example with a Jewel cluster and Intel DC S3610 SSDs this fio line:
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
---
with this CPU:
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz

will leave the SSDs only about 40% busy, but use about 2.5 cores (250% in
atop) per OSD process, leaving very few free CPU cycles to go around.
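
(atop is where the 250% figure comes from; the same per-process view with
pidstat, purely as an illustration:)
---
# CPU usage of all ceph-osd processes, sampled every second
pidstat -u -C ceph-osd 1
---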

I'd look at something like this for high-end single-node systems, or just
go Epyc and drown in PCIe lanes as well for a change:
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (6 cores / 12 threads)

Christian

> I haven't been following these processors lately. Is anyone building Ceph
> clusters using them?
> 
> On 2 April 2018 at 02:59, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > firstly, Jack pretty much correctly correlated my issues to Mark's points,
> > more below.
> >
> > On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:
> >  
> > > On 03/29/2018 08:59 PM, Christian Balzer wrote:
> > >  
> > > > Hello,
> > > >
> > > > my crappy test cluster was rendered inoperational by an IP renumbering
> > > > that wasn't planned and forced on me during a DC move, so I decided to
> > > > start from scratch and explore the fascinating world of  
> > Luminous/bluestore  
> > > > and all the assorted bugs. ^_-
> > > > (yes I could have recovered the cluster by setting up a local VLAN with
> > > > the old IPs, extract the monmap, etc, but I consider the need for a
> > > > running monitor a flaw, since all the relevant data was present in the
> > > > leveldb).
> > > >
> > > > Anyways, while I've read about bluestore OSD cache in passing here, the
> > > > back of my brain was clearly still hoping that it would use  
> > pagecache/SLAB  
> > > > like other filesystems.
> > > > Which after my first round of playing with things clearly isn't the  
> > case.  
> > > >
> > > > This strikes me as a design flaw and regression because:  
> > >
> > > Bluestore's cache is not broken by design.
> > >  
> >
> > During further tests I verified something that caught my attention out of
> > the corner of my eye when glancing at the atop output of the OSDs during my
> > fio runs.
> >
> > Consider this fio run, after having done the same with write to populate
> > the file and caches (1GB per OSD default on the test cluster, 20 OSDs
> > total on 5 nodes):
> > ---
> > $ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> > --rw=randread --name=fiojob --blocksize=4M --iodepth=32
> > ---
> >
> > This is being run against a kernel mounted RBD image.
> > On the Luminous test cluster it will read the data from the disks,
> > completely ignoring the pagecache on the host (as expected and desired)
> > AND the bluestore cache.
> >
> > On a Jewel based test cluster with filestore the reads will be served from
> > the pagecaches of the OSD nodes, not only massively improving speed but,
> > more importantly, reducing spindle contention.
> >
> > My guess is that bluestore treats "direct" differently than the kernel
> > accessing a filestore based OSD and I'm not sure what the "correct"
> > behavior here is.
> > But somebody migrating to bluestore with such a use case and plenty of RAM
> > on their OSD nodes is likely to notice this and not going to be happy about
> > it.
> >
> >  
> > > I'm not totally convinced that some of the trade-offs we've made with
> > > bluestore's cache implementation are optimal, but I think you should
> > > consider cooling your rhetoric down.
> > >  
> > > > 1. Completely new users may think that bluestore defaults are fine and
> > > > waste all that RAM in their machines.  
> > >
> > > What does "wasting" RAM mean in the context of a node running ceph? Are
> > > you upset that other applications can't come in and evict bluestore
> > > onode, OMAP, or object data from cache?
> > >  
> > What Jack pointed out, unless you go around and start tuning things,
> > all available free RAM won't be used for caching.
> >
> > This raises another point, it being per process data and from skimming
> > over some bluestore 

Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread John Hearns
Christian, you mention single socket systems for storage servers.
I often thought that the Xeon-D would be ideal as a building block for
storage servers
https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
Low power, and a complete System-On-Chip with 10gig Ethernet.

I haven't been following these processors lately. Is anyone building Ceph
clusters using them?

On 2 April 2018 at 02:59, Christian Balzer  wrote:

>
> Hello,
>
> firstly, Jack pretty much correctly correlated my issues to Mark's points,
> more below.
>
> On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:
>
> > On 03/29/2018 08:59 PM, Christian Balzer wrote:
> >
> > > Hello,
> > >
> > > my crappy test cluster was rendered inoperational by an IP renumbering
> > > that wasn't planned and forced on me during a DC move, so I decided to
> > > start from scratch and explore the fascinating world of
> Luminous/bluestore
> > > and all the assorted bugs. ^_-
> > > (yes I could have recovered the cluster by setting up a local VLAN with
> > > the old IPs, extract the monmap, etc, but I consider the need for a
> > > running monitor a flaw, since all the relevant data was present in the
> > > leveldb).
> > >
> > > Anyways, while I've read about bluestore OSD cache in passing here, the
> > > back of my brain was clearly still hoping that it would use
> pagecache/SLAB
> > > like other filesystems.
> > > Which after my first round of playing with things clearly isn't the
> case.
> > >
> > > This strikes me as a design flaw and regression because:
> >
> > Bluestore's cache is not broken by design.
> >
>
> During further tests I verified something that caught my attention out of
> the corner of my eye when glancing at the atop output of the OSDs during my
> fio runs.
>
> Consider this fio run, after having done the same with write to populate
> the file and caches (1GB per OSD default on the test cluster, 20 OSDs
> total on 5 nodes):
> ---
> $ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randread --name=fiojob --blocksize=4M --iodepth=32
> ---
>
> This is being run against a kernel mounted RBD image.
> On the Luminous test cluster it will read the data from the disks,
> completely ignoring the pagecache on the host (as expected and desired)
> AND the bluestore cache.
>
> On a Jewel based test cluster with filestore the reads will be served from
> the pagecaches of the OSD nodes, not only massively improving speed but,
> more importantly, reducing spindle contention.
>
> My guess is that bluestore treats "direct" differently than the kernel
> accessing a filestore based OSD and I'm not sure what the "correct"
> behavior here is.
> But somebody migrating to bluestore with such a use case and plenty of RAM
> on their OSD nodes is likely to notice this and not going to be happy about
> it.
>
>
> > I'm not totally convinced that some of the trade-offs we've made with
> > bluestore's cache implementation are optimal, but I think you should
> > consider cooling your rhetoric down.
> >
> > > 1. Completely new users may think that bluestore defaults are fine and
> > > waste all that RAM in their machines.
> >
> > What does "wasting" RAM mean in the context of a node running ceph? Are
> > you upset that other applications can't come in and evict bluestore
> > onode, OMAP, or object data from cache?
> >
> What Jack pointed out, unless you go around and start tuning things,
> all available free RAM won't be used for caching.
>
> This raises another point, it being per process data and from skimming
> over some bluestore threads here, if you go and raise the cache to use
> most RAM during normal ops you're likely to be visited by the evil OOM
> witch during heavy recovery OPS.
>
> Whereas the good ole pagecache would just get evicted in that scenario.
>
> > > 2. Having a per OSD cache is inefficient compared to a common cache
> like
> > > pagecache, since an OSD that is busier than others would benefit from a
> > > shared cache more.
> >
> > It's only "inefficient" if you assume that using the pagecache, and more
> > generally, kernel syscalls, is free.  Yes the pagecache is convenient
> > and yes it gives you a lot of flexibility, but you pay for that
> > flexibility if you are trying to do anything fast.
> >
> > For instance, take the new KPTI patches in the kernel for meltdown. Look
> > at how badly it can hurt MyISAM database performance in MariaDB:
> >
> I, like many others here, have decided that all the Meltdown and Spectre
> patches are a bit pointless on pure OSD nodes, because if somebody on the
> node is running random code you're already in deep doodoo.
>
> That being said, I will totally concur that syscalls aren't free.
> However given the latencies induced by the rather long/complex code IOPS
> have to traverse within Ceph, how much of a gain would you say
> eliminating these particular calls did achieve?
>
> > https://mariadb.org/myisam-table-scan-performance-kpti/
> >
> > MyISAM does not have 

Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-02 Thread John Hearns
> A long time ago I was responsible for validating the performance of CXFS
> on an SGI Altix UV distributed shared-memory supercomputer.  As it turns
> out, we could achieve about 22GB/s writes with XFS (a huge number at the
> time), but CXFS was 5-10x slower.  A big part of that turned out to be the
> kernel distributing page cache across the Numalink5 interconnects to remote
> memory.
> The problem can potentially happen on any NUMA system to varying degrees.

That's very interesting. I used to manage Itanium Altixes and then a UV
system; that work sounds very interesting.
I set up cpusets on the UV system, which gave a big performance increase
since user jobs had CPUs and memory close to each other.
I also had a boot cpuset on the first blade, which had the fibrechannel
HBA, so I guess that had a similar effect in that the CXFS processes were
local to the IO card.
The UV was running SuSE - sorry.

On the subject of memory allocation, GPFS uses an amount of pagepool
memory. The usual advice seems to be to make this large.
There is one fixed pagepool on a server, even if it has multiple NSDs.
How does this compare to Ceph memory allocation?


On 31 March 2018 at 15:24, Mark Nelson  wrote:

> On 03/29/2018 08:59 PM, Christian Balzer wrote:
>
> Hello,
>>
>> my crappy test cluster was rendered inoperational by an IP renumbering
>> that wasn't planned and forced on me during a DC move, so I decided to
>> start from scratch and explore the fascinating world of Luminous/bluestore
>> and all the assorted bugs. ^_-
>> (yes I could have recovered the cluster by setting up a local VLAN with
>> the old IPs, extract the monmap, etc, but I consider the need for a
>> running monitor a flaw, since all the relevant data was present in the
>> leveldb).
>>
>> Anyways, while I've read about bluestore OSD cache in passing here, the
>> back of my brain was clearly still hoping that it would use pagecache/SLAB
>> like other filesystems.
>> Which after my first round of playing with things clearly isn't the case.
>>
>> This strikes me as a design flaw and regression because:
>>
>
> Bluestore's cache is not broken by design.
>
> I'm not totally convinced that some of the trade-offs we've made with
> bluestore's cache implementation are optimal, but I think you should
> consider cooling your rhetoric down.
>
> 1. Completely new users may think that bluestore defaults are fine and
>> waste all that RAM in their machines.
>>
>
> What does "wasting" RAM mean in the context of a node running ceph? Are
> you upset that other applications can't come in and evict bluestore onode,
> OMAP, or object data from cache?
>
> 2. Having a per OSD cache is inefficient compared to a common cache like
>> pagecache, since an OSD that is busier than others would benefit from a
>> shared cache more.
>>
>
> It's only "inefficient" if you assume that using the pagecache, and more
> generally, kernel syscalls, is free.  Yes the pagecache is convenient and
> yes it gives you a lot of flexibility, but you pay for that flexibility if
> you are trying to do anything fast.
>
> For instance, take the new KPTI patches in the kernel for meltdown. Look
> at how badly it can hurt MyISAM database performance in MariaDB:
>
> https://mariadb.org/myisam-table-scan-performance-kpti/
>
> MyISAM does not have a dedicated row cache and instead caches row data in
> the page cache as you suggest Bluestore should do for its data.  Look at
> how badly KPTI hurts performance (~40%). Now look at ARIA with a dedicated
> 128MB cache (less than 1%).  KPTI is a really good example of how much this
> stuff can hurt you, but syscalls, context switches, and page faults were
> already expensive even before meltdown.  Not to mention that right now
> bluestore keeps onodes and buffers stored in its cache in an unencoded
> form.
>
> Here's a couple of other articles worth looking at:
>
> https://eng.uber.com/mysql-migration/
> https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-melt
> down-performance.html
>
> 3. A uniform OSD cache size of course will be a nightmare when having
>> non-uniform HW, either with RAM or number of OSDs.
>>
>
> Non-Uniform hardware is a big reason that pinning dedicated memory to
> specific cores/sockets is really nice vs relying on potentially remote
> memory page cache reads.  A long time ago I was responsible for validating
> the performance of CXFS on an SGI Altix UV distributed shared-memory
> supercomputer.  As it turns out, we could achieve about 22GB/s writes with
> XFS (a huge number at the time), but CXFS was 5-10x slower.  A big part of
> that turned out to be the kernel distributing page cache across the
> Numalink5 interconnects to remote memory.  The problem can potentially
> happen on any NUMA system to varying degrees.
>
> Personally I have two primary issues with bluestore's memory configuration
> right now:
>
> 1) It's too complicated for users to figure 

Re: [ceph-users] Bluestore caching, flawed by design?

2018-04-01 Thread Christian Balzer

Hello,

firstly, Jack pretty much correctly correlated my issues to Mark's points,
more below.

On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:

> On 03/29/2018 08:59 PM, Christian Balzer wrote:
> 
> > Hello,
> >
> > my crappy test cluster was rendered inoperational by an IP renumbering
> > that wasn't planned and forced on me during a DC move, so I decided to
> > start from scratch and explore the fascinating world of Luminous/bluestore
> > and all the assorted bugs. ^_-
> > (yes I could have recovered the cluster by setting up a local VLAN with
> > the old IPs, extract the monmap, etc, but I consider the need for a
> > running monitor a flaw, since all the relevant data was present in the
> > leveldb).
> >
> > Anyways, while I've read about bluestore OSD cache in passing here, the
> > back of my brain was clearly still hoping that it would use pagecache/SLAB
> > like other filesystems.
> > Which after my first round of playing with things clearly isn't the case.
> >
> > This strikes me as a design flaw and regression because:  
> 
> Bluestore's cache is not broken by design.
> 

During further tests I verified something that caught my attention out of
the corner of my eye when glancing at the atop output of the OSDs during my
fio runs.

Consider this fio run, after having done the same with write to populate
the file and caches (1GB per OSD default on the test cluster, 20 OSDs
total on 5 nodes):
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=randread --name=fiojob --blocksize=4M --iodepth=32 
---

This is being run against a kernel mounted RBD image.
On the Luminous test cluster it will read the data from the disks,
completely ignoring the pagecache on the host (as expected and desired)
AND the bluestore cache.

On a Jewel based test cluster with filestore the reads will be served from
the pagecaches of the OSD nodes, not only massively improving speed but,
more importantly, reducing spindle contention.

My guess is that bluestore treats "direct" differently than the kernel
accessing a filestore based OSD and I'm not sure what the "correct"
behavior here is.
But somebody migrating to bluestore with such a use case and plenty of RAM
on their OSD nodes is likely to notice this and is not going to be happy
about it.


> I'm not totally convinced that some of the trade-offs we've made with 
> bluestore's cache implementation are optimal, but I think you should 
> consider cooling your rhetoric down.
> 
> > 1. Completely new users may think that bluestore defaults are fine and
> > waste all that RAM in their machines.  
> 
> What does "wasting" RAM mean in the context of a node running ceph? Are 
> you upset that other applications can't come in and evict bluestore 
> onode, OMAP, or object data from cache?
> 
What Jack pointed out: unless you go around and start tuning things,
all available free RAM won't be used for caching.

This raises another point: it being per-process data, and from skimming
over some bluestore threads here, if you go and raise the cache to use
most RAM during normal ops you're likely to be visited by the evil OOM
witch during heavy recovery OPS.

Whereas the good ole pagecache would just get evicted in that scenario.

> > 2. Having a per OSD cache is inefficient compared to a common cache like
> > pagecache, since an OSD that is busier than others would benefit from a
> > shared cache more.  
> 
> It's only "inefficient" if you assume that using the pagecache, and more 
> generally, kernel syscalls, is free.  Yes the pagecache is convenient 
> and yes it gives you a lot of flexibility, but you pay for that 
> flexibility if you are trying to do anything fast.
> 
> For instance, take the new KPTI patches in the kernel for meltdown. Look 
> at how badly it can hurt MyISAM database performance in MariaDB:
> 
I, like many others here, have decided that all the Meltdown and Spectre
patches are a bit pointless on pure OSD nodes, because if somebody on the
node is running random code you're already in deep doodoo.
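
(In practice that just means booting the pure OSD nodes with the mitigations
off; a sketch assuming a GRUB2-based distro and a kernel new enough to expose
the vulnerabilities sysfs entries, and obviously a local policy decision:)
---
# /etc/default/grub, then update-grub (or grub2-mkconfig) and reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet nopti spectre_v2=off"

# verify after the reboot
cat /sys/devices/system/cpu/vulnerabilities/meltdown
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
---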

That being said, I will totally concur that syscalls aren't free.
However given the latencies induced by the rather long/complex code IOPS
have to traverse within Ceph, how much of a gain would you say
eliminating these particular calls did achieve?

> https://mariadb.org/myisam-table-scan-performance-kpti/
> 
> MyISAM does not have a dedicated row cache and instead caches row data 
> in the page cache as you suggest Bluestore should do for its data.
> Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a 
> dedicated 128MB cache (less than 1%).  KPTI is a really good example of 
> how much this stuff can hurt you, but syscalls, context switches, and 
> page faults were already expensive even before meltdown.  Not to mention 
> that right now bluestore keeps onodes and buffers stored in its cache
> in an unencoded form.
> 
That last bit is quite relevant of course.

> Here's a couple of other articles worth looking at:
> 
> 

Re: [ceph-users] Bluestore caching, flawed by design?

2018-03-31 Thread Jack
On 03/31/2018 03:24 PM, Mark Nelson wrote:
>> 1. Completely new users may think that bluestore defaults are fine and
>> waste all that RAM in their machines.
> 
> What does "wasting" RAM mean in the context of a node running ceph? Are
> you upset that other applications can't come in and evict bluestore
> onode, OMAP, or object data from cache?

I think he was referring to your #1:
Unless I am mistaken, with bluestore you allocate some cache per OSD,
and the OSD won't use more, even if there is free memory lying around.
Thus, a "waste" of RAM.


>> 2. Having a per OSD cache is inefficient compared to a common cache like
>> pagecache, since an OSD that is busier than others would benefit from a
>> shared cache more.
> 
> It's only "inefficient" if you assume that using the pagecache, and more
> generally, kernel syscalls, is free.  Yes the pagecache is convenient
> and yes it gives you a lot of flexibility, but you pay for that
> flexibility if you are trying to do anything fast.

I think he was referring to your #2:
"Inefficient" because each OSD has a fixed cache size, unrelated to
its real usage.


To me, "flawed" is a bit extreme, bluestore is a good piece of work,
even if there is still place for improvements;



Re: [ceph-users] Bluestore caching, flawed by design?

2018-03-31 Thread Mark Nelson

On 03/29/2018 08:59 PM, Christian Balzer wrote:


Hello,

my crappy test cluster was rendered inoperational by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_-
(yes I could have recovered the cluster by setting up a local VLAN with
the old IPs, extract the monmap, etc, but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb).

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems.
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:


Bluestore's cache is not broken by design.

I'm not totally convinced that some of the trade-offs we've made with 
bluestore's cache implementation are optimal, but I think you should 
consider cooling your rhetoric down.



1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.


What does "wasting" RAM mean in the context of a node running ceph? Are 
you upset that other applications can't come in and evict bluestore 
onode, OMAP, or object data from cache?



2. Having a per OSD cache is inefficient compared to a common cache like
pagecache, since an OSD that is busier than others would benefit from a
shared cache more.


It's only "inefficient" if you assume that using the pagecache, and more 
generally, kernel syscalls, is free.  Yes the pagecache is convenient 
and yes it gives you a lot of flexibility, but you pay for that 
flexibility if you are trying to do anything fast.


For instance, take the new KPTI patches in the kernel for meltdown. Look 
at how badly it can hurt MyISAM database performance in MariaDB:


https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data 
in the page cache as you suggest Bluestore should do for its data.
Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a 
dedicated 128MB cache (less than 1%).  KPTI is a really good example of 
how much this stuff can hurt you, but syscalls, context switches, and 
page faults were already expensive even before meltdown.  Not to mention 
that right now bluestore keeps onodes and buffers stored in its cache
in an unencoded form.


Here's a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html


3. A uniform OSD cache size of course will be a nightmare when having
non-uniform HW, either with RAM or number of OSDs.


Non-uniform hardware is a big reason that pinning dedicated memory to
specific cores/sockets is really nice vs relying on potentially remote 
memory page cache reads.  A long time ago I was responsible for 
validating the performance of CXFS on an SGI Altix UV distributed 
shared-memory supercomputer.  As it turns out, we could achieve about 
22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x 
slower.  A big part of that turned out to be the kernel distributing 
page cache across the Numalink5 interconnects to remote memory.  The 
problem can potentially happen on any NUMA system to varying degrees.
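
(As an illustration of the kind of pinning I mean, a sketch only: it assumes
a two-socket box with numactl available, and in practice you would wire this
into the OSD's systemd unit rather than start it by hand:)
---
# keep osd.12's threads and allocations on the socket its HBA/NVMe hangs off
numactl --cpunodebind=0 --membind=0 \
/usr/bin/ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph
---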


Personally I have two primary issues with bluestore's memory 
configuration right now:


1) It's too complicated for users to figure out where to assign memory 
and in what ratios.  I'm attempting to improve this by making 
bluestore's cache autotuning so the user just gives it a number and 
bluestore will try to work out where it should assign memory.


2) In the case where a subset of OSDs are really hot (maybe RGW bucket 
accesses) you might want some OSDs to get more memory than others.  I 
think we can tackle this better if we migrate to a one-osd-per-node 
sharded architecture (likely based on seastar), though we'll still need 
to be very aware of remote memory.  Given that this is fairly difficult 
to do well, we're probably going to be better off just dedicating a 
static pool to each shard initially.


Mark


[ceph-users] Bluestore caching, flawed by design?

2018-03-29 Thread Christian Balzer

Hello,

my crappy test cluster was rendered inoperational by an IP renumbering
that wasn't planned and forced on me during a DC move, so I decided to
start from scratch and explore the fascinating world of Luminous/bluestore
and all the assorted bugs. ^_- 
(yes I could have recovered the cluster by setting up a local VLAN with
the old IPs, extract the monmap, etc, but I consider the need for a
running monitor a flaw, since all the relevant data was present in the
leveldb). 

Anyways, while I've read about bluestore OSD cache in passing here, the
back of my brain was clearly still hoping that it would use pagecache/SLAB
like other filesystems. 
Which after my first round of playing with things clearly isn't the case.

This strikes me as a design flaw and regression because:
1. Completely new users may think that bluestore defaults are fine and
waste all that RAM in their machines.
2. Having a per OSD cache is inefficient compared to a common cache like
pagecache, since an OSD that is busier than others would benefit from a
shared cache more.
3. A uniform OSD cache size of course will be a nightmare when having
non-uniform HW, either with RAM or number of OSDs.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications