Re: [ceph-users] Bluestore caching, flawed by design?
On 04/01/2018 07:59 PM, Christian Balzer wrote:
> Hello,
>
> firstly, Jack pretty much correctly correlated my issues to Mark's points, more below.
>
> On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:
>> On 03/29/2018 08:59 PM, Christian Balzer wrote:
>>> Hello,
>>>
>>> my crappy test cluster was rendered inoperable by an IP renumbering that wasn't planned and was forced on me during a DC move, so I decided to start from scratch and explore the fascinating world of Luminous/bluestore and all the assorted bugs. ^_-
>>> (yes, I could have recovered the cluster by setting up a local VLAN with the old IPs, extracting the monmap, etc., but I consider the need for a running monitor a flaw, since all the relevant data was present in the leveldb).
>>>
>>> Anyway, while I've read about the bluestore OSD cache in passing here, the back of my brain was clearly still hoping that it would use pagecache/SLAB like other filesystems.
>>> Which, after my first round of playing with things, clearly isn't the case.
>>>
>>> This strikes me as a design flaw and regression because:
>>
>> Bluestore's cache is not broken by design.
>
> During further tests I verified something that caught my attention out of the corner of my eye when glancing at atop output of the OSDs during my fio runs.
>
> Consider this fio run, after having done the same with writes to populate the file and caches (1GB per OSD default on the test cluster, 20 OSDs total on 5 nodes):
> ---
> $ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4M --iodepth=32
> ---
>
> This is being run against a kernel-mounted RBD image.
> On the Luminous test cluster it will read the data from the disks, completely ignoring the pagecache on the host (as expected and desired) AND the bluestore cache.
>
> On a Jewel-based test cluster with filestore the reads will be served from the pagecaches of the OSD nodes, not only massively improving speed but, more importantly, reducing spindle contention.

Filestore absolutely will be able to do better than bluestore in the case where a single OSD benefits by utilizing all of the memory in a node, even at the expense of other OSDs. One situation where this could be the case is RGW bucket indexes, but even there the better solution imho is to shard the buckets.

I'd argue though that you need to be careful about how you approach this. Let's say you have a single node with multiple OSDs and one of those OSDs has a big set of temporarily hot read data. If you let that OSD use up most of the memory on the node to cache the data set, all of the other OSDs have to give up something: namely, cached onodes. That means that once your hot data is no longer hot, all of those other OSDs will need to perform future onode reads from disk. Whether or not it's beneficial to cache the hot data set depends on how long it's going to stay hot and how likely those other OSDs are to see a read/write operation at some point in the future. I'd argue that if you assume a generally mixed workload that spans multiple OSDs, you are much better off ignoring the hot data and simply keeping the onodes cached.

I suspect that the more common case where bluestore looks bad is when someone is benchmarking reads on a single filestore OSD vs a single bluestore OSD and doesn't bother giving bluestore a large portion of the memory on the node. Filestore can look faster than bluestore in that case, especially if the data set is relatively small and can fit entirely in memory. In the case where you've configured bluestore to use most of your available memory, bluestore should be pretty close, and for some configurations/workloads potentially faster.

> My guess is that bluestore treats "direct" differently than the kernel accessing a filestore-based OSD, and I'm not sure what the "correct" behavior here is.
> But somebody migrating to bluestore with such a use case and plenty of RAM on their OSD nodes is likely to notice this and not going to be happy about it.

Like I said earlier, it's all about trade-offs. The pagecache gives you a lot of flexibility, and on slower devices the price you pay isn't terribly high. On faster devices it's a bigger issue.

>> I'm not totally convinced that some of the trade-offs we've made with bluestore's cache implementation are optimal, but I think you should consider cooling your rhetoric down.
>>
>>> 1. Completely new users may think that bluestore defaults are fine and waste all that RAM in their machines.
>>
>> What does "wasting" RAM mean in the context of a node running ceph? Are you upset that other applications can't come in and evict bluestore onode, OMAP, or object data from cache?
>
> What Jack pointed out: unless you go around and start tuning things, all available free RAM won't be used for caching.
>
> This raises another point, the cache being per-process data: from skimming over some bluestore threads here, if you go and raise the cache to use most RAM during normal ops you're likely to be visited by the evil OOM witch during heavy recovery OPS.
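For concreteness, the per-OSD cache being debated here is set through a handful of ceph.conf options in Luminous. A minimal sketch follows; the sizes are purely illustrative, not recommendations, and (per the OOM point above) the OSD process needs a good chunk of memory on top of the cache itself, so size against recovery peaks rather than idle usage:

---
[osd]
# Per-OSD bluestore cache cap. Leaving bluestore_cache_size at 0 falls
# back to the per-device-class values (Luminous defaults: roughly 1 GB
# for HDD-backed and 3 GB for SSD-backed OSDs).
bluestore_cache_size = 0
bluestore_cache_size_hdd = 4294967296   # illustrative: 4 GiB per HDD OSD
bluestore_cache_size_ssd = 8589934592   # illustrative: 8 GiB per SSD OSD
# Split of the cache between the rocksdb (kv) block cache and onode
# metadata; object data gets what the ratios leave over. Defaults vary
# across point releases, so check your release before copying these.
bluestore_cache_kv_ratio = 0.4
bluestore_cache_meta_ratio = 0.4
---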
Re: [ceph-users] Bluestore caching, flawed by design?
Christian Balzer writes:
> On Mon, 2 Apr 2018 08:33:35 +0200 John Hearns wrote:
>> Christian, you mention single socket systems for storage servers.
>> I often thought that the Xeon-D would be ideal as a building block for storage servers:
>> https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
>> Low power, and a complete System-On-Chip with 10gig Ethernet.
>
> If you (re)search the ML archives you should be able to find discussions about this, and I seem to remember them coming up as well.
> If you're going to have a typical setup of HDDs for storage and 1-2 SSDs for journal/WAL/DB, they should do well enough.

We have such systems (QuantaGrid SD1Q-1ULH with Xeon D-1541) and are generally happy with them. They are certainly very power-efficient.

> But in that scenario you're likely not all that latency-conscious about NUMA issues to begin with, given that current CPU interlinks are quite decent.

Right.

> They however do feel underpowered when mated with really fast (NVMe) or more than 4 SSDs per node if you have a lot of small writes.
[...]

The new Xeon D-2100 series looks promising. I haven't seen any storage-optimized servers based on it yet, though.

-- 
Simon.
Re: [ceph-users] Bluestore caching, flawed by design?
Hello,

On Mon, 2 Apr 2018 08:33:35 +0200 John Hearns wrote:
> Christian, you mention single socket systems for storage servers.
> I often thought that the Xeon-D would be ideal as a building block for storage servers:
> https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
> Low power, and a complete System-On-Chip with 10gig Ethernet.

If you (re)search the ML archives you should be able to find discussions about this, and I seem to remember them coming up as well.

If you're going to have a typical setup of HDDs for storage and 1-2 SSDs for journal/WAL/DB, they should do well enough. But in that scenario you're likely not all that latency-conscious about NUMA issues to begin with, given that current CPU interlinks are quite decent.

They however do feel underpowered when mated with really fast (NVMe) or more than 4 SSDs per node if you have a lot of small writes. For example, with a Jewel cluster and Intel DC S3610 SSDs this fio line:
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
---
with this CPU:
Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
will leave the SSDs only about 40% busy, but use about 2.5 cores (250% in atop) per OSD process, leaving very few free CPU cycles to go around.

I'd look at something with these for high-end single-node systems, or just go Epyc and drown in PCIe lanes as well for a change:
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (12/6 cores)

Christian

> I haven't been following these processors lately. Is anyone building Ceph clusters using them?
>
> On 2 April 2018 at 02:59, Christian Balzer wrote:
> [...]
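For anyone wanting to reproduce the "SSDs 40% busy, 2.5 cores per OSD" observation above, a rough sketch using the standard sysstat tools on an OSD node while the fio run is active (assuming ceph-osd is the exact process name on your distribution):

---
# Device utilisation; %util around 40 matches the numbers above:
$ iostat -x 5
# Per-process CPU; ~250 %CPU per ceph-osd is what atop reported:
$ pidstat -u -p "$(pgrep -d, -x ceph-osd)" 5
---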
Re: [ceph-users] Bluestore caching, flawed by design?
Christian, you mention single socket systems for storage servers.
I often thought that the Xeon-D would be ideal as a building block for storage servers:
https://www.intel.com/content/www/us/en/products/processors/xeon/d-processors.html
Low power, and a complete System-On-Chip with 10gig Ethernet.

I haven't been following these processors lately. Is anyone building Ceph clusters using them?

On 2 April 2018 at 02:59, Christian Balzer wrote:
> Hello,
>
> firstly, Jack pretty much correctly correlated my issues to Mark's points, more below.
> [...]
Re: [ceph-users] Bluestore caching, flawed by design?
> A long time ago I was responsible for validating the performance of CXFS on an SGI Altix UV distributed shared-memory supercomputer. As it turns out, we could achieve about 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x slower. A big part of that turned out to be the kernel distributing page cache across the Numalink5 interconnects to remote memory.
> The problem can potentially happen on any NUMA system to varying degrees.

That's very interesting. I used to manage Itanium Altixes and then a UV system, so that work sounds very familiar. I set up cpusets on the UV system, which gave a big performance increase since user jobs had CPUs and memory close to each other. I also had a boot cpuset on the first blade, which held the fibrechannel HBA, so I guess that had a similar effect in that the CXFS processes were local to the IO card. The UV was running SuSE - sorry.

On the subject of memory allocation, GPFS uses an amount of pagepool memory, and the given advice always seems to be to make it large. There is one fixed pagepool per server, even if it has multiple NSDs. How does this compare to Ceph's memory allocation?

On 31 March 2018 at 15:24, Mark Nelson wrote:
> On 03/29/2018 08:59 PM, Christian Balzer wrote:
> [...]
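One way to get the "CPUs and memory close to each other" effect of cpusets described above, applied to Ceph, is to bind each OSD to the NUMA node its storage controller hangs off. A hedged sketch using numactl; the PCI address and OSD id are placeholders, and in practice you would wrap the systemd unit's ExecStart rather than run the daemon by hand:

---
# Which NUMA node is the HBA/NVMe attached to? (PCI address is an
# example; find yours with lspci.)
$ cat /sys/bus/pci/devices/0000:81:00.0/numa_node
0
# Start an OSD bound to that node's CPUs and memory:
$ numactl --cpunodebind=0 --membind=0 \
    ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph
---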
Re: [ceph-users] Bluestore caching, flawed by design?
Hello,

firstly, Jack pretty much correctly correlated my issues to Mark's points, more below.

On Sat, 31 Mar 2018 08:24:45 -0500 Mark Nelson wrote:
> On 03/29/2018 08:59 PM, Christian Balzer wrote:
>> Hello,
>>
>> my crappy test cluster was rendered inoperable by an IP renumbering that wasn't planned and was forced on me during a DC move, so I decided to start from scratch and explore the fascinating world of Luminous/bluestore and all the assorted bugs. ^_-
>> (yes, I could have recovered the cluster by setting up a local VLAN with the old IPs, extracting the monmap, etc., but I consider the need for a running monitor a flaw, since all the relevant data was present in the leveldb).
>>
>> Anyway, while I've read about the bluestore OSD cache in passing here, the back of my brain was clearly still hoping that it would use pagecache/SLAB like other filesystems.
>> Which, after my first round of playing with things, clearly isn't the case.
>>
>> This strikes me as a design flaw and regression because:
>
> Bluestore's cache is not broken by design.

During further tests I verified something that caught my attention out of the corner of my eye when glancing at atop output of the OSDs during my fio runs.

Consider this fio run, after having done the same with writes to populate the file and caches (1GB per OSD default on the test cluster, 20 OSDs total on 5 nodes):
---
$ fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4M --iodepth=32
---

This is being run against a kernel-mounted RBD image.
On the Luminous test cluster it will read the data from the disks, completely ignoring the pagecache on the host (as expected and desired) AND the bluestore cache.

On a Jewel-based test cluster with filestore the reads will be served from the pagecaches of the OSD nodes, not only massively improving speed but, more importantly, reducing spindle contention.

My guess is that bluestore treats "direct" differently than the kernel accessing a filestore-based OSD, and I'm not sure what the "correct" behavior here is.
But somebody migrating to bluestore with such a use case and plenty of RAM on their OSD nodes is likely to notice this and not going to be happy about it.

> I'm not totally convinced that some of the trade-offs we've made with bluestore's cache implementation are optimal, but I think you should consider cooling your rhetoric down.
>
>> 1. Completely new users may think that bluestore defaults are fine and waste all that RAM in their machines.
>
> What does "wasting" RAM mean in the context of a node running ceph? Are you upset that other applications can't come in and evict bluestore onode, OMAP, or object data from cache?

What Jack pointed out: unless you go around and start tuning things, all available free RAM won't be used for caching.

This raises another point, the cache being per-process data: from skimming over some bluestore threads here, if you go and raise the cache to use most RAM during normal ops you're likely to be visited by the evil OOM witch during heavy recovery OPS.

Whereas the good ole pagecache would just get evicted in that scenario.

>> 2. Having a per OSD cache is inefficient compared to a common cache like pagecache, since an OSD that is busier than others would benefit from a shared cache more.
>
> It's only "inefficient" if you assume that using the pagecache, and more generally, kernel syscalls, is free. Yes the pagecache is convenient and yes it gives you a lot of flexibility, but you pay for that flexibility if you are trying to do anything fast.
>
> For instance, take the new KPTI patches in the kernel for meltdown. Look at how badly it can hurt MyISAM database performance in MariaDB:

I, like many others here, have decided that all the Meltdown and Spectre patches are a bit pointless on pure OSD nodes, because if somebody on the node is running random code you're already in deep doodoo.

That being said, I will totally concur that syscalls aren't free.
However, given the latencies induced by the rather long/complex code path IOPS have to traverse within Ceph, how much of a gain would you say eliminating these particular calls did achieve?

> https://mariadb.org/myisam-table-scan-performance-kpti/
>
> MyISAM does not have a dedicated row cache and instead caches row data in the page cache, as you suggest Bluestore should do for its data.
> Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a dedicated 128MB cache (less than 1%). KPTI is a really good example of how much this stuff can hurt you, but syscalls, context switches, and page faults were already expensive even before meltdown. Not to mention that right now bluestore keeps onodes and buffers stored in its cache in an unencoded form.

That last bit is quite relevant of course.

> Here's a couple of other articles worth looking at:
>
> https://eng.uber.com/mysql-migration/
> https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
> http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html
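As a rough way to put a number on Christian's "how much of a gain" question: one can at least count how often an OSD crosses the kernel boundary under load. A sketch assuming perf and kernel tracepoints are available on the OSD node; this measures the frequency of the events, not the per-call cost that KPTI inflates:

---
# Context switches and pread() entries for one ceph-osd over 10 seconds
# while the fio run is active:
$ perf stat -e context-switches -e syscalls:sys_enter_pread64 \
    -p "$(pgrep -o -x ceph-osd)" -- sleep 10
---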
Re: [ceph-users] Bluestore caching, flawed by design?
On 03/31/2018 03:24 PM, Mark Nelson wrote:
>> 1. Completely new users may think that bluestore defaults are fine and waste all that RAM in their machines.
>
> What does "wasting" RAM mean in the context of a node running ceph? Are you upset that other applications can't come in and evict bluestore onode, OMAP, or object data from cache?

Regarding your #1, I think what he meant is: unless I am mistaken, with bluestore you allocate some cache per OSD, and the OSD won't use more, even if there is free memory lying around.
Thus, a "waste" of RAM.

>> 2. Having a per OSD cache is inefficient compared to a common cache like pagecache, since an OSD that is busier than others would benefit from a shared cache more.
>
> It's only "inefficient" if you assume that using the pagecache, and more generally, kernel syscalls, is free. Yes the pagecache is convenient and yes it gives you a lot of flexibility, but you pay for that flexibility if you are trying to do anything fast.

Regarding your #2, I think he called it "inefficient" because each OSD has a fixed cache size, unrelated to its real usage.

To me, "flawed" is a bit extreme; bluestore is a good piece of work, even if there is still room for improvement.
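The fixed ceiling Jack describes is easy to confirm through the admin socket: the bluestore mempool counters plateau around the configured cache size no matter how much free RAM the node has. A sketch (the exact mempool names vary somewhat between releases):

---
# On the OSD node; watch items/bytes in the bluestore cache pools
# (bluestore_cache_data, bluestore_cache_onode, buffer_anon, ...):
$ ceph daemon osd.0 dump_mempools
---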
Re: [ceph-users] Bluestore caching, flawed by design?
On 03/29/2018 08:59 PM, Christian Balzer wrote:
> Hello,
>
> my crappy test cluster was rendered inoperable by an IP renumbering that wasn't planned and was forced on me during a DC move, so I decided to start from scratch and explore the fascinating world of Luminous/bluestore and all the assorted bugs. ^_-
> (yes, I could have recovered the cluster by setting up a local VLAN with the old IPs, extracting the monmap, etc., but I consider the need for a running monitor a flaw, since all the relevant data was present in the leveldb).
>
> Anyway, while I've read about the bluestore OSD cache in passing here, the back of my brain was clearly still hoping that it would use pagecache/SLAB like other filesystems.
> Which, after my first round of playing with things, clearly isn't the case.
>
> This strikes me as a design flaw and regression because:

Bluestore's cache is not broken by design.

I'm not totally convinced that some of the trade-offs we've made with bluestore's cache implementation are optimal, but I think you should consider cooling your rhetoric down.

> 1. Completely new users may think that bluestore defaults are fine and waste all that RAM in their machines.

What does "wasting" RAM mean in the context of a node running ceph? Are you upset that other applications can't come in and evict bluestore onode, OMAP, or object data from cache?

> 2. Having a per OSD cache is inefficient compared to a common cache like pagecache, since an OSD that is busier than others would benefit from a shared cache more.

It's only "inefficient" if you assume that using the pagecache, and more generally, kernel syscalls, is free. Yes the pagecache is convenient and yes it gives you a lot of flexibility, but you pay for that flexibility if you are trying to do anything fast.

For instance, take the new KPTI patches in the kernel for meltdown. Look at how badly it can hurt MyISAM database performance in MariaDB:

https://mariadb.org/myisam-table-scan-performance-kpti/

MyISAM does not have a dedicated row cache and instead caches row data in the page cache, as you suggest Bluestore should do for its data. Look at how badly KPTI hurts performance (~40%). Now look at ARIA with a dedicated 128MB cache (less than 1%). KPTI is a really good example of how much this stuff can hurt you, but syscalls, context switches, and page faults were already expensive even before meltdown. Not to mention that right now bluestore keeps onodes and buffers stored in its cache in an unencoded form.

Here's a couple of other articles worth looking at:

https://eng.uber.com/mysql-migration/
https://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
http://www.brendangregg.com/blog/2018-02-09/kpti-kaiser-meltdown-performance.html

> 3. A uniform OSD cache size of course will be a nightmare when having non-uniform HW, either with RAM or number of OSDs.

Non-uniform hardware is a big reason that pinning dedicated memory to specific cores/sockets is really nice vs relying on potentially remote memory pagecache reads. A long time ago I was responsible for validating the performance of CXFS on an SGI Altix UV distributed shared-memory supercomputer. As it turns out, we could achieve about 22GB/s writes with XFS (a huge number at the time), but CXFS was 5-10x slower. A big part of that turned out to be the kernel distributing page cache across the Numalink5 interconnects to remote memory. The problem can potentially happen on any NUMA system to varying degrees.

Personally I have two primary issues with bluestore's memory configuration right now:

1) It's too complicated for users to figure out where to assign memory and in what ratios. I'm attempting to improve this by making bluestore's cache autotuning, so the user just gives it a number and bluestore will try to work out where it should assign memory.

2) In the case where a subset of OSDs are really hot (maybe RGW bucket accesses) you might want some OSDs to get more memory than others. I think we can tackle this better if we migrate to a one-osd-per-node sharded architecture (likely based on seastar), though we'll still need to be very aware of remote memory. Given that this is fairly difficult to do well, we're probably going to be better off just dedicating a static pool to each shard initially.

Mark
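For readers finding this thread later: the autotuning sketched in (1) did ship in subsequent Ceph releases as a single per-OSD memory target, roughly along these lines (the option is from post-Luminous releases, so treat this as a forward reference rather than something available as of this thread):

---
[osd]
# One number per OSD; bluestore grows and shrinks its caches to keep
# the whole process near this target.
osd_memory_target = 4294967296   # 4 GiB
---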
[ceph-users] Bluestore caching, flawed by design?
Hello,

my crappy test cluster was rendered inoperable by an IP renumbering that wasn't planned and was forced on me during a DC move, so I decided to start from scratch and explore the fascinating world of Luminous/bluestore and all the assorted bugs. ^_-
(yes, I could have recovered the cluster by setting up a local VLAN with the old IPs, extracting the monmap, etc., but I consider the need for a running monitor a flaw, since all the relevant data was present in the leveldb).

Anyway, while I've read about the bluestore OSD cache in passing here, the back of my brain was clearly still hoping that it would use pagecache/SLAB like other filesystems.
Which, after my first round of playing with things, clearly isn't the case.

This strikes me as a design flaw and regression because:

1. Completely new users may think that bluestore defaults are fine and waste all that RAM in their machines.

2. Having a per OSD cache is inefficient compared to a common cache like pagecache, since an OSD that is busier than others would benefit from a shared cache more.

3. A uniform OSD cache size of course will be a nightmare when having non-uniform HW, either with RAM or number of OSDs.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
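A quick way to see the behaviour being complained about, assuming shell access to an OSD node: run a large read against an RBD image (as in the fio examples earlier in the thread) and watch memory on the node. Under filestore the kernel's buff/cache figure balloons as reads are served; under bluestore it barely moves, and each ceph-osd's RSS instead sits near its configured cache ceiling:

---
# Kernel-level view; 'buff/cache' growth is the filestore/pagecache
# signature:
$ free -h
# Per-process view; bluestore caching shows up as ceph-osd RSS instead:
$ ps -o pid,rss,comm -C ceph-osd
---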