Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
Hello,

On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:

Hi Greg,

Thanks for your input; I completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance-related way. That said, if you or others could spare some time for a few pointers it would be much appreciated, and I will endeavour to create some useful results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful to understand its effect. With it off, I do see initially quite a large performance increase, but over time it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get too far ahead, leaving it with massive syncs to carry out.

One thing I do see with the WBT enabled, and to some extent with it disabled, is that there are large periods of small-block writes at the max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64k IOs for 500MB) where this behaviour can be seen.

If you're doing 64k IOs then I believe it's creating a new on-disk file for each of those writes. How that's laid out on disk will depend on your filesystem and the specific config options that we're using to try to avoid running too far ahead of the journal.

Could you elaborate on that a bit? I would have expected those 64KB writes to go to the same object (file) until it is full (4MB), because this behavior would explain some (if not all) of the write amplification I've seen in the past with small writes (see the SSD Hardware recommendation thread).

Christian

I think you're just using these config options in conflict with each other. You've set the min sync time to 20 seconds for some reason, presumably to try and batch stuff up? So in that case you probably want to let your journal run for twenty seconds' worth of backing-disk IO before you start throttling it, and probably 10-20 seconds' worth of IO before forcing file flushes. That means increasing the throttle limits while still leaving the flusher enabled.

-Greg

http://www.sys-pro.co.uk/misc/wbt_on.png
http://www.sys-pro.co.uk/misc/wbt_off.png

I would really appreciate it if someone could comment on why this type of behaviour happens. As can be seen in the trace, if the blocks are submitted to the disk as larger IOs and with higher concurrency, hundreds of MB of data can be flushed in seconds. Is this something specific to the filesystem behaviour, which Ceph cannot influence, like dirty filesystem metadata/inodes which can't be merged into larger IOs? For sequential writes, I would have thought that in an optimum scenario a spinning disk should be able to almost maintain its large-block write speed (100MB/s) no matter the underlying block size. That being said, from what I understand, when a sync is called it will try and flush all dirty data, so the end result is probably slightly different to a traditional battery-backed write-back cache.

Chris, would you be interested in forming a ceph-users based performance team? There's a developer performance meeting which is mainly concerned with improving the internals of Ceph.

There is also a raft of information in the mailing list archives where people have said "hey, look at my SSD speed at x, y, z settings", but making comparisons or recommendations is not that easy. It may also reduce a lot of the repetitive posts of "why is X so slow", etc.

--
Christian Balzer
Network/Systems Engineer
ch...@gol.com
Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
I think this could be part of what I am seeing. I found this post from back in 2013, http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083, which seems to describe a workaround for the behaviour I am seeing.

The constant small-block IO I was seeing looks like it was either the pg log and info updates, or FS metadata. I have been going through the blktraces I did today, and 90% of the time I am just seeing 8kb writes and journal writes. I think the journal and filestore settings I have been adjusting have just been moving the data sync around the benchmark timeline and altering when the journal starts throttling. It seems that with small IOs the metadata overhead takes several times longer than the actual data writing. This probably also explains why a full-SSD OSD is faster than HDD+SSD, even for brief bursts of IO.

In the thread I posted above, it seems that adding something like flashcache can massively help overcome this problem, so this is something I might look into. It's a shame I didn't get BBWC with my OSD nodes, as this would have also likely alleviated this problem with a lot less hassle.

Ah, no, you're right. With the bench command it all goes into one object; it's just a separate transaction for each 64k write. But again, depending on flusher and throttler settings in the OSD, and the backing FS' configuration, it can be a lot of individual updates — in particular, every time there's a sync it has to update the inode. Certainly that'll be the case in the described configuration, with relatively low writeahead limits on the journal but high sync intervals — once you hit the limits, every write will get an immediate flush request. But none of that should have much impact on your write amplification tests unless you're actually using osd bench to test it. You're more likely to be seeing the overhead of the pg log entry, pg info change, etc. that's associated with each write.
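For anyone wanting to capture traces like the ones referenced in this thread, a minimal sketch, assuming blktrace and seekwatcher are installed and that /dev/sdd is the OSD data disk (names and durations are placeholders, adjust to your setup):

# record 60 seconds of block-level IO on the OSD data disk while a bench runs
blktrace -d /dev/sdd -o osd_trace -w 60

# render a seekwatcher graph (IO size, seeks, throughput over time) from the trace
seekwatcher -t osd_trace -o osd_trace.png

The long runs of small writes discussed above show up clearly in the IOPS and seek panels of the resulting graph.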
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 18, 2015 at 11:10 PM, Christian Balzer ch...@gol.com wrote:

Hello,

On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:

Hi Greg,

Thanks for your input; I completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance-related way. That said, if you or others could spare some time for a few pointers it would be much appreciated, and I will endeavour to create some useful results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful to understand its effect. With it off, I do see initially quite a large performance increase, but over time it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get too far ahead, leaving it with massive syncs to carry out.

One thing I do see with the WBT enabled, and to some extent with it disabled, is that there are large periods of small-block writes at the max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64k IOs for 500MB) where this behaviour can be seen.

If you're doing 64k IOs then I believe it's creating a new on-disk file for each of those writes. How that's laid out on disk will depend on your filesystem and the specific config options that we're using to try to avoid running too far ahead of the journal.

Could you elaborate on that a bit? I would have expected those 64KB writes to go to the same object (file) until it is full (4MB), because this behavior would explain some (if not all) of the write amplification I've seen in the past with small writes (see the SSD Hardware recommendation thread).

Ah, no, you're right. With the bench command it all goes into one object; it's just a separate transaction for each 64k write. But again, depending on flusher and throttler settings in the OSD, and the backing FS' configuration, it can be a lot of individual updates — in particular, every time there's a sync it has to update the inode. Certainly that'll be the case in the described configuration, with relatively low writeahead limits on the journal but high sync intervals — once you hit the limits, every write will get an immediate flush request. But none of that should have much impact on your write amplification tests unless you're actually using osd bench to test it. You're more likely to be seeing the overhead of the pg log entry, pg info change, etc. that's associated with each write.
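If you want to watch those per-write filestore transactions and syncs directly, one approach is to raise the filestore debug level on a running OSD and follow its log. A sketch, assuming the default log location (revert the level afterwards, the log grows quickly):

# bump filestore debugging on osd.1
ceph tell osd.1 injectargs '--debug_filestore 10'

# watch the individual transactions and syncs as they happen
tail -f /var/log/ceph/ceph-osd.1.log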
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
Hi Greg,

Thanks for your input; I completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance-related way. That said, if you or others could spare some time for a few pointers it would be much appreciated, and I will endeavour to create some useful results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful to understand its effect. With it off, I do see initially quite a large performance increase, but over time it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get too far ahead, leaving it with massive syncs to carry out.

One thing I do see with the WBT enabled, and to some extent with it disabled, is that there are large periods of small-block writes at the max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64k IOs for 500MB) where this behaviour can be seen.

http://www.sys-pro.co.uk/misc/wbt_on.png
http://www.sys-pro.co.uk/misc/wbt_off.png

I would really appreciate it if someone could comment on why this type of behaviour happens. As can be seen in the trace, if the blocks are submitted to the disk as larger IOs and with higher concurrency, hundreds of MB of data can be flushed in seconds. Is this something specific to the filesystem behaviour, which Ceph cannot influence, like dirty filesystem metadata/inodes which can't be merged into larger IOs? For sequential writes, I would have thought that in an optimum scenario a spinning disk should be able to almost maintain its large-block write speed (100MB/s) no matter the underlying block size. That being said, from what I understand, when a sync is called it will try and flush all dirty data, so the end result is probably slightly different to a traditional battery-backed write-back cache.

Chris, would you be interested in forming a ceph-users based performance team? There's a developer performance meeting which is mainly concerned with improving the internals of Ceph. There is also a raft of information in the mailing list archives where people have said "hey, look at my SSD speed at x, y, z settings", but making comparisons or recommendations is not that easy. It may also reduce a lot of the repetitive posts of "why is X so slow", etc.

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: 16 March 2015 23:57
To: Christian Balzer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

On Mon, Mar 16, 2015 at 4:46 PM, Christian Balzer ch...@gol.com wrote:

On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:

Nothing here particularly surprises me. I don't remember all the details of the filestore's rate limiting off the top of my head, but it goes to great lengths to try and avoid letting the journal get too far ahead of the backing store. Disabling the filestore flusher and increasing the sync intervals without also increasing the filestore_wbthrottle_* limits is not going to work well for you.

-Greg

While very true, and what I recalled (backing store being kicked off early) from earlier mails, I think having every last configuration parameter documented in a way that doesn't reduce people to guesswork would be very helpful.

PRs welcome! ;)

More seriously, we create a lot of config options and it's not always clear when doing so which ones should be changed by users or not. And a lot of them (case in point: anything to do with changing journal and FS interactions) should only be changed by people who really understand them, because it's possible (as evidenced) to really bust up your cluster's performance enough that it's basically broken. Historically that's meant people who can read the code and understand it, although we might now have enough people at a mid-line that it's worth going back and documenting. There's not a lot of pressure coming from anybody to do that work in comparison to other stuff like "make CephFS supported" and "make RADOS faster", though, for understandable reasons. So while we can try and document these things some in future, the names of things here are really pretty self-explanatory, and the sort of configuration reference guide I think you're asking for (i.e., here are all the settings to change if you are running on SSDs, and here's how they're related) is not the kind of thing that developers produce. That comes out of the community or is produced by support contracts. ...so I guess I've circled back around to "PRs welcome!"

For example filestore_wbthrottle_xfs_inodes_start_flusher, which defaults to 500. Assuming that this means to start flushing once 500 inodes have accumulated, how would Ceph even know how many inodes are needed for the data present?
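For reference, here is the knob family under discussion, with what I believe were the defaults around this release (values recalled from the filestore documentation of the era, so treat them as assumptions and verify with `ceph daemon osd.N config show | grep wbthrottle` on your own cluster):

[osd]
filestore_wbthrottle_enable = true
# the "xfs" variants; as far as I can tell the same code path is used for
# ext4 as well, and only btrfs has its own set of options
filestore_wbthrottle_xfs_bytes_start_flusher = 41943040    # ~40MB dirty: start flushing
filestore_wbthrottle_xfs_bytes_hard_limit = 419430400      # ~400MB dirty: block new writes
filestore_wbthrottle_xfs_ios_start_flusher = 500
filestore_wbthrottle_xfs_ios_hard_limit = 5000
filestore_wbthrottle_xfs_inodes_start_flusher = 500
filestore_wbthrottle_xfs_inodes_hard_limit = 5000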
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:

Hi Greg,

Thanks for your input; I completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance-related way. That said, if you or others could spare some time for a few pointers it would be much appreciated, and I will endeavour to create some useful results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful to understand its effect. With it off, I do see initially quite a large performance increase, but over time it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get too far ahead, leaving it with massive syncs to carry out.

One thing I do see with the WBT enabled, and to some extent with it disabled, is that there are large periods of small-block writes at the max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64k IOs for 500MB) where this behaviour can be seen.

If you're doing 64k IOs then I believe it's creating a new on-disk file for each of those writes. How that's laid out on disk will depend on your filesystem and the specific config options that we're using to try to avoid running too far ahead of the journal.

I think you're just using these config options in conflict with each other. You've set the min sync time to 20 seconds for some reason, presumably to try and batch stuff up? So in that case you probably want to let your journal run for twenty seconds' worth of backing disk IO before you start throttling it, and probably 10-20 seconds' worth of IO before forcing file flushes. That means increasing the throttle limits while still leaving the flusher enabled.

-Greg

http://www.sys-pro.co.uk/misc/wbt_on.png
http://www.sys-pro.co.uk/misc/wbt_off.png

I would really appreciate it if someone could comment on why this type of behaviour happens. As can be seen in the trace, if the blocks are submitted to the disk as larger IOs and with higher concurrency, hundreds of MB of data can be flushed in seconds. Is this something specific to the filesystem behaviour, which Ceph cannot influence, like dirty filesystem metadata/inodes which can't be merged into larger IOs? For sequential writes, I would have thought that in an optimum scenario a spinning disk should be able to almost maintain its large-block write speed (100MB/s) no matter the underlying block size. That being said, from what I understand, when a sync is called it will try and flush all dirty data, so the end result is probably slightly different to a traditional battery-backed write-back cache.

Chris, would you be interested in forming a ceph-users based performance team? There's a developer performance meeting which is mainly concerned with improving the internals of Ceph. There is also a raft of information in the mailing list archives where people have said "hey, look at my SSD speed at x, y, z settings", but making comparisons or recommendations is not that easy. It may also reduce a lot of the repetitive posts of "why is X so slow", etc.
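As a rough sketch of the kind of adjustment Greg describes: if the backing disk can sink on the order of 80MB/s, then 10-20 seconds of backing-disk IO is roughly 0.8-1.6GB, so the throttle limits would need to sit around that scale rather than at the ~40MB default. The numbers below are illustrative assumptions for this scenario, not recommendations:

[osd]
filestore min sync interval = 20
filestore max sync interval = 30
# keep the flusher and throttle enabled, but let ~10s of backing-disk IO
# accumulate before flushing starts, and ~20s before writes are throttled
filestore_wbthrottle_xfs_bytes_start_flusher = 838860800    # ~800MB, ~10s at 80MB/s
filestore_wbthrottle_xfs_bytes_hard_limit = 1677721600      # ~1.6GB, ~20s at 80MB/s
filestore_wbthrottle_xfs_ios_start_flusher = 12800          # ~800MB of 64k writes
filestore_wbthrottle_xfs_ios_hard_limit = 25600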
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

I'm not sure if it's something I'm doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSDs straight away instead of coalescing in the journals. Is this correct?

For example, if I create an RBD on a standard 3-way replica pool and run fio via librbd with 128k writes, I see the journals take all the IOs until I hit my filestore_min_sync_interval, and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing), I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I'm running fio, the underlying disks are doing very small write IOs of around 16kb with an occasional big burst of activity.

I know erasure coding + cache tier is slower than plain replicated pools, but even with various high queue depths I'm struggling to get much above 100-150 IOPS, compared to a 3-way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference, and I'm wondering if this strange journal behaviour is the cause.

Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool, it will try and evict an object. That's probably what you're seeing. Cache pools in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once.

-Greg
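On the cache-tier side, the "full pool" point can be illustrated with the pool settings that control when flushing and eviction begin, so the tier starts flushing in the background well before it fills rather than evicting synchronously on every miss. A sketch, with a hypothetical pool name and example values (check your tier's current settings with `ceph osd pool get <pool> <option>`):

# cap the cache tier so the ratios below have a base to work from (example: 1 TiB)
ceph osd pool set hot-cache target_max_bytes 1099511627776

# start flushing dirty objects to the base tier at 40% of the cap...
ceph osd pool set hot-cache cache_target_dirty_ratio 0.4

# ...and start evicting clean objects at 80%, before the tier ever hits full
ceph osd pool set hot-cache cache_target_full_ratio 0.8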
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:

Nothing here particularly surprises me. I don't remember all the details of the filestore's rate limiting off the top of my head, but it goes to great lengths to try and avoid letting the journal get too far ahead of the backing store. Disabling the filestore flusher and increasing the sync intervals without also increasing the filestore_wbthrottle_* limits is not going to work well for you.

-Greg

While very true, and what I recalled (backing store being kicked off early) from earlier mails, I think having every last configuration parameter documented in a way that doesn't reduce people to guesswork would be very helpful.

For example filestore_wbthrottle_xfs_inodes_start_flusher, which defaults to 500. Assuming that this means to start flushing once 500 inodes have accumulated, how would Ceph even know how many inodes are needed for the data present?

Lastly, with these parameters there are xfs and btrfs incarnations, but no ext4. Do the xfs parameters also apply to ext4?

Christian

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: 16 March 2015 17:33
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

I'm not sure if it's something I'm doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSDs straight away instead of coalescing in the journals. Is this correct?

For example, if I create an RBD on a standard 3-way replica pool and run fio via librbd with 128k writes, I see the journals take all the IOs until I hit my filestore_min_sync_interval, and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing), I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I'm running fio, the underlying disks are doing very small write IOs of around 16kb with an occasional big burst of activity.

I know erasure coding + cache tier is slower than plain replicated pools, but even with various high queue depths I'm struggling to get much above 100-150 IOPS, compared to a 3-way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference, and I'm wondering if this strange journal behaviour is the cause.

Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool, it will try and evict an object. That's probably what you're seeing. Cache pools in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once.

-Greg

Hi Greg,

It's not the caching behaviour that I'm confused about; it's the journal behaviour on the base disks during flushing. I've been doing some more tests and can do something reproducible which seems strange to me.

First off, 10MB of 4kb writes:

time ceph tell osd.1 bench 10000000 4096
{ "bytes_written": 10000000, "blocksize": 4096, "bytes_per_sec": 16009426.00 }

real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2x 5MB writes:

time ceph tell osd.1 bench 5000000 4096
time ceph tell osd.1 bench 5000000 4096
{ "bytes_written": 5000000, "blocksize": 4096, "bytes_per_sec": 10580846.00 }

real    0m0.595s
user    0m0.065s
sys     0m0.018s

{ "bytes_written": 5000000, "blocksize": 4096, "bytes_per_sec": 9944252.00 }

real    0m4.412s
user    0m0.053s
sys     0m0.071s

The 2nd bench takes a lot longer, even though both should easily fit in the 5GB journal. Looking at iostat, I think I can see that no writes happen to the journal whilst the writes from the 1st bench are being flushed. Is this the expected behaviour? I would have thought that as long as there is space available in the journal, it shouldn't block on new writes.

Also, I see in iostat writes to the underlying disk happening at a QD of 1 and 16kb IOs for a number of seconds, with a large blip of activity just before the flush finishes. Is this the correct behaviour? I would have thought that if this tell osd bench is doing sequential IO, the journal should be able to flush 5-10MB of data in a fraction of a second.

Ceph.conf:

[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4
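A simple way to watch the journal device and the backing disk side by side during benches like these (device names follow the iostat output quoted later in the thread, journal=sda, data=sdd; adjust to your own layout):

# extended per-device statistics at one-second intervals
iostat -x 1 sda sdd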
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: 16 March 2015 17:33
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

I'm not sure if it's something I'm doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSDs straight away instead of coalescing in the journals. Is this correct?

For example, if I create an RBD on a standard 3-way replica pool and run fio via librbd with 128k writes, I see the journals take all the IOs until I hit my filestore_min_sync_interval, and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing), I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I'm running fio, the underlying disks are doing very small write IOs of around 16kb with an occasional big burst of activity.

I know erasure coding + cache tier is slower than plain replicated pools, but even with various high queue depths I'm struggling to get much above 100-150 IOPS, compared to a 3-way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference, and I'm wondering if this strange journal behaviour is the cause.

Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool, it will try and evict an object. That's probably what you're seeing. Cache pools in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once.

-Greg

Hi Greg,

It's not the caching behaviour that I'm confused about; it's the journal behaviour on the base disks during flushing. I've been doing some more tests and can do something reproducible which seems strange to me.

First off, 10MB of 4kb writes:

time ceph tell osd.1 bench 10000000 4096
{ "bytes_written": 10000000, "blocksize": 4096, "bytes_per_sec": 16009426.00 }

real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2x 5MB writes:

time ceph tell osd.1 bench 5000000 4096
time ceph tell osd.1 bench 5000000 4096
{ "bytes_written": 5000000, "blocksize": 4096, "bytes_per_sec": 10580846.00 }

real    0m0.595s
user    0m0.065s
sys     0m0.018s

{ "bytes_written": 5000000, "blocksize": 4096, "bytes_per_sec": 9944252.00 }

real    0m4.412s
user    0m0.053s
sys     0m0.071s

The 2nd bench takes a lot longer, even though both should easily fit in the 5GB journal. Looking at iostat, I think I can see that no writes happen to the journal whilst the writes from the 1st bench are being flushed. Is this the expected behaviour? I would have thought that as long as there is space available in the journal, it shouldn't block on new writes.

Also, I see in iostat writes to the underlying disk happening at a QD of 1 and 16kb IOs for a number of seconds, with a large blip of activity just before the flush finishes. Is this the correct behaviour? I would have thought that if this tell osd bench is doing sequential IO, the journal should be able to flush 5-10MB of data in a fraction of a second.

Ceph.conf:

[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4

iostat during the period where writes seem to be blocked (journal=sda, disk=sdd):

Device:  rrqm/s wrqm/s  r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdb        0.00   0.00 0.00  2.00   0.00   4.00     4.00     0.00   0.00    0.00    0.00   0.00   0.00
sdc        0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdd        0.00   0.00 0.00 76.00   0.00 760.00    20.00     0.99  13.11    0.00   13.11  13.05  99.20

iostat during what I believe to be the actual flush:

Device:  rrqm/s wrqm/s  r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdb        0.00   0.00 0.00  2.00   0.00   4.00     4.00     0.00   0.00    0.00    0.00   0.00   0.00
sdc        0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
Nothing here particularly surprises me. I don't remember all the details of the filestore's rate limiting off the top of my head, but it goes to great lengths to try and avoid letting the journal get too far ahead of the backing store. Disabling the filestore flusher and increasing the sync intervals without also increasing the filestore_wbthrottle_* limits is not going to work well for you.

-Greg

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum
Sent: 16 March 2015 17:33
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

I'm not sure if it's something I'm doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSDs straight away instead of coalescing in the journals. Is this correct?

For example, if I create an RBD on a standard 3-way replica pool and run fio via librbd with 128k writes, I see the journals take all the IOs until I hit my filestore_min_sync_interval, and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing), I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time. The only other odd thing I can see via iostat is that most of the time whilst I'm running fio, the underlying disks are doing very small write IOs of around 16kb with an occasional big burst of activity.

I know erasure coding + cache tier is slower than plain replicated pools, but even with various high queue depths I'm struggling to get much above 100-150 IOPS, compared to a 3-way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference, and I'm wondering if this strange journal behaviour is the cause.

Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching an object which isn't in the cache pool, it will try and evict an object. That's probably what you're seeing. Cache pools in general are only a wise idea if you have a very skewed distribution of data hotness and the entire hot zone can fit in cache at once.

-Greg

Hi Greg,

It's not the caching behaviour that I'm confused about; it's the journal behaviour on the base disks during flushing. I've been doing some more tests and can do something reproducible which seems strange to me.

First off, 10MB of 4kb writes:

time ceph tell osd.1 bench 10000000 4096
{ "bytes_written": 10000000, "blocksize": 4096, "bytes_per_sec": 16009426.00 }

real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2x 5MB writes:

time ceph tell osd.1 bench 5000000 4096
time ceph tell osd.1 bench 5000000 4096
{ "bytes_written": 5000000, "blocksize": 4096, "bytes_per_sec": 10580846.00 }

real    0m0.595s
user    0m0.065s
sys     0m0.018s

{ "bytes_written": 5000000, "blocksize": 4096, "bytes_per_sec": 9944252.00 }

real    0m4.412s
user    0m0.053s
sys     0m0.071s

The 2nd bench takes a lot longer, even though both should easily fit in the 5GB journal. Looking at iostat, I think I can see that no writes happen to the journal whilst the writes from the 1st bench are being flushed. Is this the expected behaviour? I would have thought that as long as there is space available in the journal, it shouldn't block on new writes.

Also, I see in iostat writes to the underlying disk happening at a QD of 1 and 16kb IOs for a number of seconds, with a large blip of activity just before the flush finishes. Is this the correct behaviour? I would have thought that if this tell osd bench is doing sequential IO, the journal should be able to flush 5-10MB of data in a fraction of a second.

Ceph.conf:

[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4

iostat during the period where writes seem to be blocked (journal=sda, disk=sdd):

Device:  rrqm/s wrqm/s  r/s   w/s  rkB/s  wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdb        0.00   0.00 0.00  2.00   0.00   4.00     4.00     0.00   0.00    0.00    0.00   0.00   0.00
sdc        0.00   0.00 0.00  0.00   0.00   0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdd        0.00   0.00 0.00 76.00   0.00 760.00    20.00     0.99  13.11    0.00   13.11  13.05  99.20

iostat during
[ceph-users] Cache Tier Flush = immediate base tier journal sync?
I'm not sure if it's something I'm doing wrong or just experiencing an oddity, but when my cache tier flushes dirty blocks out to the base tier, the writes seem to hit the OSDs straight away instead of coalescing in the journals. Is this correct?

For example, if I create an RBD on a standard 3-way replica pool and run fio via librbd with 128k writes, I see the journals take all the IOs until I hit my filestore_min_sync_interval, and then I see it start writing to the underlying disks. Doing the same on a full cache tier (to force flushing), I immediately see the base disks at a very high utilisation. The journals also have some write IO at the same time.

The only other odd thing I can see via iostat is that most of the time whilst I'm running fio, the underlying disks are doing very small write IOs of around 16kb with an occasional big burst of activity.

I know erasure coding + cache tier is slower than plain replicated pools, but even with various high queue depths I'm struggling to get much above 100-150 IOPS, compared to a 3-way replica pool which can easily achieve 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked difference, and I'm wondering if this strange journal behaviour is the cause.

Does anyone have any ideas?

Nick
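For anyone wanting to reproduce the librbd test described above, a minimal fio job sketch. It assumes fio was built with rbd support, and the client, pool, and image names are placeholders to adjust for your cluster:

[global]
# drive IO through librbd directly, no kernel mapping needed
ioengine=rbd
clientname=admin
pool=rbd
rbdname=test-image
rw=write
bs=128k
iodepth=32

[rbd-128k-write]

Run it with `fio rbd-128k-write.fio` while watching the journal and base-tier disks in iostat to see where the writes actually land.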