Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-19 Thread Christian Balzer

Hello,

On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

 On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:
  Hi Greg,
 
  Thanks for your input and completely agree that we cannot expect
  developers to fully document what impact each setting has on a
  cluster, particularly in a performance related way
 
  That said, if you or others could spare some time for a few pointers it
  would be much appreciated and I will endeavour to create some useful
  results/documents that are more relevant to end users.
 
  I have taken on board what you said about the WB throttle and have been
  experimenting with it by switching it on and off. I know it's a bit of
  a blunt configuration change, but it was useful to understand its
  effect. With it off, I do see initially quite a large performance
  increase but over time it actually starts to slow the average
  throughput down. Like you said, I am guessing this is to do with it
  making sure the journal doesn't get too far ahead, leaving it with
  massive syncs to carry out.
 
  One thing I do see with the WBT enabled and to some extent with it
  disabled, is that there are large periods of small block writes at the
  max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace
  seekwatcher traces of performing an OSD bench (64kb io's for 500MB)
  where this behaviour can be seen.
 
 If you're doing 64k IOs then I believe it's creating a new on-disk
 file for each of those writes. How that's laid out on-disk will depend
 on your filesystem and the specific config options that we're using to
 try to avoid running too far ahead of the journal.
 
Could you elaborate on that a bit?
I would have expected those 64KB writes to go to the same object (file)
until it is full (4MB).

Because this behavior would explain some (if not all) of the write
amplification I've seen in the past with small writes (see the SSD
Hardware recommendation thread).

Christian

 I think you're just using these config options in conflict with
 each other. You've set the min sync time to 20 seconds for some reason,
 presumably to try and batch stuff up? So in that case you probably
 want to let your journal run for twenty seconds' worth of backing disk
 IO before you start throttling it, and probably 10-20 seconds' worth of
 IO before forcing file flushes. That means increasing the throttle
 limits while still leaving the flusher enabled.
 -Greg
 
 
  http://www.sys-pro.co.uk/misc/wbt_on.png
 
  http://www.sys-pro.co.uk/misc/wbt_off.png
 
  I would really appreciate it if someone could comment on why this type of
  behaviour happens? As can be seen in the trace, if the blocks are
  submitted to the disk as larger IO's and with higher concurrency,
  hundreds of MB of data can be flushed in seconds. Is this something
  specific to the filesystem behaviour which Ceph cannot influence, like
  dirty filesystem metadata/inodes which can't be merged into larger
  IO's?
 
  For sequential writes, I would have thought that in an optimum
  scenario, a spinning disk should be able to almost maintain its large
  block write speed (100MB/s) no matter the underlying block size. That
  being said, from what I understand when a sync is called it will try
  and flush all dirty data so the end result is probably slightly
  different to a traditional battery backed write back cache.
 
  Chris, would you be interested in forming a ceph-users based
  performance team? There's a developer performance meeting which is
  mainly concerned with improving the internals of Ceph. There is also a
  raft of information on the mailing list archives where people have
  said "hey, look at my SSD speed at x,y,z settings", but making
  comparisons or recommendations is not that easy. It may also reduce a
  lot of the repetitive posts of "why is X so slow", etc.
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-19 Thread Nick Fisk
I think this could be part of what I am seeing. I found this post from back in 
2013

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

This seems to describe a workaround for the behaviour I am seeing. 
The constant small block IO I was seeing looks like it was either the pg log 
and info updates or FS metadata. I have been going through the blktraces I did 
today and 90% of the time I am just seeing 8kb writes and journal writes. 

I think the journal and filestore settings I have been adjusting have just 
been moving the data sync around the benchmark timeline and altering when the 
journal starts throttling. It seems that with small IO's the metadata overhead 
takes several times longer than the actual data writing. This probably also 
explains why a full SSD OSD is faster than a HDD+SSD even for brief bursts of 
IO.

In the thread I posted above, it seems that adding something like flashcache 
can massively help overcome this problem, so this is something I might look 
into. It’s a shame I didn't get BBWC with my OSD nodes as this would have also 
likely alleviated this problem with a lot less hassle.


 Ah, no, you're right. With the bench command it all goes into one object,
 it's just a separate transaction for each 64k write. But again depending on
 flusher and throttler settings in the OSD, and the backing FS' configuration,
 it can be a lot of individual updates — in particular, every time there's a
 sync it has to update the inode.
 Certainly that'll be the case in the described configuration, with relatively
 low writeahead limits on the journal but high sync intervals — once you hit
 the limits, every write will get an immediate flush request.
 
 But none of that should have much impact on your write amplification tests
 unless you're actually using osd bench to test it. You're more likely to be
 seeing the overhead of the pg log entry, pg info change, etc. that's
 associated with each write.






Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-19 Thread Gregory Farnum
On Wed, Mar 18, 2015 at 11:10 PM, Christian Balzer ch...@gol.com wrote:

 Hello,

 On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

 On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:
  Hi Greg,
 
  Thanks for your input and completely agree that we cannot expect
  developers to fully document what impact each setting has on a
  cluster, particularly in a performance related way
 
  That said, if you or others could spare some time for a few pointers it
  would be much appreciated and I will endeavour to create some useful
  results/documents that are more relevant to end users.
 
  I have taken on board what you said about the WB throttle and have been
  experimenting with it by switching it on and off. I know it's a bit of
  a blunt configuration change, but it was useful to understand its
  effect. With it off, I do see initially quite a large performance
  increase but over time it actually starts to slow the average
  throughput down. Like you said, I am guessing this is to do with it
  making sure the journal doesn't get too far ahead, leaving it with
  massive syncs to carry out.
 
  One thing I do see with the WBT enabled and to some extent with it
  disabled, is that there are large periods of small block writes at the
  max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace
  seekwatcher traces of performing an OSD bench (64kb io's for 500MB)
  where this behaviour can be seen.

 If you're doing 64k IOs then I believe it's creating a new on-disk
 file for each of those writes. How that's laid out on-disk will depend
 on your filesystem and the specific config options that we're using to
 try to avoid running too far ahead of the journal.

 Could you elaborate on that a bit?
 I would have expected those 64KB writes to go to the same object (file)
 until it is full (4MB).

 Because this behavior would explain some (if not all) of the write
 amplification I've seen in the past with small writes (see the SSD
 Hardware recommendation thread).

Ah, no, you're right. With the bench command it all goes into one
object, it's just a separate transaction for each 64k write. But again
depending on flusher and throttler settings in the OSD, and the
backing FS' configuration, it can be a lot of individual updates — in
particular, every time there's a sync it has to update the inode.
Certainly that'll be the case in the described configuration, with
relatively low writeahead limits on the journal but high sync
intervals — once you hit the limits, every write will get an immediate
flush request.

But none of that should have much impact on your write amplification
tests unless you're actually using osd bench to test it. You're more
likely to be seeing the overhead of the pg log entry, pg info change,
etc. that's associated with each write.
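
To put rough numbers on that per-write overhead (the sizes below are
assumptions for illustration, not measurements), a single small client write
on a 3x replicated pool fans out roughly like this on each replica:

  journal entry             4KB of data plus a header, padded to journal alignment
  filestore data write      4KB into the object file
  pg log + pg info update   a few KB of omap/leveldb traffic per op
  inode/dir metadata        rewritten on every filestore sync

So a 4KB client write can easily turn into 10-30KB of raw disk writes per
replica, i.e. on the order of 10x write amplification for small IO before
replication is even counted.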


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-18 Thread Nick Fisk
Hi Greg,

Thanks for your input; I completely agree that we cannot expect developers
to fully document what impact each setting has on a cluster, particularly in
a performance-related way.

That said, if you or others could spare some time for a few pointers it
would be much appreciated and I will endeavour to create some useful
results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been
experimenting with it by switching it on and off. I know it's a bit of a
blunt configuration change, but it was useful to understand its effect. With
it off, I do see initially quite a large performance increase but over time
it actually starts to slow the average throughput down. Like you said, I am
guessing this is to do with it making sure the journal doesn't get too far
ahead, leaving it with massive syncs to carry out.

One thing I do see with the WBT enabled and to some extent with it disabled,
is that there are large periods of small block writes at the max speed of
the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces
of performing an OSD bench (64kb io's for 500MB) where this behaviour can be
seen.

http://www.sys-pro.co.uk/misc/wbt_on.png

http://www.sys-pro.co.uk/misc/wbt_off.png
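
For anyone wanting to reproduce traces like the two above, a capture along
these lines should produce similar graphs (the device name, duration and
output file names are just examples):

blktrace -d /dev/sdd -o osd-bench -w 60      # trace the OSD data disk for 60 seconds
blkparse -i osd-bench -d osd-bench.bin       # merge the per-CPU trace files into one binary dump
seekwatcher -t osd-bench.bin -o wbt_on.png   # render the seekwatcher graph from the dump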

I would really appreciate it if someone could comment on why this type of
behaviour happens? As can be seen in the trace, if the blocks are submitted
to the disk as larger IO's and with higher concurrency, hundreds of MB of
data can be flushed in seconds. Is this something specific to the filesystem
behaviour which Ceph cannot influence, like dirty filesystem metadata/inodes
which can't be merged into larger IO's? 

For sequential writes, I would have thought that in an optimum scenario, a
spinning disk should be able to almost maintain its large block write speed
(100MB/s) no matter the underlying block size. That being said, from what I
understand when a sync is called it will try and flush all dirty data so the
end result is probably slightly different to a traditional battery backed
write back cache.

Chris, would you be interested in forming a ceph-users based performance
team? There's a developer performance meeting which is mainly concerned with
improving the internals of Ceph. There is also a raft of information on the
mailing list archives where people have said "hey, look at my SSD speed at
x,y,z settings", but making comparisons or recommendations is not that easy.
It may also reduce a lot of the repetitive posts of "why is X so
slow", etc.

Nick

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 23:57
 To: Christian Balzer
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?
 
 On Mon, Mar 16, 2015 at 4:46 PM, Christian Balzer ch...@gol.com wrote:
  On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:
 
  Nothing here particularly surprises me. I don't remember all the
  details of the filestore's rate limiting off the top of my head, but
  it goes to great lengths to try and avoid letting the journal get too
  far ahead of the backing store. Disabling the filestore flusher and
  increasing the sync intervals without also increasing the
  filestore_wbthrottle_* limits is not going to work well for you.
  -Greg
 
  While very true and what I recalled (backing store being kicked off
  early) from earlier mails, I think having every last configuration
  parameter documented in a way that doesn't reduce people to guesswork
  would be very helpful.
 
 PRs welcome! ;)
 
 More seriously, we create a lot of config options and it's not always clear
 when doing so which ones should be changed by users or not. And a lot of
 them (case in point: anything to do with changing journal and FS
 interactions) should only be changed by people who really understand them,
 because it's possible (as evidenced) to really bust up your cluster's
 performance enough that it's basically broken.
 Historically that's meant people who can read the code and understand it,
 although we might now have enough people at a mid-line that it's worth
 going back and documenting. There's not a lot of pressure coming from
 anybody to do that work in comparison to other stuff like "make CephFS
 supported" and "make RADOS faster" though, for understandable reasons.
 So while we can try and document these things some in future, the names of
 things here are really pretty self-explanatory and the sort of configuration
 reference guide I think you're asking for (i.e., here are all the settings
 to change if you are running on SSDs, and here's how they're related) is not
 the kind of thing that developers produce. That comes out of the community
 or is produced by support contracts.
 
 ...so I guess I've circled back around to PRs welcome!
 
  For example, filestore_wbthrottle_xfs_inodes_start_flusher, which
  defaults to 500.
  Assuming that this means to start flushing once 500 inodes have
  accumulated, how would Ceph even know how many inodes are needed for the
  data present?

Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-18 Thread Gregory Farnum
On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:
 Hi Greg,

 Thanks for your input and completely agree that we cannot expect developers
 to fully document what impact each setting has on a cluster, particularly in
 a performance related way

 That said, if you or others could spare some time for a few pointers it
 would be much appreciated and I will endeavour to create some useful
 results/documents that are more relevant to end users.

 I have taken on board what you said about the WB throttle and have been
 experimenting with it by switching it on and off. I know it's a bit of a
 blunt configuration change, but it was useful to understand its effect. With
 it off, I do see initially quite a large performance increase but over time
 it actually starts to slow the average throughput down. Like you said, I am
 guessing this is to do with it making sure the journal doesn't get too far
 ahead, leaving it with massive syncs to carry out.

 One thing I do see with the WBT enabled and to some extent with it disabled,
 is that there are large periods of small block writes at the max speed of
 the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces
 of performing an OSD bench (64kb io's for 500MB) where this behaviour can be
 seen.

If you're doing 64k IOs then I believe it's creating a new on-disk
file for each of those writes. How that's laid out on-disk will depend
on your filesystem and the specific config options that we're using to
try to avoid running too far ahead of the journal.

I think you're just using these config options in conflict with
each other. You've set the min sync time to 20 seconds for some reason,
presumably to try and batch stuff up? So in that case you probably
want to let your journal run for twenty seconds' worth of backing disk
IO before you start throttling it, and probably 10-20 seconds' worth of
IO before forcing file flushes. That means increasing the throttle
limits while still leaving the flusher enabled.
-Greg
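
As a concrete sketch of the above (the option names are the filestore
wbthrottle settings from this era; the values are illustrative guesses sized
for roughly 20 seconds of a single 7.2k disk's IO, not tested
recommendations):

[osd]
filestore max sync interval = 30
filestore min sync interval = 20
# keep the flusher/throttle enabled, just raise the limits
filestore_wbthrottle_enable = true
filestore_wbthrottle_xfs_ios_start_flusher = 1600
filestore_wbthrottle_xfs_ios_hard_limit = 3200
filestore_wbthrottle_xfs_inodes_start_flusher = 1600
filestore_wbthrottle_xfs_inodes_hard_limit = 3200
filestore_wbthrottle_xfs_bytes_start_flusher = 2147483648
filestore_wbthrottle_xfs_bytes_hard_limit = 4294967296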


 http://www.sys-pro.co.uk/misc/wbt_on.png

 http://www.sys-pro.co.uk/misc/wbt_off.png

 I would really appreciate it if someone could comment on why this type of
 behaviour happens? As can be seen in the trace, if the blocks are submitted
 to the disk as larger IO's and with higher concurrency, hundreds of MB of
 data can be flushed in seconds. Is this something specific to the filesystem
 behaviour which Ceph cannot influence, like dirty filesystem metadata/inodes
 which can't be merged into larger IO's?

 For sequential writes, I would have thought that in an optimum scenario, a
 spinning disk should be able to almost maintain its large block write speed
 (100MB/s) no matter the underlying block size. That being said, from what I
 understand when a sync is called it will try and flush all dirty data so the
 end result is probably slightly different to a traditional battery backed
 write back cache.

 Chris, would you be interested in forming a ceph-users based performance
 team? There's a developer performance meeting which is mainly concerned with
 improving the internals of Ceph. There is also a raft of information on the
 mailing list archives where people have said "hey, look at my SSD speed at
 x,y,z settings", but making comparisons or recommendations is not that easy.
 It may also reduce a lot of the repetitive posts of "why is X so
 slow", etc.


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:

 I’m not sure if it’s something I’m doing wrong or just experiencing an 
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the 
 writes seem to hit the OSD’s straight away instead of coalescing in the 
 journals, is this correct?

 For example if I create a RBD on a standard 3 way replica pool and run fio 
 via librbd 128k writes, I see the journals take all the io’s until I hit my 
 filestore_min_sync_interval and then I see it start writing to the underlying 
 disks.

 Doing the same on a full cache tier (to force flushing)  I immediately see 
 the base disks at a very high utilisation. The journals also have some write 
 IO at the same time. The only other odd thing I can see via iostat is that 
 most of the time whilst I’m running Fio, is that I can see the underlying 
 disks doing very small write IO’s of around 16kb with an occasional big burst 
 of activity.

 I know erasure coding+cache tier is slower than just plain replicated pools, 
 but even with various high queue depths I’m struggling to get much above 
 100-150 iops compared to a 3 way replica pool which can easily achieve 
 1000-1500. The base tier is comprised of 40 disks. It seems quite a marked 
 difference and I’m wondering if this strange journal behaviour is the cause.

 Does anyone have any ideas?

If you're running a full cache pool, then on every operation touching
an object which isn't in the cache pool it will try and evict an
object. That's probably what you're seeing.

Cache pools in general are only a wise idea if you have a very skewed
distribution of data hotness and the entire hot zone can fit in
cache at once.
-Greg
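
For reference, the pool settings that control when a writeback cache tier
starts flushing and evicting look roughly like this (the pool names and
thresholds are made-up examples and would need sizing against the real cache
capacity):

ceph osd tier add base-pool cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay base-pool cache-pool
ceph osd pool set cache-pool target_max_bytes 1099511627776   # ~1TB of cached data
ceph osd pool set cache-pool cache_target_dirty_ratio 0.4     # start flushing dirty objects at 40% full
ceph osd pool set cache-pool cache_target_full_ratio 0.8      # start evicting clean objects at 80% full

Keeping cache_target_full_ratio comfortably below 1.0 is what avoids the
evict-on-every-operation behaviour described above.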


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Christian Balzer
On Mon, 16 Mar 2015 16:09:12 -0700 Gregory Farnum wrote:

 Nothing here particularly surprises me. I don't remember all the
 details of the filestore's rate limiting off the top of my head, but
 it goes to great lengths to try and avoid letting the journal get too
 far ahead of the backing store. Disabling the filestore flusher and
 increasing the sync intervals without also increasing the
 filestore_wbthrottle_* limits is not going to work well for you.
 -Greg
 
While very true and what I recalled (backing store being kicked off early)
from earlier mails, I think having every last configuration parameter
documented in a way that doesn't reduce people to guesswork would be very
helpful.

For example, filestore_wbthrottle_xfs_inodes_start_flusher, which defaults
to 500.
Assuming that this means to start flushing once 500 inodes have
accumulated, how would Ceph even know how many inodes are needed for the
data present?

Lastly, with these parameters there are xfs and btrfs incarnations, but no
ext4.
Do the xfs parameters also apply to ext4?
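
One way to see this whole family of options and their current values on a
running OSD is the admin socket (osd.0 is just an example ID); values can
also be changed there at runtime for experimentation, although such changes
do not survive a restart:

ceph daemon osd.0 config show | grep filestore_wbthrottle
ceph daemon osd.0 config set filestore_wbthrottle_xfs_inodes_start_flusher 5000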

Christian

 On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:
 
 
 
 
  -Original Message-
  From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
  Of Gregory Farnum
  Sent: 16 March 2015 17:33
  To: Nick Fisk
  Cc: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier
  journal sync?
 
  On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
  
   I’m not sure if it’s something I’m doing wrong or just experiencing
   an
  oddity, but when my cache tier flushes dirty blocks out to the base
  tier, the writes seem to hit the OSD’s straight away instead of
  coalescing in the journals, is this correct?
  
   For example if I create a RBD on a standard 3 way replica pool and
   run fio
  via librbd 128k writes, I see the journals take all the io’s until I
  hit my filestore_min_sync_interval and then I see it start writing to
  the underlying disks.
  
   Doing the same on a full cache tier (to force flushing)  I
   immediately see the
  base disks at a very high utilisation. The journals also have some
  write IO at the same time. The only other odd thing I can see via
  iostat is that most of the time whilst I’m running Fio, is that I can
  see the underlying disks doing very small write IO’s of around 16kb
  with an occasional big burst of activity.
  
   I know erasure coding+cache tier is slower than just plain
   replicated pools,
  but even with various high queue depths I’m struggling to get much
  above 100-150 iops compared to a 3 way replica pool which can easily
  achieve 1000- 1500. The base tier is comprised of 40 disks. It seems
  quite a marked difference and I’m wondering if this strange journal
  behaviour is the cause.
  
   Does anyone have any ideas?
 
  If you're running a full cache pool, then on every operation touching
  an object which isn't in the cache pool it will try and evict an
  object. That's probably what you're seeing.
 
   Cache pools in general are only a wise idea if you have a very skewed
  distribution of data hotness and the entire hot zone can fit in
  cache at once.
  -Greg
 
  Hi Greg,
 
  It's not the caching behaviour that I'm confused about, it's the journal
  behaviour on the base disks during flushing. I've been doing some more
  tests and have something reproducible which seems strange to me.
 
  First off 10MB of 4kb writes:
  time ceph tell osd.1 bench 10000000 4096
  { "bytes_written": 10000000,
    "blocksize": 4096,
    "bytes_per_sec": 16009426.00}
 
  real    0m0.760s
  user    0m0.063s
  sys     0m0.022s
 
  Now split this into 2x5mb writes:
  time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
  { "bytes_written": 5000000,
    "blocksize": 4096,
    "bytes_per_sec": 10580846.00}
 
  real    0m0.595s
  user    0m0.065s
  sys     0m0.018s
  { "bytes_written": 5000000,
    "blocksize": 4096,
    "bytes_per_sec": 9944252.00}
 
  real    0m4.412s
  user    0m0.053s
  sys     0m0.071s
 
  2nd bench takes a lot longer even though both should easily fit in the
  5GB journal. Looking at iostat, I think I can see that no writes
  happen to the journal whilst the writes from the 1st bench are being
  flushed. Is this the expected behaviour? I would have thought as long
  as there is space available in the journal it shouldn't block on new
  writes. Also I see in iostat writes to the underlying disk happening
  at a QD of 1 and 16kb IO's for a number of seconds, with a large blip
  of activity just before the flush finishes. Is this the correct
  behaviour? I would have thought if this tell osd bench is doing
  sequential IO then the journal should be able to flush 5-10MB of data
  in a fraction of a second.
 
  Ceph.conf
  [osd]
  filestore max sync interval = 30
  filestore min sync interval = 20
  filestore flusher = false
  osd_journal_size = 5120
  osd_crush_location_hook = /usr/local/bin/crush-location

Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 17:33
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?
 
 On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 
  I’m not sure if it’s something I’m doing wrong or just experiencing an
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the
 writes seem to hit the OSD’s straight away instead of coalescing in the
 journals, is this correct?
 
  For example if I create a RBD on a standard 3 way replica pool and run fio
 via librbd 128k writes, I see the journals take all the io’s until I hit my
 filestore_min_sync_interval and then I see it start writing to the underlying
 disks.
 
  Doing the same on a full cache tier (to force flushing)  I immediately see 
  the
 base disks at a very high utilisation. The journals also have some write IO at
 the same time. The only other odd thing I can see via iostat is that most of
 the time whilst I’m running Fio, is that I can see the underlying disks doing
 very small write IO’s of around 16kb with an occasional big burst of activity.
 
  I know erasure coding+cache tier is slower than just plain replicated pools,
 but even with various high queue depths I’m struggling to get much above
 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
 1500. The base tier is comprised of 40 disks. It seems quite a marked
 difference and I’m wondering if this strange journal behaviour is the cause.
 
  Does anyone have any ideas?
 
 If you're running a full cache pool, then on every operation touching an
 object which isn't in the cache pool it will try and evict an object. That's
 probably what you're seeing.
 
 Cache pools in general are only a wise idea if you have a very skewed
 distribution of data hotness and the entire hot zone can fit in cache at
 once.
 -Greg

Hi Greg,

It's not the caching behaviour that I'm confused about, it's the journal 
behaviour on the base disks during flushing. I've been doing some more tests 
and have something reproducible which seems strange to me. 

First off 10MB of 4kb writes:
time ceph tell osd.1 bench 10000000 4096
{ "bytes_written": 10000000,
  "blocksize": 4096,
  "bytes_per_sec": 16009426.00}

real    0m0.760s
user    0m0.063s
sys     0m0.022s

Now split this into 2x5mb writes:
time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
{ "bytes_written": 5000000,
  "blocksize": 4096,
  "bytes_per_sec": 10580846.00}

real    0m0.595s
user    0m0.065s
sys     0m0.018s
{ "bytes_written": 5000000,
  "blocksize": 4096,
  "bytes_per_sec": 9944252.00}

real    0m4.412s
user    0m0.053s
sys     0m0.071s

2nd bench takes a lot longer even though both should easily fit in the 5GB 
journal. Looking at iostat, I think I can see that no writes happen to the 
journal whilst the writes from the 1st bench are being flushed. Is this the 
expected behaviour? I would have thought as long as there is space available in 
the journal it shouldn't block on new writes. Also I see in iostat writes to 
the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
seconds, with a large blip of activity just before the flush finishes. Is this 
the correct behaviour? I would have thought if this tell osd bench is doing 
sequential IO then the journal should be able to flush 5-10MB of data in a 
fraction of a second.
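
One way to see whether it is the journal or the filestore apply queue that is
blocking during that gap is to poll the OSD's perf counters alongside iostat
(osd.1 matches the bench above; the counter names are the filestore/journal
counters as I remember them from this era, so treat them as an assumption):

ceph daemon osd.1 perf dump | python -m json.tool | \
    grep -E '"journal_queue_bytes"|"journal_bytes"|"op_queue_bytes"|"committing"'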

Ceph.conf
[osd]
filestore max sync interval = 30
filestore min sync interval = 20
filestore flusher = false
osd_journal_size = 5120
osd_crush_location_hook = /usr/local/bin/crush-location
osd_op_threads = 5
filestore_op_threads = 4


iostat during period where writes seem to be blocked (journal=sda disk=sdd)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    2.00     0.00     4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00   76.00     0.00   760.00    20.00     0.99   13.11    0.00   13.11  13.05  99.20

iostat during what I believe to be the actual flush

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    2.00     0.00     4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-16 Thread Gregory Farnum
Nothing here particularly surprises me. I don't remember all the
details of the filestore's rate limiting off the top of my head, but
it goes to great lengths to try and avoid letting the journal get too
far ahead of the backing store. Disabling the filestore flusher and
increasing the sync intervals without also increasing the
filestore_wbthrottle_* limits is not going to work well for you.
-Greg

On Mon, Mar 16, 2015 at 3:58 PM, Nick Fisk n...@fisk.me.uk wrote:




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Gregory Farnum
 Sent: 16 March 2015 17:33
 To: Nick Fisk
 Cc: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Cache Tier Flush = immediate base tier journal
 sync?

 On Wed, Mar 11, 2015 at 2:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 
  I’m not sure if it’s something I’m doing wrong or just experiencing an
 oddity, but when my cache tier flushes dirty blocks out to the base tier, the
 writes seem to hit the OSD’s straight away instead of coalescing in the
 journals, is this correct?
 
  For example if I create a RBD on a standard 3 way replica pool and run fio
 via librbd 128k writes, I see the journals take all the io’s until I hit my
 filestore_min_sync_interval and then I see it start writing to the underlying
 disks.
 
  Doing the same on a full cache tier (to force flushing)  I immediately see 
  the
 base disks at a very high utilisation. The journals also have some write IO 
 at
 the same time. The only other odd thing I can see via iostat is that most of
 the time whilst I’m running Fio, is that I can see the underlying disks doing
 very small write IO’s of around 16kb with an occasional big burst of 
 activity.
 
  I know erasure coding+cache tier is slower than just plain replicated 
  pools,
 but even with various high queue depths I’m struggling to get much above
 100-150 iops compared to a 3 way replica pool which can easily achieve 1000-
 1500. The base tier is comprised of 40 disks. It seems quite a marked
 difference and I’m wondering if this strange journal behaviour is the cause.
 
  Does anyone have any ideas?

 If you're running a full cache pool, then on every operation touching an
 object which isn't in the cache pool it will try and evict an object. That's
 probably what you're seeing.

 Cache pools in general are only a wise idea if you have a very skewed
 distribution of data hotness and the entire hot zone can fit in cache at
 once.
 -Greg

 Hi Greg,

 It's not the caching behaviour that I'm confused about, it's the journal
 behaviour on the base disks during flushing. I've been doing some more tests
 and have something reproducible which seems strange to me.

 First off 10MB of 4kb writes:
 time ceph tell osd.1 bench 10000000 4096
 { "bytes_written": 10000000,
   "blocksize": 4096,
   "bytes_per_sec": 16009426.00}

 real    0m0.760s
 user    0m0.063s
 sys     0m0.022s

 Now split this into 2x5mb writes:
 time ceph tell osd.1 bench 5000000 4096 && time ceph tell osd.1 bench 5000000 4096
 { "bytes_written": 5000000,
   "blocksize": 4096,
   "bytes_per_sec": 10580846.00}

 real    0m0.595s
 user    0m0.065s
 sys     0m0.018s
 { "bytes_written": 5000000,
   "blocksize": 4096,
   "bytes_per_sec": 9944252.00}

 real    0m4.412s
 user    0m0.053s
 sys     0m0.071s

 2nd bench takes a lot longer even though both should easily fit in the 5GB 
 journal. Looking at iostat, I think I can see that no writes happen to the 
 journal whilst the writes from the 1st bench are being flushed. Is this the 
 expected behaviour? I would have thought as long as there is space available 
 in the journal it shouldn't block on new writes. Also I see in iostat writes 
 to the underlying disk happening at a QD of 1 and 16kb IO's for a number of 
 seconds, with a large blip of activity just before the flush finishes. Is
 this the correct behaviour? I would have thought if this tell osd bench is
 doing sequential IO then the journal should be able to flush 5-10MB of data
 in a fraction of a second.

 Ceph.conf
 [osd]
 filestore max sync interval = 30
 filestore min sync interval = 20
 filestore flusher = false
 osd_journal_size = 5120
 osd_crush_location_hook = /usr/local/bin/crush-location
 osd_op_threads = 5
 filestore_op_threads = 4


 iostat during period where writes seem to be blocked (journal=sda disk=sdd)

 Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdb               0.00     0.00    0.00    2.00     0.00     4.00     4.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
 sdd               0.00     0.00    0.00   76.00     0.00   760.00    20.00     0.99   13.11    0.00   13.11  13.05  99.20

 iostat during

[ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-11 Thread Nick Fisk
I'm not sure if it's something I'm doing wrong or just experiencing an
oddity, but when my cache tier flushes dirty blocks out to the base tier,
the writes seem to hit the OSD's straight away instead of coalescing in the
journals, is this correct?

For example if I create a RBD on a standard 3 way replica pool and run fio
via librbd 128k writes, I see the journals take all the io's until I hit my
filestore_min_sync_interval and then I see it start writing to the
underlying disks.
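
For reference, the sort of fio job meant here looks something like the
following (the pool and image names are examples, and it assumes fio was
built with the rbd ioengine):

[rbd-128k-write]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=write
bs=128k
iodepth=32
runtime=120
time_based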

Doing the same on a full cache tier (to force flushing), I immediately see
the base disks at a very high utilisation. The journals also have some write
IO at the same time. The only other odd thing I can see via iostat is that
most of the time whilst I'm running Fio, is that I can see the underlying
disks doing very small write IO's of around 16kb with an occasional big
burst of activity.

I know erasure coding+cache tier is slower than just plain replicated pools,
but even with various high queue depths I'm struggling to get much above
100-150 iops compared to a 3 way replica pool which can easily achieve
1000-1500. The base tier is comprised of 40 disks. It seems quite a marked
difference and I'm wondering if this strange journal behaviour is the cause.

Does anyone have any ideas?

Nick

