Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Christian Balzer

Hello,

On Wed, 20 Aug 2014 15:39:11 +0100 Hugo Mills wrote:

We have a ceph system here, and we're seeing performance regularly
 descend into unusability for periods of minutes at a time (or longer).
 This appears to be triggered by writing large numbers of small files.
 
Specifications:
 
 ceph 0.80.5
 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2
 threads) 
 2 machines running primary and standby MDS
 3 monitors on the same machines as the OSDs
 Infiniband to about 8 CephFS clients (headless, in the machine room)
 Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
machines, in the analysis lab)
 
Please let us know the CPU and memory specs of the OSD nodes as well.
And the replication factor, I presume 3 if you value that data.
Also the PG and PGP values for the pool(s) you're using.
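Something like the following should give us those numbers (the pool name below
is just a placeholder for your actual data/metadata pools):

  ceph osd dump | grep pool        # replicated size, pg_num, pgp_num per pool
  ceph osd pool get <poolname> size
  ceph osd pool get <poolname> pg_num
  ceph osd pool get <poolname> pgp_num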

The cluster stores home directories of the users and a larger area
 of scientific data (approx 15 TB) which is being processed and
 analysed by the users of the cluster.
 
We have a relatively small number of concurrent users (typically
 4-6 at most), who use GUI tools to examine their data, and then
 complex sets of MATLAB scripts to process it, with processing often
 being distributed across all the machines using Condor.
 
It's not unusual to see the analysis scripts write out large
 numbers (thousands, possibly tens or hundreds of thousands) of small
 files, often from many client machines at once in parallel. When this
 happens, the ceph cluster becomes almost completely unresponsive for
 tens of seconds (or even for minutes) at a time, until the writes are
 flushed through the system. Given the nature of modern GUI desktop
 environments (often reading and writing small state files in the
 user's home directory), this means that desktop interactiveness and
 responsiveness for all the other users of the cluster suffer.
 
1-minute load on the servers typically peaks at about 8 during
 these events (on 4-core machines). Load on the clients also peaks
 high, because of the number of processes waiting for a response from
 the FS. The MDS shows little sign of stress -- it seems to be entirely
 down to the OSDs. ceph -w shows requests blocked for more than 10
 seconds, and in bad cases, ceph -s shows up to many hundreds of
 requests blocked for more than 32s.
 
We've had to turn off scrubbing and deep scrubbing completely --
 except between 01.00 and 04.00 every night -- because it triggers the
 exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
 up to 7 PGs being scrubbed, as it did on Monday, it's completely
 unusable.
 
Note that I know nothing about CephFS, and while there are probably tunables,
the slow requests you're seeing and the hardware described above definitely
suggest overloaded (i.e. slow) OSDs.

Now with a replication factor of 3, your total sustained cluster performance
is that of just 6 disks, and 4 TB drives are never speed wonders. On top of
that come the latency overheads from the network, which should be minimal in
your case, though.

You wrote that your old NFS setup (cluster?) had twice the spindles; if that
means 36 disks, it was quite a bit faster than what you have now.

A cluster I'm just building with 3 nodes, 4 journal SSDs and 8 OSD HDDs
per node can do about 7000 write IOPS (4KB), so I would expect yours to be
worse off.

Having the journals on dedicated partitions instead of files on the rootfs
would not only be faster (though probably not significantly so), but also
prevent any potential failures based on FS corruption.
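A quick way to see how each journal is currently set up (assuming the standard
OSD data path) is to look at the journal entry in each OSD data directory:

  ls -l /var/lib/ceph/osd/ceph-*/journal
  # a symlink to a partition/block device = raw journal,
  # a regular file = journal on a filesystem (your current setup)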

The SSD journals will compensate for some spikes of high IOPS, but a burst of
small files on the scale you describe is clearly beyond that.

Putting lots of RAM (relatively cheap these days) into the OSD nodes has
the big benefit that reads of hot objects will not have to go to disk and
thus compete with write IOPS.

Is this problem something that's often seen? If so, what are the
 best options for mitigation or elimination of the problem? I've found
 a few references to issue #6278 [1], but that seems to be referencing
 scrub specifically, not ordinary (if possibly pathological) writes.
 
You need to match your cluster to your workload.
Aside from tuning things (which tends to have limited effects), you can
either scale out by adding more servers or scale up by using faster
storage and/or a cache pool.

What are the sorts of things I should be looking at to work out
 where the bottleneck(s) are? I'm a bit lost about how to drill down
 into the ceph system for identifying performance issues. Is there a
 useful guide to tools somewhere?
 
Reading/scouring this ML can be quite helpful. 

Watch your OSD nodes (all of them!) with iostat or preferably atop (which
will also show you how your CPUs and network are doing) while running the
tests below.
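If you don't have atop handy, plain iostat in extended mode shows most of the
picture (the interval is just an example):

  iostat -xm 5
  # watch %util and await on the OSD disks and on the journal SSD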

To get a baseline do:
rados -p pool-in-question bench 60 write -t 64
This will test your throughput most of all and due to the 4MB block size
spread the load very equally amongst the OSDs.
During that test you should see all OSDs more or less equally busy.
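To stress IOPS rather than throughput (closer to your small file problem), you
can also run a small block size bench and a random read pass afterwards --
flags from memory, please double-check with rados --help on 0.80.x:

  rados -p pool-in-question bench 60 write -t 64 -b 4096 --no-cleanup
  rados -p pool-in-question bench 60 rand -t 64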

Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Dan Van Der Ster
Hi Hugo,

On 20 Aug 2014, at 17:54, Hugo Mills h.r.mi...@reading.ac.uk wrote:

 What are you using for OSD journals?
 
   On each machine, the three OSD journals live on the same ext4
 filesystem on an SSD, which is also the root filesystem of the
 machine.
 
 Also check the CPU usage for the mons and osds...
 
   The mons are doing pretty much nothing in terms of CPU, as far as I
 can see. I will double-check during an incident.
 
 Does your hardware provide enough IOPS for what your users need?
 (e.g. what is the op/s from ceph -w)
 
   Not really an answer to your question, but: Before the ceph cluster
 went in, we were running the system on two 5-year-old NFS servers for
 a while. We have about half the total number of spindles that we used
 to, but more modern drives.

NFS exported async or sync? If async, it can’t be compared to CephFS. Also, if 
those NFS servers had RAID cards with a wb-cache, it can’t really be compared.

 
   I'll look at how the op/s values change when we have the problem.
 At the moment (with what I assume to be normal desktop usage from the
 3-4 users in the lab), they're flapping wildly somewhere around a
 median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
 read and write.


Another tunable to look at is the filestore max sync interval — in my
experience the colocated journal/OSD setup suffers with the default (5s, IIRC),
especially when an OSD is getting a constant stream of writes. When this
happens, the disk heads are constantly seeking back and forth between
synchronously writing to the journal and flushing the outstanding writes. If
you had a dedicated (spinning) disk for the journal, the synchronous journal
writes could be done sequentially (thus quickly) and the flushes would also be
quicker. SSD journals can obviously also help with this.

For a short test I would try increasing filestore max sync interval to 30s or 
maybe even 60s to see if it helps. (I know that at least one of the Inktank 
experts advises against changing the filestore max sync interval — but in my
experience 5s is much too short for the colocated journal setup.) You need to 
make sure your journals are large enough to store 30/60s of writes, but when 
you have predominantly small writes even a few GB of journal ought to be 
enough. 
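As a concrete sketch of what I mean (parameter names and values should be
verified against your release; the journal size line only matters if you
recreate the journals):

  [osd]
      filestore max sync interval = 30
      ; osd journal size = 5120    ; MB, only applied when a journal is (re)created

or, for a quick runtime test:

  ceph tell osd.\* injectargs '--filestore_max_sync_interval 30'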

Cheers, Dan


Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
   Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:
Some questions below I can't answer immediately, but I'll spend
 tomorrow morning irritating people by triggering these events (I think
 I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
 files in it) and giving you more details. 

   Yes, the tarball with the 25 small files in it is definitely a
reproducer.

[snip]
  What about iostat on the OSDs — are your OSD disks busy reading or
  writing during these incidents?
 
Not sure. I don't think so, but I'll try to trigger an incident and
 report back on this one.

   Mostly writing. I'm seeing figures of up to about 2-3 MB/s writes,
and 200-300 kB/s reads on all three, but it fluctuates a lot (with
5-second intervals). Sample data at the end of the email.

  What are you using for OSD journals?
 
On each machine, the three OSD journals live on the same ext4
 filesystem on an SSD, which is also the root filesystem of the
 machine.
 
  Also check the CPU usage for the mons and osds...
 
The mons are doing pretty much nothing in terms of CPU, as far as I
 can see. I will double-check during an incident.

   The mons are just ticking over with a 1% CPU usage.

  Does your hardware provide enough IOPS for what your users need?
  (e.g. what is the op/s from ceph -w)
 
Not really an answer to your question, but: Before the ceph cluster
 went in, we were running the system on two 5-year-old NFS servers for
 a while. We have about half the total number of spindles that we used
 to, but more modern drives.
 
I'll look at how the op/s values change when we have the problem.
 At the moment (with what I assume to be normal desktop usage from the
 3-4 users in the lab), they're flapping wildly somewhere around a
 median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
 read and write.

   With minimal users and one machine running the tar unpacking
process, I'm getting somewhere around 100-200 op/s on the ceph
cluster, but interactivity on the desktop machine I'm logged in on is
horrible -- I'm frequently getting tens of seconds of latency. Compare
that to the (relatively) comfortable 350-400 op/s we had yesterday
with what were probably workloads with larger files.

  If disabling deep scrub helps, then it might be that something else
  is reading the disks heavily. One thing to check is updatedb — we
  had to disable it from indexing /var/lib/ceph on our OSDs.
 
I haven't seen that running at all during the day, but I'll look
 into it.

   No, it's not anything like that -- iotop reports pretty much the
only things doing IO are ceph-osd and the occasional xfsaild.

   Hugo.

Hugo.
 
  Best Regards,
  Dan
  
  -- Dan van der Ster || Data & Storage Services || CERN IT Department --
  
  
  On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:
  
 We have a ceph system here, and we're seeing performance regularly
   descend into unusability for periods of minutes at a time (or longer).
   This appears to be triggered by writing large numbers of small files.
   
 Specifications:
   
   ceph 0.80.5
   6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
   2 machines running primary and standby MDS
   3 monitors on the same machines as the OSDs
   Infiniband to about 8 CephFS clients (headless, in the machine room)
   Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
 machines, in the analysis lab)
   
 The cluster stores home directories of the users and a larger area
   of scientific data (approx 15 TB) which is being processed and
   analysed by the users of the cluster.
   
 We have a relatively small number of concurrent users (typically
   4-6 at most), who use GUI tools to examine their data, and then
   complex sets of MATLAB scripts to process it, with processing often
   being distributed across all the machines using Condor.
   
 It's not unusual to see the analysis scripts write out large
   numbers (thousands, possibly tens or hundreds of thousands) of small
   files, often from many client machines at once in parallel. When this
   happens, the ceph cluster becomes almost completely unresponsive for
   tens of seconds (or even for minutes) at a time, until the writes are
   flushed through the system. Given the nature of modern GUI desktop
   environments (often reading and writing small state files in the
   user's home directory), this means that desktop interactiveness and
   responsiveness for all the other users of the cluster suffer.
   
 1-minute load on the servers typically peaks at about 8 during
   these events (on 4-core machines). Load on the clients also peaks
   high, because of the number of processes waiting for a response from
   the FS. The MDS shows little sign of stress -- it seems to be entirely
   down to the OSDs. ceph -w shows requests blocked for more than 10
   seconds, 

Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
On Thu, Aug 21, 2014 at 07:40:45AM +, Dan Van Der Ster wrote:
 On 20 Aug 2014, at 17:54, Hugo Mills h.r.mi...@reading.ac.uk wrote:
  Does your hardware provide enough IOPS for what your users need?
  (e.g. what is the op/s from ceph -w)
  
Not really an answer to your question, but: Before the ceph cluster
  went in, we were running the system on two 5-year-old NFS servers for
  a while. We have about half the total number of spindles that we used
  to, but more modern drives.
 
 NFS exported async or sync? If async, it can’t be compared to
 CephFS. Also, if those NFS servers had RAID cards with a wb-cache,
 it can’t really be compared.

   Hmm. Yes, async. Probably wouldn't have been my choice... (I only
started working with this system recently -- about the same time that
the ceph cluster was deployed to replace the older machines. I haven't
had much of say in what's implemented here, but I have to try to
support it.)

   I'm tempted to put the users' home directories back on an NFS
server, and keep ceph for the research data. That at least should give
us more in the way of interactivity (which is the main thing I'm
getting complaints about).

I'll look at how the op/s values change when we have the problem.
  At the moment (with what I assume to be normal desktop usage from the
  3-4 users in the lab), they're flapping wildly somewhere around a
  median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
  read and write.

 Another tunable to look at is the filestore max sync interval — in
 my experience the colocated journal/OSD setup suffers with the
 default (5s, IIRC), especially when an OSD is getting a constant
 stream of writes. When this happens, the disk heads are constantly
 seeking back and forth between synchronously writing to the journal
 and flushing the outstanding writes. If you had a dedicated (spinning)
 disk for the journal, the synchronous journal writes could be done
 sequentially (thus quickly) and the flushes would also be quicker. SSD
 journals can obviously also help with this.

   Not sure what you mean about colocated journal/OSD. The journals
aren't on the same device as the OSDs. However, all three journals on
each machine are on the same SSD.

 For a short test I would try increasing filestore max sync interval
 to 30s or maybe even 60s to see if it helps. (I know that at least
 one of the Inktank experts advise against changing the filestore max
 sync interval — but in my experience 5s is much too short for the
 colocated journal setup.) You need to make sure your journals are
 large enough to store 30/60s of writes, but when you have
 predominantly small writes even a few GB of journal ought to be
 enough.

   I'll have a play with that.

   Thanks for all the help so far -- it's been useful. I'm learning
what the right kind of questions are.

   Hugo.

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building


Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Dan Van Der Ster
Hi Hugo,

On 21 Aug 2014, at 14:17, Hugo Mills h.r.mi...@reading.ac.uk wrote:

 
   Not sure what you mean about colocated journal/OSD. The journals
 aren't on the same device as the OSDs. However, all three journals on
 each machine are on the same SSD.

*embarrassed* I obviously didn’t drink enough coffee this morning. I read your
reply as something like “… On each machine, the three OSD journals live on the
same ext4 filesystem on an OSD”.

Anyway… what kind of SSD do you have? With iostat -xm 1, do you see high % 
utilisation on that SSD during these incidents? It could be that you’re 
exceeding even the iops capacity of the SSD.
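For example (sdX is just a placeholder for whatever device that SSD is):

  iostat -xm 1 /dev/sdX
  # %util pinned near 100 and rising await during an incident would point
  # at the journal SSD being the bottleneck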

Cheers, Dan


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Dan Van Der Ster
Hi,

Do you get slow requests during the slowness incidents? What about monitor 
elections?
Are your MDSs using a lot of CPU? Did you try tuning anything in the MDS (I
think the default config is still conservative, and there are options to cache 
more entries, etc…)
What about iostat on the OSDs — are your OSD disks busy reading or writing 
during these incidents?
What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is 
the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is reading 
the disks heavily. One thing to check is updatedb — we had to disable it from 
indexing /var/lib/ceph on our OSDs.

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:

   We have a ceph system here, and we're seeing performance regularly
 descend into unusability for periods of minutes at a time (or longer).
 This appears to be triggered by writing large numbers of small files.
 
   Specifications:
 
 ceph 0.80.5
 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
 2 machines running primary and standby MDS
 3 monitors on the same machines as the OSDs
 Infiniband to about 8 CephFS clients (headless, in the machine room)
 Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
   machines, in the analysis lab)
 
   The cluster stores home directories of the users and a larger area
 of scientific data (approx 15 TB) which is being processed and
 analysed by the users of the cluster.
 
   We have a relatively small number of concurrent users (typically
 4-6 at most), who use GUI tools to examine their data, and then
 complex sets of MATLAB scripts to process it, with processing often
 being distributed across all the machines using Condor.
 
   It's not unusual to see the analysis scripts write out large
 numbers (thousands, possibly tens or hundreds of thousands) of small
 files, often from many client machines at once in parallel. When this
 happens, the ceph cluster becomes almost completely unresponsive for
 tens of seconds (or even for minutes) at a time, until the writes are
 flushed through the system. Given the nature of modern GUI desktop
 environments (often reading and writing small state files in the
 user's home directory), this means that desktop interactiveness and
 responsiveness for all the other users of the cluster suffer.
 
   1-minute load on the servers typically peaks at about 8 during
 these events (on 4-core machines). Load on the clients also peaks
 high, because of the number of processes waiting for a response from
 the FS. The MDS shows little sign of stress -- it seems to be entirely
 down to the OSDs. ceph -w shows requests blocked for more than 10
 seconds, and in bad cases, ceph -s shows up to many hundreds of
 requests blocked for more than 32s.
 
   We've had to turn off scrubbing and deep scrubbing completely --
 except between 01.00 and 04.00 every night -- because it triggers the
 exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
 up to 7 PGs being scrubbed, as it did on Monday, it's completely
 unusable.
 
   Is this problem something that's often seen? If so, what are the
 best options for mitigation or elimination of the problem? I've found
 a few references to issue #6278 [1], but that seems to be referencing
 scrub specifically, not ordinary (if possibly pathological) writes.
 
   What are the sorts of things I should be looking at to work out
 where the bottleneck(s) are? I'm a bit lost about how to drill down
 into the ceph system for identifying performance issues. Is there a
 useful guide to tools somewhere?
 
   Is an upgrade to 0.84 likely to be helpful? How development are
 the development releases, from a stability / dangerous bugs point of
 view?
 
   Thanks,
   Hugo.
 
 [1] http://tracker.ceph.com/issues/6278
 
 -- 
 Hugo Mills :: IT Services, University of Reading
 Specialist Engineer, Research Servers


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Dan Van Der Ster
Hi,

On 20 Aug 2014, at 16:55, German Anders gand...@despegar.com wrote:

Hi Dan,

  How are you? I want to know how you disabled the indexing of /var/lib/ceph
on the OSDs?


# grep ceph /etc/updatedb.conf
PRUNEPATHS = "/afs /media /net /sfs /tmp /udev /var/cache/ccache /var/spool/cups /var/spool/squid /var/tmp /var/lib/ceph"



Did you disable deep scrub on your OSDs?


No, but this can be an issue. If you get many PGs scrubbing at once, performance
will suffer.

There is a new feature in 0.67.10 to sleep between scrubbing “chunks”. I set
that sleep to 0.1 (and the chunk_max to 5, and the scrub size to 1MB). In
0.67.10+1 there are some new options to set the iopriority of the scrubbing 
threads. Set that to class = 3, priority = 0 to give the scrubbing thread the 
idle priority. You need to use the cfq disk scheduler for io priorities to 
work. (cfq will also help if updatedb is causing any problems, since it runs 
with ionice -c 3).

I’m pretty sure those features will come in 0.80.6 as well.
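Roughly, the settings I'm referring to look like the following. Option names
are from memory (and I'm taking the 1MB scrub size to be the deep scrub
stride), so please double-check them against your release before using:

  [osd]
      osd scrub sleep = 0.1
      osd scrub chunk max = 5
      osd deep scrub stride = 1048576          ; 1MB scrub reads
      osd disk thread ioprio class = idle      ; i.e. ionice class 3
      osd disk thread ioprio priority = 0

  # io priorities only take effect with the cfq scheduler, e.g. per OSD disk:
  echo cfq > /sys/block/sdX/queue/scheduler    # sdX = placeholder for each OSD disk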

Do you have the journals on SSDs or RAMDISK?


Never use RAMDISK.

We currently have the journals on the same spinning disk as the OSD, but the 
iops performance is low for the rbd and fs use-cases. (For object store it 
should be OK). But for rbd or fs, you really need journals on SSDs or your 
cluster will suffer.

We now have SSDs on order to augment our cluster. (The way I justified this is 
that our cluster has X TB of storage capacity and Y iops capacity. With disk 
journals we will run out of iops capacity well before we run out of storage 
capacity. So you can either increase the iops capacity substantially by 
decreasing the volume of the cluster by 20% and replacing those disks with SSD 
journals, or you can just leave 50% of the disk capacity empty since you can’t 
use it anyway).


What's the perf of your cluster? rados bench? fio? I've set up a new cluster
and I want to know what would be the best scheme to go with.

It’s not really meaningful to compare performance of different clusters with
different hardware. Some “constants” I can advise on:
  - with few clients, large write throughput is limited by the clients'
bandwidth, as long as you have enough OSDs and the client is striping over many
objects.
  - with disk journals, small write latency will be ~30-50ms even when the 
cluster is idle. if you have SSD journals, maybe ~10ms.
  - count your iops. Each disk OSD can do ~100, and you need to divide by the 
number of replicas. With SSDs you can do a bit better than this since the 
synchronous writes go to the SSDs not the disks. In my tests with our hardware 
I estimate that going from disk to SSD journal will multiply the iops capacity 
by around 5x.
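As a rough worked example for a cluster like Hugo's (18 OSDs, assuming 3
replicas; these are ballpark estimates, not measurements):

  18 OSDs x ~100 IOPS each  = ~1800 raw write IOPS
  ~1800 / 3 replicas        = ~600 sustained client write IOPS (disk journals)
  ~600 x ~5 (SSD journals)  = ~3000 client write IOPS, very roughly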

I also found that I needed to increase some of the journal max write and journal
queue max limits, also the filestore limits, to squeeze the best performance 
out of the SSD journals. Try increasing filestore queue max ops/bytes, 
filestore queue committing max ops/bytes, and the filestore wbthrottle xfs * 
options. (I’m not going to publish exact configs here because I haven’t 
finished tuning yet).
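Just to show the shape of such a config, the values below are purely
illustrative placeholders (not my tuned numbers):

  [osd]
      filestore queue max ops = 500
      filestore queue max bytes = 104857600
      filestore queue committing max ops = 500
      filestore queue committing max bytes = 104857600
      journal max write entries = 1000
      journal max write bytes = 104857600
      journal queue max ops = 3000
      journal queue max bytes = 104857600
      filestore wbthrottle xfs ios start flusher = 500
      filestore wbthrottle xfs bytes start flusher = 41943040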

Cheers, Dan


Thanks a lot!!

Best regards,

German Anders

On Wednesday 20/08/2014 at 11:51, Dan Van Der Ster wrote:
Hi,

Do you get slow requests during the slowness incidents? What about monitor 
elections?
Are your MDSs using a lot of CPU? Did you try tuning anything in the MDS (I
think the default config is still conservative, and there are options to cache 
more entries, etc…)
What about iostat on the OSDs — are your OSD disks busy reading or writing 
during these incidents?
What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is 
the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is reading 
the disks heavily. One thing to check is updatedb — we had to disable it from 
indexing /var/lib/ceph on our OSDs.

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:

We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
machines, in the analysis lab)

The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

We have a 

Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   Hi, Dan,

   Some questions below I can't answer immediately, but I'll spend
tomorrow morning irritating people by triggering these events (I think
I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
files in it) and giving you more details. For the ones I can answer
right now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
 Do you get slow requests during the slowness incidents?

   Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 
kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below;
oldest blocked for > 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, 
received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 
10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, 
received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 
1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, 
received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 
1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, 
received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 
1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, 
received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 
1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below;
oldest blocked for > 10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, 
received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 
100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, 
received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 
100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 
B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below;
oldest blocked for > 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, 
received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 
10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, 
received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 
10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, 
received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 
10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 12231 
B/s rd, 5534 kB/s wr, 370 op/s
2014-08-20 15:51:26.925996 mon.1 [INF] pgmap v2287929: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 26498 
B/s rd, 8121 kB/s wr, 367 op/s
2014-08-20 15:51:27.933424 mon.1 [INF] pgmap v2287930: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 706 kB/s 
rd, 7552 kB/s wr, 444 op/s

 What about monitor elections?

   No, that's been reporting monmap e3 and election epoch 130 for
a week or two. I assume that to mean we've had no elections. We're
actually running without one monitor at the moment, because one
machine is down, but we've had the same problems with the machine
present.

 Are your MDSs using a lot of CPU?

   No, they're showing load averages well under 1 the whole time. Peak
load average is about 0.6.

 did you try tuning anything in the MDS (I think the default config
 is still conservative, and there are options to cache more entries,
 etc…)

   Not much. We have:


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Andrei Mikhailovsky
Hugo,

I would look at setting up a cache pool made of 4-6 SSDs to start with. So, if
you have 6 OSD servers, stick at least 1 SSD disk in each server for the cache
pool. It should greatly reduce the OSDs' stress from writing a large number of
small files. Your cluster should become more responsive and the end users'
experience should also improve.

I am planning on doing so in a near future, but according to my friend's 
experience, introducing a cache pool has greatly increased the overall 
performance of the cluster and has removed the performance issues that he was 
having during scrubbing/deep-scrubbing/recovery activities.

The size of your working data set should determine the size of the cache pool, 
but in general it will create a nice speedy buffer between your clients and 
those terribly slow spindles.
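For reference, the basic commands to attach a cache tier in firefly look
roughly like this (pool names are placeholders, and cache tiering in 0.80.x is
still young, so test carefully before putting it in front of your CephFS data):

  ceph osd tier add cephfs-data ssd-cache
  ceph osd tier cache-mode ssd-cache writeback
  ceph osd tier set-overlay cephfs-data ssd-cache
  ceph osd pool set ssd-cache hit_set_type bloom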

Andrei
