Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
   Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:
> Some questions below I can't answer immediately, but I'll spend
> tomorrow morning irritating people by triggering these events (I think
> I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
> files in it) and giving you more details.

   Yes, the tarball with the 25 small files in it is definitely a
reproducer.
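
   (That is, something like

    time tar xf the-big-tarball.tar

run on one client -- the filename is a stand-in -- is enough to make
every other client's desktop crawl.)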

[snip]
>> What about iostat on the OSDs — are your OSD disks busy reading or
>> writing during these incidents?
>
> Not sure. I don't think so, but I'll try to trigger an incident and
> report back on this one.

   Mostly writing. I'm seeing figures of up to about 2-3 MB/s writes
and 200-300 kB/s reads on all three disks, but it fluctuates a lot
(sampled at 5-second intervals). Sample data at the end of the email.
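
   (For reference, I'm collecting those figures with something like
the following on each OSD host -- the device names here are
illustrative, not necessarily our exact layout:

    iostat -x sdb sdc sdd 5

i.e. extended per-device statistics, sampled every 5 seconds.)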

>> What are you using for OSD journals?
>
> On each machine, the three OSD journals live on the same ext4
> filesystem on an SSD, which is also the root filesystem of the
> machine.
 
>> Also check the CPU usage for the mons and osds...
>
> The mons are doing pretty much nothing in terms of CPU, as far as I
> can see. I will double-check during an incident.

   The mons are just ticking over at about 1% CPU usage.

>> Does your hardware provide enough IOPS for what your users need?
>> (e.g. what is the op/s from ceph -w)
>
> Not really an answer to your question, but: Before the ceph cluster
> went in, we were running the system on two 5-year-old NFS servers for
> a while. We have about half the total number of spindles that we used
> to, but more modern drives.
>
> I'll look at how the op/s values change when we have the problem.
> At the moment (with what I assume to be normal desktop usage from the
> 3-4 users in the lab), they're flapping wildly somewhere around a
> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
> read and write.

   With minimal users and one machine running the tar unpacking
process, I'm getting somewhere around 100-200 op/s on the ceph
cluster, but interactivity on the desktop machine I'm logged in on is
horrible -- I'm frequently getting tens of seconds of latency. Compare
that to the (relatively) comfortable 350-400 op/s we had yesterday
with what was probably a workload of larger files.

>> If disabling deep scrub helps, then it might be that something else
>> is reading the disks heavily. One thing to check is updatedb — we
>> had to disable it from indexing /var/lib/ceph on our OSDs.
>
> I haven't seen that running at all during the day, but I'll look
> into it.

   No, it's not anything like that -- iotop reports pretty much the
only things doing IO are ceph-osd and the occasional xfsaild.
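
   (That's from watching something like

    iotop -o -d 5

where -o limits the output to processes actually doing IO, refreshed
every 5 seconds.)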

   Hugo.

> Hugo.
>
>> Best Regards,
>> Dan
>>
>> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
  
  
>> On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:
>>
>>> [snip: original message, quoted in full in the 2014-08-20 post below]

Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
On Thu, Aug 21, 2014 at 07:40:45AM +0000, Dan Van Der Ster wrote:
> On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:
>>> Does your hardware provide enough IOPS for what your users need?
>>> (e.g. what is the op/s from ceph -w)
>>
>> Not really an answer to your question, but: Before the ceph cluster
>> went in, we were running the system on two 5-year-old NFS servers for
>> a while. We have about half the total number of spindles that we used
>> to, but more modern drives.
>
> NFS exported async or sync? If async, it can’t be compared to
> CephFS. Also, if those NFS servers had RAID cards with a wb-cache,
> it can’t really be compared.

   Hmm. Yes, async. Probably wouldn't have been my choice... (I only
started working with this system recently -- about the same time that
the ceph cluster was deployed to replace the older machines. I haven't
had much of a say in what's implemented here, but I have to try to
support it.)
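
   ("Async" being the exports(5) option, i.e. presumably lines of the
form

    /home  clients(rw,async)

where the server acks writes before they reach disk -- which is
exactly why it looked faster than it honestly was. The hostname here
is a stand-in.)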

   I'm tempted to put the users' home directories back on an NFS
server, and keep ceph for the research data. That at least should give
us more in the way of interactivity (which is the main thing I'm
getting complaints about).

>> I'll look at how the op/s values change when we have the problem.
>> At the moment (with what I assume to be normal desktop usage from the
>> 3-4 users in the lab), they're flapping wildly somewhere around a
>> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
>> read and write.

> Another tunable to look at is the filestore max sync interval — in
> my experience the colocated journal/OSD setup suffers with the
> default (5s, IIRC), especially when an OSD is getting a constant
> stream of writes. When this happens, the disk heads are constantly
> seeking back and forth between synchronously writing to the journal
> and flushing the outstanding writes. If you had a dedicated
> (spinning) disk for the journal, then the synchronous writes (to the
> journal) could be done sequentially (thus, quickly) and the flushes
> would also be quick(er). SSD journals can obviously also help with
> this.

   Not sure what you mean about colocated journal/OSD. The journals
aren't on the same device as the OSDs. However, all three journals on
each machine are on the same SSD.

> For a short test I would try increasing filestore max sync interval
> to 30s or maybe even 60s to see if it helps. (I know that at least
> one of the Inktank experts advises against changing the filestore max
> sync interval — but in my experience 5s is much too short for the
> colocated journal setup.) You need to make sure your journals are
> large enough to store 30/60s of writes, but when you have
> predominantly small writes even a few GB of journal ought to be
> enough.

   I'll have a play with that.
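
   Presumably something like the following -- untested here, and the
30s value is just your suggestion:

    ceph tell 'osd.*' injectargs '--filestore_max_sync_interval 30'

to try it on the fly across all OSDs, and then, if it helps, making
it permanent in ceph.conf on the OSD hosts:

    [osd]
        filestore max sync interval = 30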

   Thanks for all the help so far -- it's been useful. I'm learning
what the right kind of questions are.

   Hugo.

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building


[ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

   Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
   machines, in the analysis lab)

   The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

   We have a relatively small number of concurrent users (typically
4-6 at most), who use GUI tools to examine their data, and then
complex sets of MATLAB scripts to process it, with processing often
being distributed across all the machines using Condor.

   It's not unusual to see the analysis scripts write out large
numbers (thousands, possibly tens or hundreds of thousands) of small
files, often from many client machines at once in parallel. When this
happens, the ceph cluster becomes almost completely unresponsive for
tens of seconds (or even for minutes) at a time, until the writes are
flushed through the system. Given the nature of modern GUI desktop
environments (often reading and writing small state files in the
user's home directory), this means that desktop interactivity and
responsiveness for all the other users of the cluster suffer.

   1-minute load on the servers typically peaks at about 8 during
these events (on 4-core machines). Load on the clients also peaks
high, because of the number of processes waiting for a response from
the FS. The MDS shows little sign of stress -- it seems to be entirely
down to the OSDs. ceph -w shows requests blocked for more than 10
seconds, and in bad cases, ceph -s shows many hundreds of requests
blocked for more than 32s.
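
   (When that happens, running

    ceph health detail

at least tells us which OSDs the slow/blocked requests are sitting
on, though not why.)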

   We've had to turn off scrubbing and deep scrubbing completely --
except between 01.00 and 04.00 every night -- because it triggers the
exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
up to 7 PGs being scrubbed, as it did on Monday, it's completely
unusable.
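
   (There's no scrub time-window setting in 0.80 that I know of, so
this is done by flipping the cluster flags from cron, along these
lines:

    # /etc/cron.d/ceph-scrub-window (sketch)
    0 1 * * * root ceph osd unset noscrub; ceph osd unset nodeep-scrub
    0 4 * * * root ceph osd set noscrub; ceph osd set nodeep-scrub

so scrubbing can only run between 01.00 and 04.00.)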

   Is this problem something that's often seen? If so, what are the
best options for mitigation or elimination of the problem? I've found
a few references to issue #6278 [1], but that seems to be referencing
scrub specifically, not ordinary (if possibly pathological) writes.

   What are the sorts of things I should be looking at to work out
where the bottleneck(s) are? I'm a bit lost about how to drill down
into the ceph system for identifying performance issues. Is there a
useful guide to tools somewhere?
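
   So far, the only poking around I know how to do is along the lines
of:

    ceph -s
    ceph health detail
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops

(the last two on the OSD host itself, once per OSD), but I don't
really know what I should be looking for in the output.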

   Is an upgrade to 0.84 likely to be helpful? How development are
the development releases, from a stability / dangerous bugs point of
view?

   Thanks,
   Hugo.

[1] http://tracker.ceph.com/issues/6278

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   The MDS-related parts of our ceph.conf:

[global]
mds inline_data = true
mds shutdown check = 2
mds cache size = 75

[mds]
mds client prealloc inos = 1

> What about iostat on the OSDs — are your OSD disks busy reading or
> writing during these incidents?

   Not sure. I don't think so, but I'll try to trigger an incident and
report back on this one.

> What are you using for OSD journals?

   On each machine, the three OSD journals live on the same ext4
filesystem on an SSD, which is also the root filesystem of the
machine.
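
   (In ceph.conf terms that's something like the following -- the
path is illustrative rather than our exact layout:

    [osd]
        osd journal = /srv/ssd/journals/$cluster-$id/journal
        osd journal size = 5120

i.e. journal files on the SSD's ext4 filesystem rather than raw
partitions.)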

> Also check the CPU usage for the mons and osds...

   The mons are doing pretty much nothing in terms of CPU, as far as I
can see. I will double-check during an incident.

> Does your hardware provide enough IOPS for what your users need?
> (e.g. what is the op/s from ceph -w)

   Not really an answer to your question, but: Before the ceph cluster
went in, we were running the system on two 5-year-old NFS servers for
a while. We have about half the total number of spindles that we used
to, but more modern drives.

   I'll look at how the op/s values change when we have the problem.
At the moment (with what I assume to be normal desktop usage from the
3-4 users in the lab), they're flapping wildly somewhere around a
median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
read and write.
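
   (For now I'm just eyeballing the pgmap lines, e.g.

    ceph -w | grep -F 'op/s'

which is crude, but enough to see the trend.)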

> If disabling deep scrub helps, then it might be that something else
> is reading the disks heavily. One thing to check is updatedb — we
> had to disable it from indexing /var/lib/ceph on our OSDs.

   I haven't seen that running at all during the day, but I'll look
into it.
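
   (If it does turn out to be updatedb, presumably the fix is the
same as yours: add the ceph paths to PRUNEPATHS in /etc/updatedb.conf,
something like

    PRUNEPATHS="... /var/lib/ceph"

keeping whatever paths are already listed there.)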

   Hugo.

> Best Regards,
> Dan
>
> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
>
>
> On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:
 
>> [snip: original message, quoted in full in the 2014-08-20 post above]
  

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers