Re: [ceph-users] Serious performance problems with small file writes
Hello,

On Wed, 20 Aug 2014 15:39:11 +0100 Hugo Mills wrote:

> We have a ceph system here, and we're seeing performance regularly
> descend into unusability for periods of minutes at a time (or longer).
> This appears to be triggered by writing large numbers of small files.
>
> Specifications:
>
>    ceph 0.80.5
>    6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
>    2 machines running primary and standby MDS
>    3 monitors on the same machines as the OSDs
>    Infiniband to about 8 CephFS clients (headless, in the machine room)
>    Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
>    machines, in the analysis lab)

Please let us know the CPU and memory specs of the OSD nodes as well,
and the replication factor -- I presume 3 if you value that data. Also
the PG and PGP values for the pool(s) you're using.

> The cluster stores home directories of the users and a larger area of
> scientific data (approx 15 TB) which is being processed and analysed
> by the users of the cluster. We have a relatively small number of
> concurrent users (typically 4-6 at most), who use GUI tools to examine
> their data, and then complex sets of MATLAB scripts to process it,
> with processing often being distributed across all the machines using
> Condor.
>
> It's not unusual to see the analysis scripts write out large numbers
> (thousands, possibly tens or hundreds of thousands) of small files,
> often from many client machines at once in parallel. When this
> happens, the ceph cluster becomes almost completely unresponsive for
> tens of seconds (or even for minutes) at a time, until the writes are
> flushed through the system. Given the nature of modern GUI desktop
> environments (often reading and writing small state files in the
> user's home directory), this means that desktop interactivity and
> responsiveness for all the other users of the cluster suffer.
>
> 1-minute load on the servers typically peaks at about 8 during these
> events (on 4-core machines). Load on the clients also peaks high,
> because of the number of processes waiting for a response from the FS.
> The MDS shows little sign of stress -- it seems to be entirely down to
> the OSDs. ceph -w shows requests blocked for more than 10 seconds,
> and in bad cases, ceph -s shows many hundreds of requests blocked for
> more than 32s.
>
> We've had to turn off scrubbing and deep scrubbing completely --
> except between 01.00 and 04.00 every night -- because it triggers the
> exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> up to 7 PGs being scrubbed, as it did on Monday, it's completely
> unusable.

Note that I know nothing about CephFS, and while there are probably
tunables, the slow requests you're seeing and the hardware described
above definitely suggest slow OSDs. With a replication factor of 3,
your total cluster performance (sustained) is that of just 6 disks, and
4 TB drives are never any speed wonders -- minus the latency overheads
from the network, which should be minimal in your case. You wrote that
your old NFS setup had twice the spindles, so if that means 36 disks it
was quite a bit faster.

A cluster I'm just building with 3 nodes, 4 journal SSDs and 8 OSD HDDs
per node can do about 7000 write IOPS (4KB), so I would expect yours to
be worse off. Having the journals on dedicated partitions instead of
files on the rootfs would not only be faster (though probably not
significantly so), but would also prevent any potential failures caused
by FS corruption. The SSD journals will compensate for some spikes of
high IOPS, but 25 files is clearly beyond that. Putting lots of RAM
(relatively cheap these days) into the OSD nodes has the big benefit
that reads of hot objects will not have to go to disk and thus compete
with write IOPS.

> Is this problem something that's often seen? If so, what are the best
> options for mitigation or elimination of the problem? I've found a few
> references to issue #6278 [1], but that seems to be referencing scrub
> specifically, not ordinary (if possibly pathological) writes.

You need to match your cluster to your workload. Aside from tuning
things (which tends to have limited effects), you can either scale out
by adding more servers, or scale up by using faster storage and/or a
cache pool.

> What are the sorts of things I should be looking at to work out where
> the bottleneck(s) are? I'm a bit lost about how to drill down into the
> ceph system for identifying performance issues. Is there a useful
> guide to tools somewhere?

Reading/scouring this ML can be quite helpful. Watch your OSD nodes
(all of them!) with iostat or preferably atop (which will also show you
how your CPUs and network are doing) while running the below stuff. To
get a baseline do:

   rados -p pool-in-question bench 60 write -t 64

This will test your throughput most of all and, due to the 4MB block
size, spread the load very equally amongst the OSDs. During that test
you should see all OSDs more or
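The baseline test described above, plus the monitoring to run alongside
it, would look like this (`pool-in-question` stands in for a real pool
name; the 4 KB variant is an extra suggestion for approximating the
small-file workload discussed in this thread, not part of the original
advice):

```shell
# 60-second write benchmark, 64 concurrent ops, default 4 MB objects:
# tests throughput and spreads load evenly across the OSDs.
rados -p pool-in-question bench 60 write -t 64

# The same benchmark with 4 KB objects (-b 4096) stresses IOPS instead,
# which is closer to the small-file workload described here.
rados -p pool-in-question bench 60 write -t 64 -b 4096
```

While either benchmark runs, watch every OSD node with `iostat -xm 5`
or atop to see which disks saturate first.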
Re: [ceph-users] Serious performance problems with small file writes
Hi Hugo,

On 20 Aug 2014, at 17:54, Hugo Mills h.r.mi...@reading.ac.uk wrote:

>> What are you using for OSD journals?
>
> On each machine, the three OSD journals live on the same ext4
> filesystem on an SSD, which is also the root filesystem of the
> machine.
>
>> Also check the CPU usage for the mons and osds...
>
> The mons are doing pretty much nothing in terms of CPU, as far as I
> can see. I will double-check during an incident.
>
>> Does your hardware provide enough IOPS for what your users need?
>> (e.g. what is the op/s from ceph -w)
>
> Not really an answer to your question, but: before the ceph cluster
> went in, we were running the system on two 5-year-old NFS servers for
> a while. We have about half the total number of spindles that we used
> to, but more modern drives.

NFS exported async or sync? If async, it can't be compared to CephFS.
Also, if those NFS servers had RAID cards with a wb-cache, it can't
really be compared.

> I'll look at how the op/s values change when we have the problem. At
> the moment (with what I assume to be normal desktop usage from the
> 3-4 users in the lab), they're flapping wildly somewhere around a
> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
> read and write.

Another tunable to look at is the filestore max sync interval -- in my
experience the colocated journal/OSD setup suffers with the default
(5s, IIRC), especially when an OSD is getting a constant stream of
writes. When this happens, the disk heads are constantly seeking back
and forth between synchronously writing to the journal and flushing the
outstanding writes. If we had a dedicated (spinning) disk for the
journal, then the synchronous writes (to the journal) could be done
sequentially (thus, quickly) and the flushes would also be quick(er).
SSD journals can obviously also help with this.

For a short test I would try increasing filestore max sync interval to
30s or maybe even 60s to see if it helps. (I know that at least one of
the Inktank experts advises against changing the filestore max sync
interval -- but in my experience 5s is much too short for the colocated
journal setup.) You need to make sure your journals are large enough to
store 30/60s of writes, but when you have predominantly small writes
even a few GB of journal ought to be enough.

Cheers, Dan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
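For reference, the short test described above would look something like
this in ceph.conf (the 30s value is the experiment proposed in the
thread, not a general recommendation; it takes effect after an OSD
restart):

```
[osd]
filestore max sync interval = 30
```

The same value can also be applied at runtime with
`ceph tell osd.* injectargs '--filestore_max_sync_interval 30'`.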
Re: [ceph-users] Serious performance problems with small file writes
Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:

> Some questions below I can't answer immediately, but I'll spend
> tomorrow morning irritating people by triggering these events (I think
> I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small files
> in it) and giving you more details.

Yes, the tarball with the 25 small files in it is definitely a
reproducer.

[snip]

>> What about iostat on the OSDs -- are your OSD disks busy reading or
>> writing during these incidents?
>
> Not sure. I don't think so, but I'll try to trigger an incident and
> report back on this one.

Mostly writing. I'm seeing figures of up to about 2-3 MB/s writes, and
200-300 kB/s reads on all three, but it fluctuates a lot (with 5-second
intervals). Sample data at the end of the email.

>> What are you using for OSD journals?
>
> On each machine, the three OSD journals live on the same ext4
> filesystem on an SSD, which is also the root filesystem of the
> machine.
>
>> Also check the CPU usage for the mons and osds...
>
> The mons are doing pretty much nothing in terms of CPU, as far as I
> can see. I will double-check during an incident.

The mons are just ticking over with a 1% CPU usage.

>> Does your hardware provide enough IOPS for what your users need?
>> (e.g. what is the op/s from ceph -w)
>
> Not really an answer to your question, but: before the ceph cluster
> went in, we were running the system on two 5-year-old NFS servers for
> a while. We have about half the total number of spindles that we used
> to, but more modern drives.
>
> I'll look at how the op/s values change when we have the problem. At
> the moment (with what I assume to be normal desktop usage from the
> 3-4 users in the lab), they're flapping wildly somewhere around a
> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
> read and write.

With minimal users and one machine running the tar unpacking process,
I'm getting somewhere around 100-200 op/s on the ceph cluster, but
interactivity on the desktop machine I'm logged in on is horrible --
I'm frequently getting tens of seconds of latency. Compare that to the
(relatively) comfortable 350-400 op/s we had yesterday with what is
probably a workload of larger files.

>> If disabling deep scrub helps, then it might be that something else
>> is reading the disks heavily. One thing to check is updatedb -- we
>> had to disable it from indexing /var/lib/ceph on our OSDs.
>
> I haven't seen that running at all during the day, but I'll look into
> it.

No, it's not anything like that -- iotop reports that pretty much the
only things doing IO are ceph-osd and the occasional xfsaild.

   Hugo.

> Best Regards, Dan
> --
> Dan van der Ster || Data Storage Services || CERN IT Department
>
> On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:
> [... original message quoted in full earlier in the thread ...]
Re: [ceph-users] Serious performance problems with small file writes
On Thu, Aug 21, 2014 at 07:40:45AM +, Dan Van Der Ster wrote:

>>> Does your hardware provide enough IOPS for what your users need?
>>> (e.g. what is the op/s from ceph -w)
>>
>> Not really an answer to your question, but: before the ceph cluster
>> went in, we were running the system on two 5-year-old NFS servers for
>> a while. We have about half the total number of spindles that we used
>> to, but more modern drives.
>
> NFS exported async or sync? If async, it can't be compared to CephFS.
> Also, if those NFS servers had RAID cards with a wb-cache, it can't
> really be compared.

Hmm. Yes, async. Probably wouldn't have been my choice... (I only
started working with this system recently -- about the same time that
the ceph cluster was deployed to replace the older machines. I haven't
had much of a say in what's implemented here, but I have to try to
support it.)

I'm tempted to put the users' home directories back on an NFS server,
and keep ceph for the research data. That at least should give us more
in the way of interactivity (which is the main thing I'm getting
complaints about).

>> I'll look at how the op/s values change when we have the problem. At
>> the moment (with what I assume to be normal desktop usage from the
>> 3-4 users in the lab), they're flapping wildly somewhere around a
>> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
>> read and write.
>
> Another tunable to look at is the filestore max sync interval -- in my
> experience the colocated journal/OSD setup suffers with the default
> (5s, IIRC), especially when an OSD is getting a constant stream of
> writes. When this happens, the disk heads are constantly seeking back
> and forth between synchronously writing to the journal and flushing
> the outstanding writes. If we had a dedicated (spinning) disk for the
> journal, then the synchronous writes (to the journal) could be done
> sequentially (thus, quickly) and the flushes would also be quick(er).
> SSD journals can obviously also help with this.

Not sure what you mean about colocated journal/OSD. The journals aren't
on the same device as the OSDs. However, all three journals on each
machine are on the same SSD.

> For a short test I would try increasing filestore max sync interval
> to 30s or maybe even 60s to see if it helps. (I know that at least one
> of the Inktank experts advises against changing the filestore max sync
> interval -- but in my experience 5s is much too short for the
> colocated journal setup.) You need to make sure your journals are
> large enough to store 30/60s of writes, but when you have
> predominantly small writes even a few GB of journal ought to be
> enough.

I'll have a play with that. Thanks for all the help so far -- it's been
useful. I'm learning what the right kind of questions are.

   Hugo.

--
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building
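The journal sizing question above can be sanity-checked with the usual
rule of thumb (journal size of at least 2 x expected throughput x
filestore max sync interval); the figures below are the 30 s test value
and the ~20 MB/s peak reported earlier in the thread:

```shell
sync_interval=30   # seconds, the test value suggested above
write_rate_mb=20   # MB/s, roughly the peak write rate Hugo reported
# minimum journal size in MB for this sync interval:
echo $(( 2 * write_rate_mb * sync_interval ))
```

This gives 1200 MB, i.e. a ~1.2 GB journal would cover a 30 s interval
at that rate, consistent with "even a few GB of journal ought to be
enough".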
Re: [ceph-users] Serious performance problems with small file writes
Hi Hugo,

On 21 Aug 2014, at 14:17, Hugo Mills h.r.mi...@reading.ac.uk wrote:

> Not sure what you mean about colocated journal/OSD. The journals
> aren't on the same device as the OSDs. However, all three journals on
> each machine are on the same SSD.

*embarrassed* I obviously didn't drink enough coffee this morning. I
read your reply as something like "... On each machine, the three OSD
journals live on the same ext4 filesystem on an OSD".

Anyway... what kind of SSD do you have? With iostat -xm 1, do you see
high % utilisation on that SSD during these incidents? It could be that
you're exceeding even the iops capacity of the SSD.

Cheers, Dan
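Concretely, the check suggested above would be run on each OSD node
during an incident (the device name is a placeholder for the journal
SSD):

```shell
# %util pinned near 100 and a growing await column on the journal SSD
# would mean one SSD cannot absorb the journal writes of all three OSDs.
iostat -xm 1 /dev/sdX
```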
Re: [ceph-users] Serious performance problems with small file writes
Hi,

Do you get slow requests during the slowness incidents? What about
monitor elections?

Are your MDSs using a lot of CPU? Did you try tuning anything in the
MDS? (I think the default config is still conservative, and there are
options to cache more entries, etc...)

What about iostat on the OSDs -- are your OSD disks busy reading or
writing during these incidents? What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g.
what is the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is
reading the disks heavily. One thing to check is updatedb -- we had to
disable it from indexing /var/lib/ceph on our OSDs.

Best Regards, Dan
--
Dan van der Ster || Data Storage Services || CERN IT Department

On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:

> We have a ceph system here, and we're seeing performance regularly
> descend into unusability for periods of minutes at a time (or longer).
> This appears to be triggered by writing large numbers of small files.
>
> [...]
>
> Is an upgrade to 0.84 likely to be helpful? How development are the
> development releases, from a stability / dangerous bugs point of view?
>
> Thanks, Hugo.
>
> [1] http://tracker.ceph.com/issues/6278
>
> --
> Hugo Mills :: IT Services, University of Reading
> Specialist Engineer, Research Servers
Re: [ceph-users] Serious performance problems with small file writes
Hi,

On 20 Aug 2014, at 16:55, German Anders gand...@despegar.com wrote:

> Hi Dan, how are you? I want to know how you disabled the indexing on
> the /var/lib/ceph OSDs?

   # grep ceph /etc/updatedb.conf
   PRUNEPATHS = /afs /media /net /sfs /tmp /udev /var/cache/ccache /var/spool/cups /var/spool/squid /var/tmp /var/lib/ceph

> Did you disable deep scrub on your OSDs?

No, but this can be an issue. If you get many PGs scrubbing at once,
performance will suffer. There is a new feature in 0.67.10 to sleep
between scrubbing "chunks". I set that sleep to 0.1 (and the chunk_max
to 5, and the scrub size to 1MB). In 0.67.10+1 there are some new
options to set the iopriority of the scrubbing threads. Set that to
class = 3, priority = 0 to give the scrubbing thread the idle priority.
You need to use the cfq disk scheduler for io priorities to work. (cfq
will also help if updatedb is causing any problems, since it runs with
ionice -c 3.) I'm pretty sure those features will come in 0.80.6 as
well.

> Do you have the journals on SSDs or RAMDISK?

Never use RAMDISK. We currently have the journals on the same spinning
disk as the OSD, but the iops performance is low for the rbd and fs
use-cases. (For an object store it should be OK.) But for rbd or fs,
you really need journals on SSDs or your cluster will suffer. We now
have SSDs on order to augment our cluster. (The way I justified this is
that our cluster has X TB of storage capacity and Y iops capacity. With
disk journals we will run out of iops capacity well before we run out
of storage capacity. So you can either increase the iops capacity
substantially by decreasing the volume of the cluster by 20% and
replacing those disks with SSD journals, or you can just leave 50% of
the disk capacity empty, since you can't use it anyway.)

> What's the perf of your cluster? rados bench? fio? I've set up a new
> cluster and I want to know what would be the best option scheme to go
> with.
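The scrub throttling described above maps onto config options along
these lines (option names as they existed around the 0.67.10/0.80.x
releases; note that in the shipped code the disk-thread iopriority
class is expressed as a name such as "idle" rather than the numeric
class 3, so treat this fragment as a sketch to check against your
version's option list):

```
[osd]
osd scrub sleep = 0.1             # pause between scrub chunks
osd scrub chunk max = 5           # objects scrubbed per chunk
osd deep scrub stride = 1048576   # deep-scrub read size (1 MB)
osd disk thread ioprio class = idle    # requires the cfq scheduler
osd disk thread ioprio priority = 0
```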
It's not really meaningful to compare performance of different clusters
with different hardware. Some "constants" I can advise:

- With few clients, large write throughput is limited by the client's
  bandwidth, as long as you have enough OSDs and the client is striping
  over many objects.
- With disk journals, small write latency will be ~30-50ms even when
  the cluster is idle. If you have SSD journals, maybe ~10ms.
- Count your iops. Each disk OSD can do ~100, and you need to divide by
  the number of replicas. With SSDs you can do a bit better than this,
  since the synchronous writes go to the SSDs, not the disks. In my
  tests with our hardware I estimate that going from disk to SSD
  journal will multiply the iops capacity by around 5x.

I also found that I needed to increase some of the journal max write
and journal queue max limits, and also the filestore limits, to squeeze
the best performance out of the SSD journals. Try increasing filestore
queue max ops/bytes, filestore queue committing max ops/bytes, and the
filestore wbthrottle xfs * options. (I'm not going to publish exact
configs here because I haven't finished tuning yet.)

Cheers, Dan

> Thanks a lot!! Best regards, German Anders
>
> On Wednesday 20/08/2014 at 11:51, Dan Van Der Ster wrote:
> [... questions quoted in full earlier in the thread ...]
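Dan's "count your iops" rule, applied to the cluster from this thread
(18 spinning OSDs across 6 machines, replication factor 3, ~100 IOPS
per spinning disk):

```shell
osds=18            # 6 machines x 3 OSDs
iops_per_disk=100  # rough figure for a 7200 RPM drive
replicas=3
# sustained client write IOPS ceiling, before journals and caches:
echo $(( osds * iops_per_disk / replicas ))
```

This prints 600, which squares with the observed op/s flapping around
a median of 350-400 with peaks up to 800.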
Re: [ceph-users] Serious performance problems with small file writes
Hi, Dan,

Some questions below I can't answer immediately, but I'll spend
tomorrow morning irritating people by triggering these events (I think
I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small files
in it) and giving you more details. For the ones I can answer right
now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:

> Do you get slow requests during the slowness incidents?

Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; oldest blocked for 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; oldest blocked for 10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; oldest blocked for 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 12231 B/s rd, 5534 kB/s wr, 370 op/s
2014-08-20 15:51:26.925996 mon.1 [INF] pgmap v2287929: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 26498 B/s rd, 8121 kB/s wr, 367 op/s
2014-08-20 15:51:27.933424 mon.1 [INF] pgmap v2287930: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 706 kB/s rd, 7552 kB/s wr, 444 op/s

> What about monitor elections?

No, that's been reporting monmap e3 and election epoch 130 for a week
or two. I assume that to mean we've had no elections. We're actually
running without one monitor at the moment, because one machine is down,
but we've had the same problems with the machine present.

> Are your MDSs using a lot of CPU?

No, they're showing load averages well under 1 the whole time. Peak
load average is about 0.6.

> Did you try tuning anything in the MDS (I think the default config is
> still conservative, and there are options to cache more entries,
> etc...)

Not much. We have:
Re: [ceph-users] Serious performance problems with small file writes
Hugo,

I would look at setting up a cache pool made of 4-6 SSDs to start with.
So, if you have 6 OSD servers, stick at least 1 SSD disk in each server
for the cache pool. It should greatly reduce the OSDs' stress of
writing a large number of small files. Your cluster should become more
responsive and the end users' experience should also improve.

I am planning on doing so in the near future, and according to my
friend's experience, introducing a cache pool has greatly increased the
overall performance of the cluster and has removed the performance
issues that he was having during scrubbing/deep-scrubbing/recovery
activities.

The size of your working data set should determine the size of the
cache pool, but in general it will create a nice speedy buffer between
your clients and those terribly slow spindles.

Andrei

----- Original Message -----
From: Hugo Mills h.r.mi...@reading.ac.uk
To: Dan Van Der Ster daniel.vanders...@cern.ch
Cc: Ceph Users List ceph-users@lists.ceph.com
Sent: Wednesday, 20 August, 2014 4:54:28 PM
Subject: Re: [ceph-users] Serious performance problems with small file writes

> Hi, Dan,
>
> Some questions below I can't answer immediately, but I'll spend
> tomorrow morning irritating people by triggering these events (I think
> I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small files
> in it) and giving you more details. For the ones I can answer right
> now:
>
> On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
>
>> Do you get slow requests during the slowness incidents?
>
> Slow requests, yes.
> ceph -w reports them coming in groups, e.g.:
>
> [... same slow-request log excerpt as quoted earlier in the thread ...]
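For reference, the cache pool Andrei suggests is created with the
firefly-era tiering commands, roughly as below. The pool names `data`
and `cache` are hypothetical, and the cache pool must first be mapped
onto the SSDs via a suitable CRUSH rule:

```shell
# create the SSD-backed pool and attach it as a writeback cache tier
# in front of the existing data pool
ceph osd pool create cache 128 128
ceph osd tier add data cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay data cache
# firefly needs a hit set so the tier can track object temperature
ceph osd pool set cache hit_set_type bloom
```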