Re: [ceph-users] Best practice K/M-parameters EC pool
On Thu, 28 Aug 2014 10:29:20 -0400 Mike Dawson wrote:

[The long quoted exchange between Loic Dachary, Christian Balzer and Mike Dawson is trimmed here; it appears in full in the messages below.]

> Completely agree. [...]

Thank you for that voice of reason, backing things up by a real-life sizable cluster. ^o^

> Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-OSD cluster, 24 hosts, rbd pool with 3 replicas, OSD journals on SSDs (ratio of 3:1), with dual 1GbE bonded NICs.

You're generous with your SSDs. ^o^

[Remainder of the quoted message trimmed; see Mike Dawson's message below.]
Re: [ceph-users] Best practice K/M-parameters EC pool
On 8/28/2014 11:17 AM, Loic Dachary wrote:
> On 28/08/2014 16:29, Mike Dawson wrote:
>> On 8/28/2014 12:23 AM, Christian Balzer wrote:

[The earlier part of the exchange, including Christian Balzer's single-node backfill measurements, is quoted in full in the messages below and trimmed here.]

>> That makes sense to me :-) [...] In the scenario you describe the risk of data loss does not increase since the objects are evicted gradually from the disk being decommissioned and the number of replicas stays the same at all times. [...]

> That may be, but I'm rather certain that there is no difference in speed and priority between a rebalancing caused by an OSD set to weight 0 and one being set out.

>> [...] I should have written 2h instead of 1h to account for the fact that the cluster network is never idle. Am I being too optimistic ?

> Vastly.

>> Do you see another blocking factor that would significantly slow down recovery ?

> As Craig and I keep telling you, the network is not the limiting factor. Concurrent disk IO is, as I pointed out in the other thread.

Completely agree. On a production cluster with OSDs backed by spindles, even with OSD journals on SSDs, it is insufficient to calculate single-disk-replacement backfill time based solely on network throughput. IOPS will likely be the limiting factor when backfilling a single failed spinner in a production cluster.

Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-OSD cluster, 24 hosts, rbd pool with 3 replicas, OSD journals on SSDs (ratio of 3:1), with dual 1GbE bonded NICs. Using only the throughput math, backfill could theoretically have completed in a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times with similar results.

Why? Spindle contention on the replacement drive. Graph the '%util' metric from something like 'iostat -xt 2' during a single-disk backfill to get a very clear view that spindle contention is the true limiting factor.
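To put numbers on the gap Mike describes, a back-of-envelope sketch (the ~75%-full 3TB figure is from his post; the ~250 MB/s of usable throughput on a 2x1GbE bond is an assumption):

# theoretical, network-bound backfill time in hours:
$ echo "scale=1; 3 * 0.75 * 1024 * 1024 / 250 / 3600" | bc
2.6
# versus the 15 hours observed. To see the seek-bound reality, watch the
# replacement OSD's disk during backfill:
$ iostat -xt 2
# '%util' pinned near 100% on that spindle confirms IOPS, not bandwidth,
# is the bottleneck.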
Re: [ceph-users] Best practice K/M-parameters EC pool
My initial experience was similar to Mike's, causing a similar level of paranoia. :-) I'm dealing with RadosGW though, so I can tolerate higher latencies. I was running my cluster with noout and nodown set for weeks at a time.

Recovery of a single OSD might cause other OSDs to crash. In the primary cluster, I was always able to get it under control before it cascaded too wide. In my secondary cluster, it did spiral out to 40% of the OSDs, with 2-5 OSDs down at any time. I traced my problems to a combination of osd max backfills being too high for my cluster and mkfs.xfs arguments that were causing memory starvation issues. I lowered osd max backfills, added SSD journals, and reformatted every OSD with better mkfs.xfs arguments.

Now both clusters are stable, and I don't want to break them. I only have 45 OSDs, so the risk of a 24-48 hour recovery time is acceptable to me. It will be a problem as I scale up, but scaling up will also help with the latency problems.

On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson mike.daw...@cloudapt.com wrote:

We use 3x replication and have drives that have relatively high steady-state IOPS. Therefore, we tend to prioritize client-side IO more than a reduction from 3 copies to 2 during the loss of one disk. The disruption to client IO is so great on our cluster that we don't want the cluster to be in a recovery state without operator supervision.

Letting OSDs get marked out without operator intervention was a disaster in the early going of our cluster. For example, an OSD daemon crash would trigger automatic recovery where it was unneeded. Ironically, the unneeded recovery would often trigger additional daemons to crash, making a bad situation worse. During the recovery, rbd client IO would oftentimes go to 0.

To deal with this issue, we set mon osd down out interval = 14400, so as operators we have 4 hours to intervene before Ceph attempts to self-heal. When hardware is at fault, we remove the OSD, replace the drive, re-add the OSD, then allow backfill to begin, thereby completely skipping step B in your timeline above. - Mike
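A minimal sketch of the interventions described above (the option and flag names are real; 14400 seconds = 4 hours is the value Mike quotes):

# in ceph.conf on the monitors: delay automatic out-marking by 4 hours
[mon]
mon osd down out interval = 14400

# or pause out-marking entirely while an operator works on the cluster:
$ ceph osd set noout
$ ceph osd unset noout    # re-enable self-healing when done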
Re: [ceph-users] Best practice K/M-parameters EC pool
On 8/28/2014 4:17 PM, Craig Lewis wrote:
> My initial experience was similar to Mike's, causing a similar level of paranoia. :-) I'm dealing with RadosGW though, so I can tolerate higher latencies. I was running my cluster with noout and nodown set for weeks at a time.

I'm sure Craig will agree, but I wanted to add this for other readers: I find value in the noout flag for temporary intervention, but prefer to set mon osd down out interval for dealing with events that may occur in the future, to give an operator time to intervene.

The nodown flag is another beast altogether. The nodown flag tends to be *a bad thing* when attempting to provide reliable client IO. For our use case, we want OSDs to be marked down quickly if they are in fact unavailable for any reason, so client IO doesn't hang waiting for them. If OSDs are flapping during recovery (i.e. the "wrongly marked me down" log messages), I've found far superior results by tuning the recovery knobs than by permanently setting the nodown flag. - Mike

[The remainder of Craig's message and the exchange it quotes, reproduced in full in the previous message, is trimmed here.]
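The "recovery knobs" Mike mentions are, in this era of Ceph, roughly the following. A hedged example: the option names exist, but these values are illustrative, not Mike's:

$ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
$ ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
# client ops run at priority 63 by default, so recovery work at priority 1
# yields to client IO instead of starving it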
Re: [ceph-users] Best practice K/M-parameters EC pool
On 27/08/2014 04:34, Christian Balzer wrote:

[Christian's message, with his single-node evacuate/refill measurements, is quoted in full here; see his original message below.]

That makes sense to me :-) When I wrote 1h, I thought about what happens when an OSD becomes unavailable with no planning in advance. In the scenario you describe the risk of data loss does not increase since the objects are evicted gradually from the disk being decommissioned and the number of replicas stays the same at all times. There is not a sudden drop in the number of replicas, which is what I had in mind.

If the lost OSD was part of 100 PGs, the other disks (let's say 50 of them) will start transferring a new replica of the objects they have to the new OSD in their PG. The replacement will not be a single OSD, although nothing prevents the same OSD from being used in more than one PG as a replacement for the lost one. If the cluster network is connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new duplicates do not originate from a single OSD but from at least dozens of them, and since they target more than one OSD, I assume we can expect an actual throughput of 5Gb/s. I should have written 2h instead of 1h to account for the fact that the cluster network is never idle. Am I being too optimistic ?

Do you see another blocking factor that would significantly slow down recovery ?

Cheers

[The remainder of Christian's quoted message, including the Craig Lewis exchange it carries, is trimmed; the originals appear below.]
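Spelling out the arithmetic behind the 2h figure (4 TB over the 5 Gb/s Loic assumes is left on a half-busy 10Gb/s link):

$ echo "scale=1; 4 * 8 * 1000 / 5 / 3600" | bc
1.7
# ~1.8 hours if recovery were network-bound end to end, hence "2h" with a
# little headroom. The replies above argue the disks give out long before
# the network does.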
Re: [ceph-users] Best practice K/M-parameters EC pool
Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
> Hi Craig, I assume the reason for the 48 hours recovery time is to keep the cost of the cluster low ? I wrote 1h recovery time because it is roughly the time it would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to reduce the recovery time to less than two hours ? Or are there factors other than cost that prevent this ?

I doubt Craig is operating on a shoestring budget. And even if his network were just GbE, that would still make it only 10 hours according to your wishful-thinking formula. He probably has set max_backfills to 1 because that is the level of I/O his OSDs can handle without degrading cluster performance too much. The network is unlikely to be the limiting factor.

The way I see it, most Ceph clusters are in a sort of steady state when operating normally: a few hundred VM RBD images ticking over, most actual OSD disk ops are writes, as nearly all hot objects that are being read are in the page cache of the storage nodes. Easy peasy. Until something happens that breaks this routine, like a deep scrub, all those VMs rebooting at the same time, or a backfill caused by a failed OSD. Now all of a sudden client ops compete with the backfill ops, page caches are no longer hot, the spinners are seeking left and right. Pandemonium. I doubt very much that even with an SSD-backed cluster you would get away with less than 2 hours for 4TB.

To give you some real-life numbers: I am currently building a new cluster, but for the time being have only one storage node to play with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs and 8 actual OSD HDDs (3TB, 7200RPM), with 90GB of (test) data on it. I took out one OSD (reweight 0 first, then the usual removal steps) because the actual disk was wonky, then replaced the disk and re-added the OSD. Both operations took about the same time: 4 minutes for evacuating the OSD (having 7 write targets clearly helped), or about 50MB/s for a measly 12GB, and 5 minutes or about 35MB/s for refilling the OSD. And that is on one node (thus no network latency) with the default parameters (so max_backfills of 10) which was otherwise totally idle. In other words, in this pretty ideal case it would have taken 22 hours to re-distribute 4TB.

More in another reply.

Cheers

[Craig Lewis's message and Loic's corrected calculation, quoted here, appear in full below.]
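Extrapolating Christian's measured rate to a full-size OSD (his numbers: 12GB evacuated at ~50MB/s; the 4TB disk is Loic's hypothetical):

$ echo "scale=1; 4 * 1000 * 1000 / 50 / 3600" | bc
22.2
# hours to move 4 TB at 50 MB/s, matching the 22 hours above - and that is
# a single otherwise-idle node with max_backfills at the default of 10.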
Re: [ceph-users] Best practice K/M-parameters EC pool
Hello,

On Tue, 26 Aug 2014 10:23:43 +1000 Blair Bethwaite wrote:

[Blair's analysis is quoted in full here; see his message below.]

As the OP of the "Failure probability with largish deployments" thread, I have to thank Blair for raising this issue again and doing the hard math below, which looks fine to me. At the end of that slightly inconclusive thread I walked away with the same impression as Blair, namely that the survival of PGs is the key factor, and that they will likely be spread out over most, if not all, of the OSDs.

That in turn reinforced my decision to deploy our first production Ceph cluster based on nodes with 2 OSDs backed by 11-disk RAID6 sets behind a HW RAID controller with 4GB cache AND SSD journals. I can live with the reduced performance (which is caused by the OSD code running out of steam long before the SSDs or the RAIDs do), because not only do I save 1/3rd of the space and 1/4th of the cost compared to a replication-3 cluster, the total number of disks that need to fail within the recovery window to cause data loss is now 4.

The next cluster I'm currently building is a classic Ceph design (replication of 3, 8 OSD HDDs and 4 journal SSDs per node), because with this cluster I won't have predictable I/O patterns and loads. OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with the odds here.

I think doing the exact maths for a cluster of the size you're planning would be very interesting and also very much needed. 3.5PB of usable space would be close to 3000 disks with a replication of 3; even if you meant that as a gross value, it probably means you're looking at frequent, if not daily, disk failures.

Regards,

Christian

[The quoted remainder of Blair's message is trimmed; it appears in full below.]
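Rough numbers behind that last point (assumptions: 4 TB drives and the 8% AFR used earlier in this thread):

$ echo "3.5 * 1000 * 3 / 4" | bc
2625
# drives for 3.5 PB usable at 3 replicas, so:
$ echo "scale=1; 2625 * 0.08 / 365" | bc
.5
# ~0.5 expected failures per day, i.e. a dead disk roughly every other day.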
Re: [ceph-users] Best practice K/M-parameters EC pool
Hi Blair,

Assuming that:

* The pool is configured for three replicas (size = 3, which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, divided by the number of hours during a year)
* A given disk does not participate in more than 100 PGs

Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two other disks are lost before recovery. Since the disk that failed initially participates in 100 PGs, that is 0.01% x 100 = 0.0001% chance that a PG is lost. Or the entire pool, if it is used in a way such that losing a PG means losing all data in the pool (as in your example, where it contains RBD volumes and each of the RBD volumes uses all the available PGs).

If the pool is using at least two datacenters operated by two different organizations, this calculation makes sense to me. However, if the cluster is in a single datacenter, isn't it possible that some event independent of Ceph has a greater probability of permanently destroying the data ? A month ago I lost three machines in a Ceph cluster and realized on that occasion that the crushmap was not configured properly and that PGs were lost as a result. Fortunately I was able to recover the disks and plug them into another machine to recover the lost PGs. I'm not a system administrator, and the probability of me failing to do the right thing is higher than normal: this is just an example of a high-probability event leading to data loss. In other words, I wonder if this 0.0001% chance of losing a PG within the hour following a disk failure matters, or if it is dominated by other factors. What do you think ?

Cheers

On 26/08/2014 02:23, Blair Bethwaite wrote:

[Blair's message is quoted in full here; see it below.]
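Checking that arithmetic with raw numbers instead of percentages (0.001% = 1e-5; this is the percentage-handling slip Loic corrects himself further down the thread):

$ echo "scale=14; p = 0.00001; p * p * 100" | bc
.0000000100
# two further failures across 100 PGs works out to ~1e-8 (0.000001%),
# not 0.0001%; the corrected 1/100,000-based version below reaches the
# same ~1/10^8 figure.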
Re: [ceph-users] Best practice K/M-parameters EC pool
My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max backfills = 1). I believe that increases my risk of failure by 48^2. Since your numbers are failure rate per hour per disk, I need to consider the risk for the whole time for each disk. So more formally, rebuild time to the power of (replicas - 1).

So I'm at 2304/100,000,000, or approximately 1/43,000. That's a much higher risk than 1/10^8. A risk of 1/43,000 means that I'm more likely to lose data due to human error than disk failure. Still, I can put a small bit of effort in to optimize recovery speed and lower this number. Managing human error is much harder.

On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org wrote:

Using percentages instead of numbers led me to calculation errors. Here it is again using 1/100 instead of % for clarity ;-)

Assuming that:

* The pool is configured for three replicas (size = 3, which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 1/100,000 chance to fail within the hour following the failure of the first disk (assuming the AFR https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, divided by the number of hours during a year: (0.08 / 8760) ~= 1/100,000)
* A given disk does not participate in more than 100 PGs
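The same numbers, spelled out (8% AFR, 100 PGs per disk, three replicas; 1 h versus Craig's 48 h recovery window):

$ echo "scale=12; p = 0.08 / 8760; p * p * 100" | bc
.000000008300
# ~1/10^8 chance of losing a PG per disk failure with a 1 h window.
# A 48 h window scales each of the two remaining replicas' failure
# chances by 48, i.e. 48^2 = 2304 times worse:
$ echo "100000000 / 2304" | bc
43402
# ~1/43,000, matching Craig.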
Re: [ceph-users] Best practice K/M-parameters EC pool
Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost of the cluster low ? I wrote 1h recovery time because it is roughly the time it would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to reduce the recovery time to less than two hours ? Or are there factors other than cost that prevent this ?

Cheers

On 26/08/2014 19:37, Craig Lewis wrote:

[Craig's message and the corrected calculation it quotes appear in full in the previous message and are trimmed here.]
Re: [ceph-users] Best practice K/M-parameters EC pool
Message: 25 Date: Fri, 15 Aug 2014 15:06:49 +0200 From: Loic Dachary l...@dachary.org To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com Subject: Re: [ceph-users] Best practice K/M-parameters EC pool Message-ID: 53ee05e9.1040...@dachary.org Content-Type: text/plain; charset=iso-8859-1

> ... Here is how I reason about it, roughly: if the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1*0.1 = 0.01%, and three disks becomes 0.1*0.1*0.1 = 0.001%, and four disks becomes 0.0001%.

I watched this conversation and an older similar one (Failure probability with largish deployments) with interest, as we are in the process of planning a pretty large Ceph cluster (~3.5 PB), so I have been trying to wrap my head around these issues.

Loic's reasoning (above) seems sound as a naive approximation assuming independent probabilities for disk failures, which may not be quite true given the potential for batch production issues, but should be okay for other sorts of correlations (assuming a sane crushmap that eliminates things like controllers and nodes as sources of correlation).

One of the things that came up in the "Failure probability with largish deployments" thread, and has raised its head again here, is the idea that striped data (e.g., RADOS-GW objects and RBD volumes) might be somehow more prone to data loss than non-striped. I don't think anyone has so far provided an answer on this, so here's my thinking...

The level of atomicity that matters when looking at durability/availability in Ceph is the Placement Group. For any non-trivial RBD it is likely that many RBDs will span all/most PGs; e.g., even a relatively small 50GiB volume would (with the default 4MiB object size) span 12800 objects, and hence up to 12800 PGs - more than there are in many production clusters obeying the 100-200 PGs per drive rule of thumb. IMPORTANT: Losing any one PG will cause data loss. The failure-probability effects of striping across multiple PGs are immaterial considering that loss of any single PG is likely to damage all your RBDs. This might be why the reliability calculator doesn't consider the total number of disks.

Related to all this is the durability of 2 versus 3 replicas (or e.g. M=1 for Erasure Coding). It's easy to get caught up in the worrying fallacy that losing any M OSDs will cause data loss, but this isn't true - they have to be members of the same PG for data loss to occur. So then it's tempting to think the chances of that happening are so slim as to not matter, and why would we ever even need 3 replicas. I mean, what are the odds of exactly those 2 drives, out of the 100, 200... in my cluster, failing in the recovery window?! But therein lies the rub - you should be thinking about PGs. If a drive fails, then the chance of a resulting data-loss event depends on the chances of losing further drives from the affected/degraded PGs.

I've got a real cluster at hand, so let's use that as an example. We have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15 dies. How many PGs are now at risk:

$ grep ^10\. pg.dump | awk '{print $15}' | grep 15 | wc
109 109 861

(NB: 10 is the pool id, pg.dump is a text file dump of ceph pg dump, $15 is the acting set column)

109 PGs now living on the edge. No surprises in that number, as we used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so on average any one OSD will be primary for 50 PGs and replica for another 50. But this doesn't tell me how exposed I am; for that I need to know how many neighbouring OSDs there are in these 109 PGs:

$ grep ^10\. pg.dump | awk '{print $15}' | grep 15 | sed 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
67 67 193

(NB: grep-ing for OSD 15 and using sed to remove it and the surrounding formatting to get just the neighbour id)

Yikes! So if any one of those 67 drives fails during recovery of OSD 15, then we've lost data. On average we should expect this to be determined by our crushmap, which in this case splits the cluster up into 2 top-level failure domains, so I'd guess it's the probability of 1 in 48 drives failing on average for this cluster. But actually looking at the numbers for each OSD, it is higher than that here - the lowest distinct neighbour count we have is 50. Note that we haven't tuned any of the options in our crushmap, so I guess maybe Ceph favours fewer repeat sets by default when coming up with PGs(?). Anyway, here's the average and top 10 neighbour counts (hope this scripting is right! ;-):

$ for OSD in {0..95}; do echo -ne $OSD\t; grep ^10\. pg.dump | awk '{print $15}' | grep \[${OSD},\|,${OSD}\] | sed s/\[$OSD,\(.*\)\]/\1/ | sed s/\[\(.*\),$OSD\]/\1/ | sort | uniq | wc -l; done | awk '{ total += $2 } END { print total/NR }'
58.5208

$ for OSD in {0..95}; do echo -ne $OSD\t; grep ^10\. pg.dump | awk '{print [...]
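For anyone wanting to repeat Blair's check, here is a commented variant (assumptions as in his post: pool id 10, two replicas, pg.dump holding the output of ceph pg dump; the awk/tr pipeline is mine, not his):

$ ceph pg dump > pg.dump
# PGs whose acting set contains osd.15:
$ grep '^10\.' pg.dump | awk '$15 ~ /\[15,|,15\]/' | wc -l
# distinct surviving neighbours in those PGs; any of these failing
# during recovery of osd.15 means data loss at size=2:
$ grep '^10\.' pg.dump | awk '$15 ~ /\[15,|,15\]/ {print $15}' | tr -d '[]' | tr ',' '\n' | grep -vx 15 | sort -un | wc -l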
Re: [ceph-users] Best practice K/M-parameters EC pool
Hi Erik,

On 15/08/2014 11:54, Erik Logtenberg wrote:
> Hi,
>
> With EC pools in Ceph you are free to choose any K and M parameters you
> like. The documentation explains what K and M do, so far so good. Now,
> there are certain combinations of K and M that appear to have more or
> less the same result. Do any of these combinations have pros and cons
> that I should consider, and/or are there best practices for choosing the
> right K/M parameters?
>
> For instance, if I choose K = 3 and M = 2, then PGs in this pool will use
> 5 OSDs and sustain the loss of 2 OSDs. There is 40% overhead in this
> configuration. Now, if I were to choose K = 6 and M = 4, I would end up
> with PGs that use 10 OSDs and sustain the loss of 4 OSDs, which is
> statistically not so much different from the first configuration. Also
> there is the same 40% overhead.

Although I don't have numbers in mind, I think the odds of losing two OSDs simultaneously are a lot smaller than the odds of losing four OSDs simultaneously. Or am I misunderstanding you when you write "statistically not so much different from the first configuration"?

Cheers

> One rather obvious difference between the two configurations is that the
> latter requires a cluster with at least 10 OSDs to make sense. But let's
> say we have such a cluster: which of the two configurations would be
> recommended, and why?
>
> Thanks,
> Erik.

--
Loïc Dachary, Artisan Logiciel Libre
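For reference, the overhead and fault-tolerance figures Erik quotes fall straight out of K and M. A quick Python sketch (pure arithmetic, no Ceph involved; illustrative only):

    # Storage overhead and fault tolerance implied by an EC profile.
    def ec_profile(k, m):
        raw_per_usable = (k + m) / float(k)  # raw bytes stored per usable byte
        parity_fraction = m / float(k + m)   # share of raw capacity spent on parity
        return raw_per_usable, parity_fraction

    for k, m in ((3, 2), (6, 4)):
        raw, parity = ec_profile(k, m)
        print("k=%d m=%d: %.2fx raw per usable byte, %.0f%% parity, "
              "survives %d OSD losses" % (k, m, raw, 100 * parity, m))

Both profiles come out at 40% parity and 1.67x raw consumption, which is exactly why the question reduces to durability and performance rather than capacity.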
Re: [ceph-users] Best practice K/M-parameters EC pool
On 08/15/2014 12:23 PM, Loic Dachary wrote:
> Hi Erik,
>
> On 15/08/2014 11:54, Erik Logtenberg wrote:
>> Hi,
>>
>> With EC pools in Ceph you are free to choose any K and M parameters you
>> like. The documentation explains what K and M do, so far so good. Now,
>> there are certain combinations of K and M that appear to have more or
>> less the same result. Do any of these combinations have pros and cons
>> that I should consider, and/or are there best practices for choosing the
>> right K/M parameters?

Loic might have a better answer, but I think that the more segments (K) you have, the heavier the recovery: you have to contact more OSDs to reconstruct the whole object, so more disks end up doing seeks. I heard somebody from Fujitsu say that he thought 8/3 was best for most situations. That wasn't with Ceph though, but with a different system which implemented erasure coding.

>> For instance, if I choose K = 3 and M = 2, then PGs in this pool will use
>> 5 OSDs and sustain the loss of 2 OSDs. There is 40% overhead in this
>> configuration. Now, if I were to choose K = 6 and M = 4, I would end up
>> with PGs that use 10 OSDs and sustain the loss of 4 OSDs, which is
>> statistically not so much different from the first configuration. Also
>> there is the same 40% overhead.
>
> Although I don't have numbers in mind, I think the odds of losing two
> OSDs simultaneously are a lot smaller than the odds of losing four OSDs
> simultaneously. Or am I misunderstanding you when you write "statistically
> not so much different from the first configuration"?

Losing two is smaller than losing four? Is that correct, or did you mean it the other way around? I'd say that losing four OSDs simultaneously is less likely to happen than two simultaneously.

> Cheers
>
>> One rather obvious difference between the two configurations is that the
>> latter requires a cluster with at least 10 OSDs to make sense. But let's
>> say we have such a cluster: which of the two configurations would be
>> recommended, and why?
>>
>> Thanks,
>> Erik.

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
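Wido's recovery point can be made concrete: with Reed-Solomon-style codes (as in Ceph's default jerasure plugin), rebuilding one missing chunk generally requires reading K of the surviving chunks. A rough sketch of the read fan-out per object (illustrative arithmetic only; the 4MiB object size is the RBD default, not a measurement):

    # Read fan-out to rebuild one lost chunk of a single object.
    # Assumes a Reed-Solomon-style code where any k surviving chunks
    # suffice to reconstruct; sizes are illustrative.
    object_size_mib = 4.0
    for k, m in ((3, 2), (6, 4), (8, 3)):
        chunk_mib = object_size_mib / k
        print("k=%d m=%d: read %d chunks of ~%.2f MiB each to rebuild "
              "one lost chunk" % (k, m, k, chunk_mib))

The individual chunks shrink as K grows, but the number of disks that must seek for every recovered object grows with it.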
Re: [ceph-users] Best practice K/M-parameters EC pool
On 08/15/2014 06:24 AM, Wido den Hollander wrote:
> On 08/15/2014 12:23 PM, Loic Dachary wrote:
>> Hi Erik,
>>
>> On 15/08/2014 11:54, Erik Logtenberg wrote:
>>> Hi,
>>>
>>> With EC pools in Ceph you are free to choose any K and M parameters you
>>> like. The documentation explains what K and M do, so far so good. Now,
>>> there are certain combinations of K and M that appear to have more or
>>> less the same result. Do any of these combinations have pros and cons
>>> that I should consider, and/or are there best practices for choosing
>>> the right K/M parameters?
>
> Loic might have a better answer, but I think that the more segments (K)
> you have, the heavier the recovery: you have to contact more OSDs to
> reconstruct the whole object, so more disks end up doing seeks. I heard
> somebody from Fujitsu say that he thought 8/3 was best for most
> situations. That wasn't with Ceph though, but with a different system
> which implemented erasure coding.

Performance is definitely lower with more segments in Ceph. I kind of gravitate toward 4/2 or 6/2, though that's just my own preference.

>>> For instance, if I choose K = 3 and M = 2, then PGs in this pool will
>>> use 5 OSDs and sustain the loss of 2 OSDs. There is 40% overhead in
>>> this configuration. Now, if I were to choose K = 6 and M = 4, I would
>>> end up with PGs that use 10 OSDs and sustain the loss of 4 OSDs, which
>>> is statistically not so much different from the first configuration.
>>> Also there is the same 40% overhead.
>>
>> Although I don't have numbers in mind, I think the odds of losing two
>> OSDs simultaneously are a lot smaller than the odds of losing four OSDs
>> simultaneously. Or am I misunderstanding you when you write
>> "statistically not so much different from the first configuration"?
>
> Losing two is smaller than losing four? Is that correct, or did you mean
> it the other way around? I'd say that losing four OSDs simultaneously is
> less likely to happen than two simultaneously.

This is true, though the more disks you spread your objects across, the higher the likelihood that any given object will be affected by a lost OSD. The extreme case is every object spread across every OSD, where losing any given OSD affects all objects. I suppose the severity depends on the size of your erasure-coding parameters relative to the total number of OSDs. I think this is perhaps what Erik was getting at.

>> Cheers
>>
>>> One rather obvious difference between the two configurations is that
>>> the latter requires a cluster with at least 10 OSDs to make sense. But
>>> let's say we have such a cluster: which of the two configurations would
>>> be recommended, and why?
>>>
>>> Thanks,
>>> Erik.
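Mark's "spread" effect can also be put in rough numbers: if each object's K+M chunks land on a roughly uniform subset of N OSDs, the chance that a given object has a chunk on any one failed OSD is about (K+M)/N. A hedged sketch (CRUSH placement only approximates uniformity, and N=96 is just an example cluster size):

    # Expected fraction of objects touched by a single OSD failure,
    # assuming each object's k+m chunks land on a uniform random
    # subset of the n OSDs (CRUSH only approximates this).
    def fraction_affected(k, m, n):
        return (k + m) / float(n)

    n = 96
    for k, m in ((3, 2), (6, 4), (48, 48)):  # last one: the extreme case
        print("k=%d m=%d on %d OSDs: ~%.0f%% of objects have a chunk on "
              "any one failed OSD" % (k, m, n, 100 * fraction_affected(k, m, n)))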
Re: [ceph-users] Best practice K/M-parameters EC pool
>>>> Now, there are certain combinations of K and M that appear to have more
>>>> or less the same result. Do any of these combinations have pros and
>>>> cons that I should consider, and/or are there best practices for
>>>> choosing the right K/M parameters?
>>
>> Loic might have a better answer, but I think that the more segments (K)
>> you have, the heavier the recovery: you have to contact more OSDs to
>> reconstruct the whole object, so more disks end up doing seeks. I heard
>> somebody from Fujitsu say that he thought 8/3 was best for most
>> situations. That wasn't with Ceph though, but with a different system
>> which implemented erasure coding.
>
> Performance is definitely lower with more segments in Ceph. I kind of
> gravitate toward 4/2 or 6/2, though that's just my own preference.

This is indeed the kind of pros and cons I was thinking about. Performance-wise I would expect differences, but I can think of both positive and negative effects of bigger values of K. For instance, yes, recovery involves more OSDs with bigger values of K, but it seems to me that there are also fewer or smaller items to recover. Also, read performance generally appears to benefit from having a bigger cluster (more parallelism), so I can imagine that bigger values of K also provide an increase in read performance. Mark says more segments hurt performance though - are you referring just to rebuild performance, or also to basic operational performance (read/write)?

>>>> For instance, if I choose K = 3 and M = 2, then PGs in this pool will
>>>> use 5 OSDs and sustain the loss of 2 OSDs. There is 40% overhead in
>>>> this configuration. Now, if I were to choose K = 6 and M = 4, I would
>>>> end up with PGs that use 10 OSDs and sustain the loss of 4 OSDs, which
>>>> is statistically not so much different from the first configuration.
>>>> Also there is the same 40% overhead.
>>>
>>> Although I don't have numbers in mind, I think the odds of losing two
>>> OSDs simultaneously are a lot smaller than the odds of losing four OSDs
>>> simultaneously. Or am I misunderstanding you when you write
>>> "statistically not so much different from the first configuration"?
>>
>> Losing two is smaller than losing four? Is that correct, or did you mean
>> it the other way around? I'd say that losing four OSDs simultaneously is
>> less likely to happen than two simultaneously.
>
> This is true, though the more disks you spread your objects across, the
> higher the likelihood that any given object will be affected by a lost
> OSD. The extreme case is every object spread across every OSD, where
> losing any given OSD affects all objects. I suppose the severity depends
> on the size of your erasure-coding parameters relative to the total
> number of OSDs. I think this is perhaps what Erik was getting at.

I haven't done the actual calculations, but given some % chance of disk failure, I would assume that losing x out of y disks has roughly the same chance as losing 2*x out of 2*y disks over the same period. That's also why you generally want to limit RAID5 arrays to maybe 6 disks or so and move to RAID6 for bigger arrays. For arrays bigger than 20 disks you would usually split those into separate arrays, just to keep the (parity disks / total disks) fraction high enough.

With regard to data safety I would guess that 3+2 and 6+4 are roughly equal, although the behaviour of 6+4 is probably easier to predict, because bigger numbers make your calculations less dependent on individual deviations in reliability. Do you guys feel this argument is valid?

Erik.
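Erik's "x out of y vs 2*x out of 2*y" intuition can be checked with a simple binomial model: assume each of the K+M chunk-holding disks fails independently with probability p during the recovery window, and data is lost once more than M of them fail. A rough Python sketch (p is a made-up value, and real failures are neither independent nor uniform):

    # Binomial back-of-the-envelope: P(data loss within one PG's set of
    # k+m disks) = P(more than m of them fail during the window).
    # Assumes independent failures with illustrative probability p.
    from math import comb  # Python 3.8+

    def p_loss(k, m, p):
        n = k + m
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(m + 1, n + 1))

    p = 0.001
    for k, m in ((3, 2), (6, 4)):
        print("k=%d m=%d: P(loss) ~ %.3e" % (k, m, p_loss(k, m, p)))

Under those (strong) independence assumptions, 6+4 comes out several orders of magnitude more durable than 3+2, so the two profiles are not quite interchangeable even at equal overhead.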
Re: [ceph-users] Best practice K/M-parameters EC pool
On 15/08/2014 13:24, Wido den Hollander wrote:
> On 08/15/2014 12:23 PM, Loic Dachary wrote:
>> Hi Erik,
>>
>> On 15/08/2014 11:54, Erik Logtenberg wrote:
>>> Hi,
>>>
>>> With EC pools in Ceph you are free to choose any K and M parameters you
>>> like. The documentation explains what K and M do, so far so good. Now,
>>> there are certain combinations of K and M that appear to have more or
>>> less the same result. Do any of these combinations have pros and cons
>>> that I should consider, and/or are there best practices for choosing
>>> the right K/M parameters?
>
> Loic might have a better answer, but I think that the more segments (K)
> you have, the heavier the recovery: you have to contact more OSDs to
> reconstruct the whole object, so more disks end up doing seeks. I heard
> somebody from Fujitsu say that he thought 8/3 was best for most
> situations. That wasn't with Ceph though, but with a different system
> which implemented erasure coding.
>
>>> For instance, if I choose K = 3 and M = 2, then PGs in this pool will
>>> use 5 OSDs and sustain the loss of 2 OSDs. There is 40% overhead in
>>> this configuration. Now, if I were to choose K = 6 and M = 4, I would
>>> end up with PGs that use 10 OSDs and sustain the loss of 4 OSDs, which
>>> is statistically not so much different from the first configuration.
>>> Also there is the same 40% overhead.
>>
>> Although I don't have numbers in mind, I think the odds of losing two
>> OSDs simultaneously are a lot smaller than the odds of losing four OSDs
>> simultaneously. Or am I misunderstanding you when you write
>> "statistically not so much different from the first configuration"?
>
> Losing two is smaller than losing four? Is that correct, or did you mean
> it the other way around?

Right, sorry for the confusion, I meant the other way around :-)

> I'd say that losing four OSDs simultaneously is less likely to happen
> than two simultaneously.
>
>> Cheers
>>
>>> One rather obvious difference between the two configurations is that
>>> the latter requires a cluster with at least 10 OSDs to make sense. But
>>> let's say we have such a cluster: which of the two configurations would
>>> be recommended, and why?
>>>
>>> Thanks,
>>> Erik.

--
Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] Best practice K/M-parameters EC pool
On 15/08/2014 14:36, Erik Logtenberg wrote:
>>>>> Now, there are certain combinations of K and M that appear to have
>>>>> more or less the same result. Do any of these combinations have pros
>>>>> and cons that I should consider, and/or are there best practices for
>>>>> choosing the right K/M parameters?
>>>
>>> Loic might have a better answer, but I think that the more segments (K)
>>> you have, the heavier the recovery: you have to contact more OSDs to
>>> reconstruct the whole object, so more disks end up doing seeks. I heard
>>> somebody from Fujitsu say that he thought 8/3 was best for most
>>> situations. That wasn't with Ceph though, but with a different system
>>> which implemented erasure coding.
>>
>> Performance is definitely lower with more segments in Ceph. I kind of
>> gravitate toward 4/2 or 6/2, though that's just my own preference.
>
> This is indeed the kind of pros and cons I was thinking about.
> Performance-wise I would expect differences, but I can think of both
> positive and negative effects of bigger values of K. For instance, yes,
> recovery involves more OSDs with bigger values of K, but it seems to me
> that there are also fewer or smaller items to recover. Also, read
> performance generally appears to benefit from having a bigger cluster
> (more parallelism), so I can imagine that bigger values of K also provide
> an increase in read performance. Mark says more segments hurt performance
> though - are you referring just to rebuild performance, or also to basic
> operational performance (read/write)?
>
>>>>> For instance, if I choose K = 3 and M = 2, then PGs in this pool will
>>>>> use 5 OSDs and sustain the loss of 2 OSDs. There is 40% overhead in
>>>>> this configuration. Now, if I were to choose K = 6 and M = 4, I would
>>>>> end up with PGs that use 10 OSDs and sustain the loss of 4 OSDs, which
>>>>> is statistically not so much different from the first configuration.
>>>>> Also there is the same 40% overhead.
>>>>
>>>> Although I don't have numbers in mind, I think the odds of losing two
>>>> OSDs simultaneously are a lot smaller than the odds of losing four
>>>> OSDs simultaneously. Or am I misunderstanding you when you write
>>>> "statistically not so much different from the first configuration"?
>>>
>>> Losing two is smaller than losing four? Is that correct, or did you
>>> mean it the other way around? I'd say that losing four OSDs
>>> simultaneously is less likely to happen than two simultaneously.
>>
>> This is true, though the more disks you spread your objects across, the
>> higher the likelihood that any given object will be affected by a lost
>> OSD. The extreme case is every object spread across every OSD, where
>> losing any given OSD affects all objects. I suppose the severity depends
>> on the size of your erasure-coding parameters relative to the total
>> number of OSDs. I think this is perhaps what Erik was getting at.
>
> I haven't done the actual calculations, but given some % chance of disk
> failure, I would assume that losing x out of y disks has roughly the same
> chance as losing 2*x out of 2*y disks over the same period. That's also
> why you generally want to limit RAID5 arrays to maybe 6 disks or so and
> move to RAID6 for bigger arrays. For arrays bigger than 20 disks you
> would usually split those into separate arrays, just to keep the (parity
> disks / total disks) fraction high enough.
>
> With regard to data safety I would guess that 3+2 and 6+4 are roughly
> equal, although the behaviour of 6+4 is probably easier to predict,
> because bigger numbers make your calculations less dependent on
> individual deviations in reliability. Do you guys feel this argument is
> valid?
Here is how I reason about it, roughly: if the probability of losing a disk is 0.1, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.1*0.1 = 0.01, three disks becomes 0.1*0.1*0.1 = 0.001, and four disks becomes 0.0001.

Accurately calculating the reliability of the system as a whole is a lot more complex (see https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ for more information).

Cheers

> Erik.

--
Loïc Dachary, Artisan Logiciel Libre
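Expressed as code, that back-of-the-envelope model is just p raised to the number of simultaneous failures. A deliberately naive Python sketch (it ignores cluster size, PG layout, correlated failures, and recovery time, exactly as the linked durability model discusses):

    # Loic's rough model: P(n simultaneous losses) = p**n, where p is
    # the chance a given disk fails before recovery completes.
    # p = 0.1 mirrors the figures in the text; illustrative only.
    p = 0.1
    for n in range(1, 5):
        print("%d disk(s) lost before recovery completes: %g" % (n, p**n))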
Re: [ceph-users] Best practice K/M-parameters EC pool
>> I haven't done the actual calculations, but given some % chance of disk
>> failure, I would assume that losing x out of y disks has roughly the
>> same chance as losing 2*x out of 2*y disks over the same period. That's
>> also why you generally want to limit RAID5 arrays to maybe 6 disks or so
>> and move to RAID6 for bigger arrays. For arrays bigger than 20 disks you
>> would usually split those into separate arrays, just to keep the (parity
>> disks / total disks) fraction high enough.
>>
>> With regard to data safety I would guess that 3+2 and 6+4 are roughly
>> equal, although the behaviour of 6+4 is probably easier to predict,
>> because bigger numbers make your calculations less dependent on
>> individual deviations in reliability. Do you guys feel this argument is
>> valid?
>
> Here is how I reason about it, roughly: if the probability of losing a
> disk is 0.1, the probability of losing two disks simultaneously (i.e.
> before the failure can be recovered) would be 0.1*0.1 = 0.01, three disks
> becomes 0.1*0.1*0.1 = 0.001, and four disks becomes 0.0001.
>
> Accurately calculating the reliability of the system as a whole is a lot
> more complex (see
> https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/
> for more information).
>
> Cheers

Okay, I see that in your calculation you leave the total number of disks completely out of the equation.

The link you provided is very useful indeed and does some actual calculations. Interestingly, the example on the details page [1] uses k=32 and m=32, for a total of 64 blocks. Those are very much bigger values than Mark Nelson mentioned earlier. Is that example merely meant to demonstrate the theoretical advantages, or would you actually recommend using those numbers in practice? Let's assume that we have at least 64 OSDs available - would you recommend k=32 and m=32?

[1] https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model
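For scale, here is how the wiki's k=32/m=32 example compares with the smaller profiles discussed earlier in the thread (pure arithmetic, illustrative only; the 4/2 and 8/3 profiles are the ones Mark and the Fujitsu anecdote mentioned):

    # Comparing the wiki example (k=32, m=32) with smaller profiles.
    for k, m in ((4, 2), (8, 3), (32, 32)):
        print("k=%d m=%d: tolerates %d losses, %.0f%% of raw capacity "
              "is parity, each object spread over %d OSDs"
              % (k, m, m, 100.0 * m / (k + m), k + m))

The 32+32 profile buys enormous loss tolerance at 50% overhead, but every object then touches 64 OSDs, with the recovery fan-out and spread effects discussed above.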
Re: [ceph-users] Best practice K/M-parameters EC pool
On 15/08/2014 15:42, Erik Logtenberg wrote:
>>> I haven't done the actual calculations, but given some % chance of disk
>>> failure, I would assume that losing x out of y disks has roughly the
>>> same chance as losing 2*x out of 2*y disks over the same period. That's
>>> also why you generally want to limit RAID5 arrays to maybe 6 disks or
>>> so and move to RAID6 for bigger arrays. For arrays bigger than 20 disks
>>> you would usually split those into separate arrays, just to keep the
>>> (parity disks / total disks) fraction high enough.
>>>
>>> With regard to data safety I would guess that 3+2 and 6+4 are roughly
>>> equal, although the behaviour of 6+4 is probably easier to predict,
>>> because bigger numbers make your calculations less dependent on
>>> individual deviations in reliability. Do you guys feel this argument is
>>> valid?
>>
>> Here is how I reason about it, roughly: if the probability of losing a
>> disk is 0.1, the probability of losing two disks simultaneously (i.e.
>> before the failure can be recovered) would be 0.1*0.1 = 0.01, three
>> disks becomes 0.1*0.1*0.1 = 0.001, and four disks becomes 0.0001.
>>
>> Accurately calculating the reliability of the system as a whole is a lot
>> more complex (see
>> https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/
>> for more information).
>>
>> Cheers
>
> Okay, I see that in your calculation you leave the total number of disks
> completely out of the equation.

Yes. If you have a small number of disks, I'm not sure how to calculate the durability. For instance, if I have a 50-disk cluster within a single rack, the durability is dominated by the probability that the rack is set on fire, and increasing m from 3 to 5 is most certainly pointless ;-)

> The link you provided is very useful indeed and does some actual
> calculations. Interestingly, the example on the details page [1] uses
> k=32 and m=32, for a total of 64 blocks. Those are very much bigger
> values than Mark Nelson mentioned earlier. Is that example merely meant
> to demonstrate the theoretical advantages, or would you actually
> recommend using those numbers in practice? Let's assume that we have at
> least 64 OSDs available - would you recommend k=32 and m=32?

It is theoretical: I'm not aware of any Ceph use case requiring that kind of setting. There may be a use case though - it's not absurd, just not common. I would be happy to hear about it.

Cheers

> [1] https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model

--
Loïc Dachary, Artisan Logiciel Libre