Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Loic Dachary


On 28/08/2014 16:29, Mike Dawson wrote:
 On 8/28/2014 12:23 AM, Christian Balzer wrote:
 On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:



 On 27/08/2014 04:34, Christian Balzer wrote:

 Hello,

 On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

 Hi Craig,

 I assume the reason for the 48 hours recovery time is to keep the cost
 of the cluster low ? I wrote 1h recovery time because it is roughly
 the time it would take to move 4TB over a 10Gb/s link. Could you
 upgrade your hardware to reduce the recovery time to less than two
 hours ? Or are there factors other than cost that prevent this ?


 I doubt Craig is operating on a shoestring budget.
 And even if his network were to be just GbE, that would still make it
 only 10 hours according to your wishful thinking formula.

 He probably has set the max_backfills to 1 because that is the level of
 I/O his OSDs can handle w/o degrading cluster performance too much.
 The network is unlikely to be the limiting factor.

 The way I see it most Ceph clusters are in sort of steady state when
 operating normally, i.e. a few hundred VM RBD images ticking over, most
 actual OSD disk ops are writes, as nearly all hot objects that are
 being read are in the page cache of the storage nodes.
 Easy peasy.

 Until something happens that breaks this routine, like a deep scrub,
 all those VMs rebooting at the same time or a backfill caused by a
 failed OSD. Now all of a sudden client ops compete with the backfill
 ops, page caches are no longer hot, the spinners are seeking left and
 right. Pandemonium.

 I doubt very much that even with a SSD backed cluster you would get
 away with less than 2 hours for 4TB.

 To give you some real life numbers, I currently am building a new
 cluster but for the time being have only one storage node to play with.
 It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

 So I took out one OSD (reweight 0 first, then the usual removal steps)
 because the actual disk was wonky. Replaced the disk and re-added the
 OSD. Both operations took about the same time, 4 minutes for
 evacuating the OSD (having 7 write targets clearly helped) for a measly
 12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
 OSD. And that is on one node (thus no network latency) that has the
 default parameters (so a max_backfill of 10) which was otherwise
 totally idle.

 In other words, in this pretty ideal case it would have taken 22 hours
 to re-distribute 4TB.

 That makes sense to me :-)

 When I wrote 1h, I thought about what happens when an OSD becomes
 unavailable with no planning in advance. In the scenario you describe
 the risk of data loss does not increase since the objects are evicted
 gradually from the disk being decommissioned and the number of replicas
 stays the same at all times. There is not a sudden drop in the number of
 replicas, which is what I had in mind.

 That may be, but I'm rather certain that there is no difference in speed
 and priority of a rebalancing caused by an OSD set to weight 0 or one
 being set out.

 If the lost OSD was part of 100 PG, the other disks (let's say 50 of them)
 will start transferring a new replica of the objects they have to the
 new OSD in their PG. The replacement will not be a single OSD although
 nothing prevents the same OSD from being used in more than one PG as a
 replacement for the lost one. If the cluster network is connected at
 10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
 duplicates do not originate from a single OSD but from at least dozens
 of them and since they target more than one OSD, I assume we can expect
 an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
 account for the fact that the cluster network is never idle.

 Am I being too optimistic ?
 Vastly.

 Do you see another blocking factor that
 would significantly slow down recovery ?

 As Craig and I keep telling you, the network is not the limiting factor.
 Concurrent disk IO is, as I pointed out in the other thread.
 
 Completely agree.
 
 On a production cluster with OSDs backed by spindles, even with OSD journals 
 on SSDs, it is insufficient to calculate single-disk replacement backfill 
 time based solely on network throughput. IOPS will likely be the limiting 
 factor when backfilling a single failed spinner in a production cluster.
 
 Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
 cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 
 3:1), with dual 1GbE bonded NICs.
 
 Using only the throughput math, backfill could have theoretically completed 
 in a bit over 2.5 hours, but it actually took 15 hours. I've done this a few 
 times with similar results.
 
 Why? Spindle contention on the replacement drive. Graph the '%util' metric 
 from something like 'iostat -xt 2' during a single disk backfill to get a 
 very clear view that spindle contention is the true limiting factor.

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Christian Balzer
On Thu, 28 Aug 2014 10:29:20 -0400 Mike Dawson wrote:

 On 8/28/2014 12:23 AM, Christian Balzer wrote:
  On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:
 
 
 
  On 27/08/2014 04:34, Christian Balzer wrote:
 
  Hello,
 
  On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
 
  Hi Craig,
 
  I assume the reason for the 48 hours recovery time is to keep the
  cost of the cluster low ? I wrote 1h recovery time because it is
  roughly the time it would take to move 4TB over a 10Gb/s link.
  Could you upgrade your hardware to reduce the recovery time to less
  than two hours ? Or are there factors other than cost that prevent
  this ?
 
 
  I doubt Craig is operating on a shoestring budget.
  And even if his network were to be just GbE, that would still make it
  only 10 hours according to your wishful thinking formula.
 
  He probably has set the max_backfills to 1 because that is the level
  of I/O his OSDs can handle w/o degrading cluster performance too
  much. The network is unlikely to be the limiting factor.
 
  The way I see it most Ceph clusters are in sort of steady state when
  operating normally, i.e. a few hundred VM RBD images ticking over,
  most actual OSD disk ops are writes, as nearly all hot objects that
  are being read are in the page cache of the storage nodes.
  Easy peasy.
 
  Until something happens that breaks this routine, like a deep scrub,
  all those VMs rebooting at the same time or a backfill caused by a
  failed OSD. Now all of a sudden client ops compete with the backfill
  ops, page caches are no longer hot, the spinners are seeking left and
  right. Pandemonium.
 
  I doubt very much that even with a SSD backed cluster you would get
  away with less than 2 hours for 4TB.
 
  To give you some real life numbers, I currently am building a new
  cluster but for the time being have only one storage node to play
  with. It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs
  and 8 actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
 
  So I took out one OSD (reweight 0 first, then the usual removal
  steps) because the actual disk was wonky. Replaced the disk and
  re-added the OSD. Both operations took about the same time, 4
  minutes for evacuating the OSD (having 7 write targets clearly
  helped) for a measly 12GB or about 50MB/s and 5 minutes or about 35MB/s
  for refilling the OSD. And that is on one node (thus no network
  latency) that has the default parameters (so a max_backfill of 10)
  which was otherwise totally idle.
 
  In other words, in this pretty ideal case it would have taken 22
  hours to re-distribute 4TB.
 
  That makes sense to me :-)
 
  When I wrote 1h, I thought about what happens when an OSD becomes
  unavailable with no planning in advance. In the scenario you describe
  the risk of a data loss does not increase since the objects are
  evicted gradually from the disk being decommissioned and the number
  of replica stays the same at all times. There is not a sudden drop in
  the number of replica  which is what I had in mind.
 
  That may be, but I'm rather certain that there is no difference in
  speed and priority of a rebalancing caused by an OSD set to weight 0
  or one being set out.
 
  If the lost OSD was part of 100 PG, the other disks (let's say 50 of
  them) will start transferring a new replica of the objects they have
  to the new OSD in their PG. The replacement will not be a single OSD
  although nothing prevents the same OSD from being used in more than one PG
  as a replacement for the lost one. If the cluster network is
  connected at 10Gb/s and is 50% busy at all times, that leaves 5Gb/s.
  Since the new duplicates do not originate from a single OSD but from
  at least dozens of them and since they target more than one OSD, I
  assume we can expect an actual throughput of 5Gb/s. I should have
  written 2h instead of 1h to account for the fact that the cluster
  network is never idle.
 
  Am I being too optimistic ?
  Vastly.
 
  Do you see another blocking factor that
  would significantly slow down recovery ?
 
  As Craig and I keep telling you, the network is not the limiting
  factor. Concurrent disk IO is, as I pointed out in the other thread.
 
 Completely agree.
 
Thank you for that voice of reason, backing things up by a real life
sizable cluster. ^o^

 On a production cluster with OSDs backed by spindles, even with OSD 
 journals on SSDs, it is insufficient to calculate single-disk 
 replacement backfill time based solely on network throughput. IOPS will 
 likely be the limiting factor when backfilling a single failed spinner 
 in a production cluster.
 
 Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
 cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio 
 of 3:1), with dual 1GbE bonded NICs.
 
You're generous with your SSDs. ^o^

 Using only the throughput math, backfill could have theoretically 
 completed in a bit over 2.5 hours, but it actually took 15 hours. I've 
 

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson

On 8/28/2014 11:17 AM, Loic Dachary wrote:



On 28/08/2014 16:29, Mike Dawson wrote:

On 8/28/2014 12:23 AM, Christian Balzer wrote:

On Wed, 27 Aug 2014 13:04:48 +0200 Loic Dachary wrote:




On 27/08/2014 04:34, Christian Balzer wrote:


Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:


Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost
of the cluster low ? I wrote 1h recovery time because it is roughly
the time it would take to move 4TB over a 10Gb/s link. Could you
upgrade your hardware to reduce the recovery time to less than two
hours ? Or are there factors other than cost that prevent this ?



I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it
only 10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are
being read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub,
all those VMs rebooting at the same time or a backfill caused by a
failed OSD. Now all of a sudden client ops compete with the backfill
ops, page caches are no longer hot, the spinners are seeking left and
right. Pandemonium.

I doubt very much that even with a SSD backed cluster you would get
away with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new
cluster but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8
actual OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the
OSD. Both operations took about the same time, 4 minutes for
evacuating the OSD (having 7 write targets clearly helped) for a measly
12GB or about 50MB/s and 5 minutes or about 35MB/s for refilling the
OSD. And that is on one node (thus no network latency) that has the
default parameters (so a max_backfill of 10) which was otherwise
totally idle.

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.


That makes sense to me :-)

When I wrote 1h, I thought about what happens when an OSD becomes
unavailable with no planning in advance. In the scenario you describe
the risk of data loss does not increase since the objects are evicted
gradually from the disk being decommissioned and the number of replicas
stays the same at all times. There is not a sudden drop in the number of
replicas, which is what I had in mind.


That may be, but I'm rather certain that there is no difference in speed
and priority of a rebalancing caused by an OSD set to weight 0 or one
being set out.


If the lost OSD was part of 100 PG, the other disks (let's say 50 of them)
will start transferring a new replica of the objects they have to the
new OSD in their PG. The replacement will not be a single OSD although
nothing prevents the same OSD from being used in more than one PG as a
replacement for the lost one. If the cluster network is connected at
10Gb/s and is 50% busy at all times, that leaves 5Gb/s. Since the new
duplicates do not originate from a single OSD but from at least dozens
of them and since they target more than one OSD, I assume we can expect
an actual throughput of 5Gb/s. I should have written 2h instead of 1h to
account for the fact that the cluster network is never idle.

Am I being too optimistic ?

Vastly.


Do you see another blocking factor that
would significantly slow down recovery ?


As Craig and I keep telling you, the network is not the limiting factor.
Concurrent disk IO is, as I pointed out in the other thread.


Completely agree.

On a production cluster with OSDs backed by spindles, even with OSD journals on 
SSDs, it is insufficient to calculate single-disk replacement backfill time 
based solely on network throughput. IOPS will likely be the limiting factor 
when backfilling a single failed spinner in a production cluster.

Last week I replaced a 3TB 7200rpm drive that was ~75% full in a 72-osd 
cluster, 24 hosts, rbd pool with 3 replicas, osd journals on SSDs (ratio of 
3:1), with dual 1GbE bonded NICs.

Using only the throughput math, backfill could have theoretically completed in 
a bit over 2.5 hours, but it actually took 15 hours. I've done this a few times 
with similar results.
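
For anyone who wants to redo the throughput-only estimate, here is a rough 
Python sketch; the drive size, fill level and NIC speeds come from the 
paragraph above, while the 15 hours is of course the measured result, not 
something this math can predict:

  # Naive network-only backfill estimate vs. what actually happened.
  data_bytes = 3e12 * 0.75             # ~2.25 TB on a 3TB drive that was ~75% full
  link_bytes_per_s = 2 * 1e9 / 8.0     # dual 1GbE bonded, ~250 MB/s theoretical
  theoretical_h = data_bytes / link_bytes_per_s / 3600
  observed_h = 15.0
  print("theoretical: %.1f h, observed: %.1f h, slowdown: %.1fx"
        % (theoretical_h, observed_h, observed_h / theoretical_h))
  # -> theoretical: 2.5 h, observed: 15.0 h, slowdown: 6.0x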

Why? Spindle contention on the replacement drive. Graph the '%util' metric from 
something like 'iostat -xt 2' during a single disk backfill to get a very clear 
view that spindle contention is the true limiting factor.
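
As a rough, purely illustrative way to watch that contention (not something 
from the original message): the small Python sketch below approximates 
iostat's %util for one device by sampling the "time spent doing I/Os" 
counter in /proc/diskstats on Linux. The device name is an assumption; point 
it at the OSD's data disk and expect it to sit near 100% for the whole 
backfill.

  import sys, time

  dev = sys.argv[1] if len(sys.argv) > 1 else "sdb"
  interval = 2.0

  def io_ticks_ms(device):
      # field 13 of a /proc/diskstats line = milliseconds spent doing I/Os
      with open("/proc/diskstats") as f:
          for line in f:
              fields = line.split()
              if fields[2] == device:
                  return int(fields[12])
      raise SystemExit("device %s not found" % device)

  prev = io_ticks_ms(dev)
  while True:
      time.sleep(interval)
      cur = io_ticks_ms(dev)
      util = 100.0 * (cur - prev) / (interval * 1000.0)
      print("%s %%util: %5.1f" % (dev, min(util, 100.0)))
      prev = cur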

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Craig Lewis
My initial experience was similar to Mike's, causing a similar level of
paranoia.  :-)  I'm dealing with RadosGW though, so I can tolerate higher
latencies.

I was running my cluster with noout and nodown set for weeks at a time.
 Recovery of a single OSD might cause other OSDs to crash.  In the primary
cluster, I was always able to get it under control before it cascaded too
wide.  In my secondary cluster, it did spiral out to 40% of the OSDs, with
2-5 OSDs down at any time.

I traced my problems to a combination of osd max backfills being too high
for my cluster and mkfs.xfs arguments that were causing memory starvation
issues.  I lowered osd max backfills, added SSD journals, and reformatted
every OSD with better mkfs.xfs arguments.  Now both clusters are stable,
and I don't want to break them.

I only have 45 OSDs, so the risk with a 24-48 hours recovery time is
acceptable to me.  It will be a problem as I scale up, but scaling up will
also help with the latency problems.




On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson mike.daw...@cloudapt.com
wrote:


 We use 3x replication and have drives that have relatively high
 steady-state IOPS. Therefore, we tend to prioritize client-side IO over
 recovering quickly from a reduction from 3 copies to 2 during the loss of
 one disk. The disruption to client IO is so great on our cluster that we
 don't want it to be in a recovery state without operator supervision.

 Letting OSDs get marked out without operator intervention was a disaster
 in the early going of our cluster. For example, an OSD daemon crash would
 trigger automatic recovery where it was unneeded. Ironically, the
 unneeded recovery would often trigger additional daemons to crash,
 making a bad situation worse. During the recovery, RBD client IO would
 often drop to 0.

 To deal with this issue, we set mon osd down out interval = 14400, so as
 operators we have 4 hours to intervene before Ceph attempts to self-heal.
 When hardware is at fault, we remove the osd, replace the drive, re-add the
 osd, then allow backfill to begin, thereby completely skipping step B in
 your timeline above.

 - Mike
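
For reference, the interval Mike describes is an ordinary ceph.conf option 
(value in seconds); a minimal sketch, with the [mon] section being one 
reasonable place to put it:

  [mon]
      # 14400 seconds = 4 hours before a "down" OSD is marked "out" and recovery starts
      mon osd down out interval = 14400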




Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-28 Thread Mike Dawson


On 8/28/2014 4:17 PM, Craig Lewis wrote:

My initial experience was similar to Mike's, causing a similar level of
paranoia.  :-)  I'm dealing with RadosGW though, so I can tolerate
higher latencies.

I was running my cluster with noout and nodown set for weeks at a time.


I'm sure Craig will agree, but wanted to add this for other readers:

I find value in the noout flag for temporary intervention, but prefer to 
set 'mon osd down out interval' for dealing with events that may occur 
in the future, giving an operator time to intervene.
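
For completeness, the temporary-intervention flag mentioned above is set and 
cleared with the standard CLI, along the lines of:

  ceph osd set noout      # stop down OSDs from being marked out while you work
  # ... do the maintenance or investigation ...
  ceph osd unset noout    # return to normal behaviour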


The nodown flag is another beast altogether; it tends to be *a bad thing* 
when attempting to provide reliable client IO. For our use case, we want 
OSDs to be marked down quickly if they are in fact unavailable for any 
reason, so client IO doesn't hang waiting for them.


If OSDs are flapping during recovery (i.e. the "wrongly marked me down" 
log messages), I've had far better results from tuning the recovery knobs 
than from permanently setting the nodown flag.


- Mike



  Recovery of a single OSD might cause other OSDs to crash. In the
primary cluster, I was always able to get it under control before it
cascaded too wide.  In my secondary cluster, it did spiral out to 40% of
the OSDs, with 2-5 OSDs down at any time.






I traced my problems to a combination of osd max backfills being too high
for my cluster and mkfs.xfs arguments that were causing memory starvation
issues.  I lowered osd max backfills, added SSD journals,
and reformatted every OSD with better mkfs.xfs arguments.  Now both
clusters are stable, and I don't want to break them.

I only have 45 OSDs, so the risk with a 24-48 hours recovery time is
acceptable to me.  It will be a problem as I scale up, but scaling up
will also help with the latency problems.




On Thu, Aug 28, 2014 at 10:38 AM, Mike Dawson mike.daw...@cloudapt.com wrote:


We use 3x replication and have drives that have relatively high
steady-state IOPS. Therefore, we tend to prioritize client-side IO
over recovering quickly from a reduction from 3 copies to 2 during the
loss of one disk. The disruption to client IO is so great on our
cluster that we don't want it to be in a recovery state without
operator supervision.

Letting OSDs get marked out without operator intervention was a
disaster in the early going of our cluster. For example, an OSD
daemon crash would trigger automatic recovery where it was unneeded.
Ironically, the unneeded recovery would often trigger additional
daemons to crash, making a bad situation worse. During the recovery,
RBD client IO would often drop to 0.

To deal with this issue, we set mon osd down out interval = 14400,
so as operators we have 4 hours to intervene before Ceph attempts to
self-heal. When hardware is at fault, we remove the osd, replace the
drive, re-add the osd, then allow backfill to begin, thereby
completely skipping step B in your timeline above.

- Mike





Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-27 Thread Loic Dachary


On 27/08/2014 04:34, Christian Balzer wrote:
 
 Hello,
 
 On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:
 
 Hi Craig,

 I assume the reason for the 48 hours recovery time is to keep the cost
 of the cluster low ? I wrote 1h recovery time because it is roughly
 the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
 your hardware to reduce the recovery time to less than two hours ? Or
 are there factors other than cost that prevent this ?

 
 I doubt Craig is operating on a shoestring budget.
 And even if his network were to be just GbE, that would still make it only
 10 hours according to your wishful thinking formula.
 
 He probably has set the max_backfills to 1 because that is the level of
 I/O his OSDs can handle w/o degrading cluster performance too much.
 The network is unlikely to be the limiting factor.
 
 The way I see it most Ceph clusters are in sort of steady state when
 operating normally, i.e. a few hundred VM RBD images ticking over, most
 actual OSD disk ops are writes, as nearly all hot objects that are being
 read are in the page cache of the storage nodes.
 Easy peasy.
 
 Until something happens that breaks this routine, like a deep scrub, all
 those VMs rebooting at the same time or a backfill caused by a failed OSD.
 Now all of a sudden client ops compete with the backfill ops, page caches
 are no longer hot, the spinners are seeking left and right. 
 Pandemonium.
 
 I doubt very much that even with a SSD backed cluster you would get away
 with less than 2 hours for 4TB.
 
 To give you some real life numbers, I currently am building a new cluster
 but for the time being have only one storage node to play with.
 It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8 actual
 OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.
 
 So I took out one OSD (reweight 0 first, then the usual removal steps)
 because the actual disk was wonky. Replaced the disk and re-added the OSD.
 Both operations took about the same time, 4 minutes for evacuating the OSD
 (having 7 write targets clearly helped) for a measly 12GB or about 50MB/s
 and 5 minutes or about 35MB/s for refilling the OSD. 
 And that is on one node (thus no network latency) that has the default
 parameters (so a max_backfill of 10) which was otherwise totally idle. 
 
 In other words, in this pretty ideal case it would have taken 22 hours
 to re-distribute 4TB.

That makes sense to me :-) 

When I wrote 1h, I thought about what happens when an OSD becomes unavailable 
with no planning in advance. In the scenario you describe the risk of data 
loss does not increase since the objects are evicted gradually from the disk 
being decommissioned and the number of replicas stays the same at all times. 
There is not a sudden drop in the number of replicas, which is what I had in 
mind.

If the lost OSD was part of 100 PG, the other disks (let's say 50 of them) will 
start transferring a new replica of the objects they have to the new OSD in 
their PG. The replacement will not be a single OSD although nothing prevents 
the same OSD from being used in more than one PG as a replacement for the lost one. 
If the cluster network is connected at 10Gb/s and is 50% busy at all times, 
that leaves 5Gb/s. Since the new duplicates do not originate from a single OSD 
but from at least dozens of them and since they target more than one OSD, I 
assume we can expect an actual throughput of 5Gb/s. I should have written 2h 
instead of 1h to account for the fact that the cluster network is never idle.

Am I being too optimistic ? Do you see another blocking factor that would 
significantly slow down recovery ?

Cheers

 More in another reply.
 
 Cheers

 On 26/08/2014 19:37, Craig Lewis wrote:
 My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd
 max backfills = 1).   I believe that increases my risk of failure by
 48^2 .  Since your numbers are failure rate per hour per disk, I need
 to consider the risk for the whole time for each disk. So more formally,
 the risk scales with rebuild time to the power of (replicas - 1).

 So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
 higher risk than 1 / 10^8.


 A risk of 1/43,000 means that I'm more likely to lose data due to
 human error than disk failure.  Still, I can put a small bit of effort
 in to optimize recovery speed, and lower this number.  Managing human
 error is much harder.






 On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org wrote:

 Using percentages instead of numbers led me to calculation
 errors. Here it is again using 1/100 instead of % for clarity ;-)

 Assuming that:

 * The pool is configured for three replicas (size = 3 which is the
 default)
 * It takes one hour for Ceph to recover from the loss of a single
 OSD
 * Any other disk has a 1/100,000 chance to fail within the hour
 following the failure of the first disk (assuming AFR
 https://en.wikipedia.org/wiki/Annualized_failure_rate 

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 20:21:39 +0200 Loic Dachary wrote:

 Hi Craig,
 
 I assume the reason for the 48 hours recovery time is to keep the cost
 of the cluster low ? I wrote 1h recovery time because it is roughly
 the time it would take to move 4TB over a 10Gb/s link. Could you upgrade
 your hardware to reduce the recovery time to less than two hours ? Or
 are there factors other than cost that prevent this ?
 

I doubt Craig is operating on a shoestring budget.
And even if his network were to be just GbE, that would still make it only
10 hours according to your wishful thinking formula.

He probably has set the max_backfills to 1 because that is the level of
I/O his OSDs can handle w/o degrading cluster performance too much.
The network is unlikely to be the limiting factor.

The way I see it most Ceph clusters are in sort of steady state when
operating normally, i.e. a few hundred VM RBD images ticking over, most
actual OSD disk ops are writes, as nearly all hot objects that are being
read are in the page cache of the storage nodes.
Easy peasy.

Until something happens that breaks this routine, like a deep scrub, all
those VMs rebooting at the same time or a backfill caused by a failed OSD.
Now all of a sudden client ops compete with the backfill ops, page caches
are no longer hot, the spinners are seeking left and right. 
Pandemonium.

I doubt very much that even with a SSD backed cluster you would get away
with less than 2 hours for 4TB.

To give you some real life numbers, I currently am building a new cluster
but for the time being have only one storage node to play with.
It consists of 32GB RAM, plenty of CPU oomph, 4 journal SSDs  and 8 actual
OSD HDDs (3TB, 7200RPM). 90GB of (test) data on it.

So I took out one OSD (reweight 0 first, then the usual removal steps)
because the actual disk was wonky. Replaced the disk and re-added the OSD.
Both operations took about the same time, 4 minutes for evacuating the OSD
(having 7 write targets clearly helped) for a measly 12GB or about 50MB/s
and 5 minutes or about 35MB/s for refilling the OSD. 
And that is on one node (thus no network latency) that has the default
parameters (so a max_backfill of 10) which was otherwise totally idle. 

In other words, in this pretty ideal case it would have taken 22 hours
to re-distribute 4TB.
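
For what it's worth, the 22-hour figure falls straight out of the measured 
rates above; a trivial Python sketch:

  # Extrapolate the measured single-node backfill rates to a full 4TB OSD.
  evacuate_rate = 50e6    # ~50 MB/s observed while evacuating 12GB
  refill_rate = 35e6      # ~35 MB/s observed while refilling
  osd_bytes = 4e12
  print("at 50 MB/s: ~%.0f hours" % (osd_bytes / evacuate_rate / 3600))   # ~22
  print("at 35 MB/s: ~%.0f hours" % (osd_bytes / refill_rate / 3600))     # ~32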

More in another reply.

 Cheers
 
 On 26/08/2014 19:37, Craig Lewis wrote:
  My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd
  max backfills = 1).   I believe that increases my risk of failure by
  48^2 .  Since your numbers are failure rate per hour per disk, I need
  to consider the risk for the whole time for each disk. So more formally,
  the risk scales with rebuild time to the power of (replicas - 1).
  
  So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
  higher risk than 1 / 10^8.
  
  
  A risk of 1/43,000 means that I'm more likely to lose data due to
  human error than disk failure.  Still, I can put a small bit of effort
  in to optimize recovery speed, and lower this number.  Managing human
  error is much harder.
  
  
  
  
  
  
  On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org wrote:
  
   Using percentages instead of numbers led me to calculation
  errors. Here it is again using 1/100 instead of % for clarity ;-)
  
  Assuming that:
  
  * The pool is configured for three replicas (size = 3 which is the
  default)
  * It takes one hour for Ceph to recover from the loss of a single
  OSD
  * Any other disk has a 1/100,000 chance to fail within the hour
  following the failure of the first disk (assuming AFR
  https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
  8%, divided by the number of hours during a year == (0.08 / 8760) ~=
  1/100,000
  * A given disk does not participate in more than 100 PG
  
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer
 is higher than normal:
  this is just an example of a high probability event leading to data
  loss. In other words, I wonder if this 0.0001% chance of losing a PG
  within the hour following a disk failure matters or if it is dominated
  by other factors. What do you think ?
  
  Cheers
 
  On 26/08/2014 15:25, Loic Dachary wrote:
   Hi Blair,
  
  Assuming that:
  
  * The pool is configured for three replicas (size = 3 which is the
  default)
  * It takes one hour for Ceph to recover from the loss of a single OSD
  * Any other disk has a 0.001% chance to fail within the hour following
  the failure of the first disk (assuming AFR
  https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
  10%, divided by the number of hours during a year).
  * A given disk does not participate in more than 100 PG
  
  Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance
  that two other disks are lost before recovery. Since the disk that
   failed initially participates in 100 PG, that is 0.01% x 100 =
   0.0001% chance that a PG is lost. Or the entire pool if it is used in
   a way that losing a PG means losing all data in the pool (as in your
   example, where it contains RBD volumes and each of the RBD volumes uses
  all the available PG).
  
  If the pool is using at least two datacenters operated by two
  different organizations, this calculation makes sense to me. However,
  if the cluster is in a single datacenter, isn't it possible that some
  event independent of Ceph has a greater probability of permanently
  destroying the data ? A month ago I lost three machines in a Ceph
  cluster and realized on that occasion that the crushmap was not
   configured properly and that PGs were lost as a result. Fortunately I
  was able to recover the disks and plug them in another machine to
  recover the lost PGs. I'm not a system administrator and the
  probability of me failing to do the right thing is higher than normal:
  this is just an example of a high probability event leading to data
  loss. In other words, I wonder if this 0.0001% chance of losing a PG
  within the hour following a disk failure matters or if it is dominated
  by other factors. What do you think ?
  
  Cheers
  
  On 26/08/2014 02:23, Blair Bethwaite wrote:
  Message: 25
  Date: Fri, 15 Aug 2014 15:06:49 +0200
  From: Loic Dachary l...@dachary.org
  To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
  Message-ID: 53ee05e9.1040...@dachary.org
  Content-Type: text/plain; charset=iso-8859-1
  ...
  Here is how I reason about it, roughly:
 
   If the probability of losing a disk is 0.1%, the probability of
   losing two disks simultaneously (i.e. before the failure can be
  recovered) would be 0.1*0.1 = 0.01% and three disks becomes
  0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
 
  I watched this conversation and an older similar one (Failure
  probability with largish deployments) with interest as we are in the
  process of planning a pretty large Ceph cluster (~3.5 PB), so I have
  been trying to wrap my head around these issues.
 
  Loic's reasoning (above) seems sound as a naive approximation assuming
  independent probabilities for disk failures, which may not be quite
  true given potential for batch production issues, but should be okay
  for other sorts of correlations (assuming a sane crushmap that
  eliminates things like controllers and nodes as sources of
  correlation).
 
  One of the things that came up in the Failure probability with
  largish deployments thread and has raised its head again here is the
  idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
  be somehow more prone to data-loss than non-striped. I don't think
  anyone has so far provided an answer on this, so here's my thinking...
 
   The level of atomicity that matters when looking at durability and
  availability in Ceph is the Placement Group. For any non-trivial RBD
  it is likely that many RBDs will span all/most PGs, e.g., even a
  relatively small 50GiB volume would (with default 4MiB object size)
  span 12800 PGs - more than there are in many production clusters
   obeying the 100-200 PGs per drive rule of thumb. IMPORTANT: Losing any
   one PG will cause data-loss. The failure-probability effects of
   striping across multiple PGs are immaterial considering that loss of
   any single PG is likely to damage all your RBDs. This
  might be why the reliability calculator doesn't consider total number
  of disks.
 
  Related to all this is the durability of 2 versus 3 replicas (or e.g.
  M=1 for Erasure Coding). It's easy to get caught up in the worrying
  fallacy that losing any M OSDs will cause data-loss, but this isn't
  true - they have to be members of the same PG for data-loss to occur.
  So then it's tempting to think the chances of that happening are so
  slim as to not matter and why would we ever even need 3 replicas. I
  mean

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Christian Balzer

Hello,

On Tue, 26 Aug 2014 10:23:43 +1000 Blair Bethwaite wrote:

  Message: 25
  Date: Fri, 15 Aug 2014 15:06:49 +0200
  From: Loic Dachary l...@dachary.org
  To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
  Message-ID: 53ee05e9.1040...@dachary.org
  Content-Type: text/plain; charset=iso-8859-1
  ...
  Here is how I reason about it, roughly:
 
   If the probability of losing a disk is 0.1%, the probability of
   losing two disks simultaneously (i.e. before the failure can be
  recovered) would be 0.1*0.1 = 0.01% and three disks becomes
  0.1*0.1*0.1 = 0.001% and four disks becomes 0.0001%
 
 I watched this conversation and an older similar one (Failure
 probability with largish deployments) with interest as we are in the
 process of planning a pretty large Ceph cluster (~3.5 PB), so I have
 been trying to wrap my head around these issues.

As the OP of the "Failure probability with largish deployments" thread I
have to thank Blair for raising this issue again and doing the hard math
below, which looks fine to me.

At the end of that slightly inconclusive thread I walked away with the
same impression as Blair, namely that the survival of PGs is the key
factor and that they will likely be spread out over most, if not all the
OSDs.

Which in turn did reinforce my decision to deploy our first production
Ceph cluster based on nodes with 2 OSDs backed by 11-disk RAID6 sets behind
a HW RAID controller with 4GB cache AND SSD journals. 
I can live with the reduced performance (which is caused by the OSD code
running out of steam long before the SSDs or the RAIDs do), because not
only do I save 1/3rd of the space and 1/4th of the cost compared to a
replication 3 cluster, the total of disks that need to fail within the
recovery window to cause data loss is now 4.

The next cluster I'm currently building is a classic Ceph design,
replication of 3, 8 OSD HDDs and 4 journal SSDs per node, because with
this cluster I won't have predictable I/O patterns and loads.
OTOH, I don't see it growing much beyond 48 OSDs, so I'm happy enough with
the odds here.

I think doing the exact maths for a cluster of the size you're planning
would be very interesting and also very much needed. 
3.5PB usable space would be close to 3000 disks with a replication of 3,
but even if you meant that as a gross value it would probably mean that
you're looking at frequent, if not daily, disk failures.
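
A rough sketch of that sizing, purely illustrative (the 4TB drive size and 
8% AFR are assumptions borrowed from elsewhere in this thread):

  usable_pb, replicas, disk_tb, afr = 3.5, 3, 4.0, 0.08
  disks = usable_pb * 1000 * replicas / disk_tb          # ~2600 spindles
  print("disks: ~%d, failures per week: ~%.1f" % (disks, disks * afr / 52))
  # -> roughly one dead drive every other day at that scale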


Regards,

Christian
 Loic's reasoning (above) seems sound as a naive approximation assuming
 independent probabilities for disk failures, which may not be quite
 true given potential for batch production issues, but should be okay
 for other sorts of correlations (assuming a sane crushmap that
 eliminates things like controllers and nodes as sources of
 correlation).
 
 One of the things that came up in the Failure probability with
 largish deployments thread and has raised its head again here is the
 idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
 be somehow more prone to data-loss than non-striped. I don't think
 anyone has so far provided an answer on this, so here's my thinking...
 
  The level of atomicity that matters when looking at durability and
 availability in Ceph is the Placement Group. For any non-trivial RBD
 it is likely that many RBDs will span all/most PGs, e.g., even a
 relatively small 50GiB volume would (with default 4MiB object size)
 span 12800 PGs - more than there are in many production clusters
  obeying the 100-200 PGs per drive rule of thumb. IMPORTANT: Losing any
  one PG will cause data-loss. The failure-probability effects of
  striping across multiple PGs are immaterial considering that loss of
  any single PG is likely to damage all your RBDs. This
 might be why the reliability calculator doesn't consider total number
 of disks.
 
 Related to all this is the durability of 2 versus 3 replicas (or e.g.
 M=1 for Erasure Coding). It's easy to get caught up in the worrying
 fallacy that losing any M OSDs will cause data-loss, but this isn't
 true - they have to be members of the same PG for data-loss to occur.
 So then it's tempting to think the chances of that happening are so
 slim as to not matter and why would we ever even need 3 replicas. I
 mean, what are the odds of exactly those 2 drives, out of the
  100,200... in my cluster, failing in the recovery window?! But therein
  lies the rub - you should be thinking about PGs. If a drive fails then
  the chance of a data-loss event resulting is dependent on the chances
 of losing further drives from the affected/degraded PGs.
 
 I've got a real cluster at hand, so let's use that as an example. We
 have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
 failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
 dies. How many PGs are now at risk:
 $ grep ^10\. pg.dump | awk '{print $15}' | grep 15 | wc
 109 109 861
 (NB: 10 is the pool id

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Blair,

Assuming that:

* The pool is configured for three replicas (size = 3 which is the default)
* It takes one hour for Ceph to recover from the loss of a single OSD
* Any other disk has a 0.001% chance to fail within the hour following the 
failure of the first disk (assuming AFR 
https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 10%, 
divided by the number of hours during a year).
* A given disk does not participate in more than 100 PG

Each time an OSD is lost, there is a 0.001*0.001 = 0.01% chance that two 
other disks are lost before recovery. Since the disk that failed initially 
participates in 100 PG, that is 0.01% x 100 = 0.0001% chance that a PG is 
lost. Or the entire pool if it is used in a way that losing a PG means losing 
all data in the pool (as in your example, where it contains RBD volumes and 
each of the RBD volumes uses all the available PG).
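
For readers who want to plug in their own numbers, here is the same model as 
a small Python sketch (the same arithmetic is redone with plain fractions 
instead of percentages later in the thread); the AFR, recovery time and PG 
count are the assumptions listed above:

  afr = 0.10                # annualized failure rate of one disk
  recovery_hours = 1.0      # assumed time to re-replicate the lost OSD
  pgs_per_osd = 100         # a given disk participates in at most 100 PG
  p_disk = afr / (24 * 365) * recovery_hours     # chance of one more disk in the window
  p_pg_lost = p_disk ** 2 * pgs_per_osd          # two more disks, over 100 PG
  print("P(another disk in the window): %.1e" % p_disk)      # ~1.1e-05
  print("P(some PG lost after failure): %.1e" % p_pg_lost)   # ~1.3e-08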

If the pool is using at least two datacenters operated by two different 
organizations, this calculation makes sense to me. However, if the cluster is 
in a single datacenter, isn't it possible that some event independent of Ceph 
has a greater probability of permanently destroying the data ? A month ago I 
lost three machines in a Ceph cluster and realized on that occasion that the 
crushmap was not configured properly and that PGs were lost as a result. 
Fortunately I was able to recover the disks and plug them in another machine to 
recover the lost PGs. I'm not a system administrator and the probability of me 
failing to do the right thing is higher than normal: this is just an example of 
a high probability event leading to data loss. In other words, I wonder if this 
0.0001% chance of losing a PG within the hour following a disk failure matters 
or if it is dominated by other factors. What do you think ?

Cheers

On 26/08/2014 02:23, Blair Bethwaite wrote:
 Message: 25
 Date: Fri, 15 Aug 2014 15:06:49 +0200
 From: Loic Dachary l...@dachary.org
 To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
 Message-ID: 53ee05e9.1040...@dachary.org
 Content-Type: text/plain; charset=iso-8859-1
 ...
 Here is how I reason about it, roughly:

 If the probability of losing a disk is 0.1%, the probability of losing two 
 disks simultaneously (i.e. before the failure can be recovered) would be 
 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
 becomes 0.0001%
 
 I watched this conversation and an older similar one (Failure
 probability with largish deployments) with interest as we are in the
 process of planning a pretty large Ceph cluster (~3.5 PB), so I have
 been trying to wrap my head around these issues.
 
 Loic's reasoning (above) seems sound as a naive approximation assuming
 independent probabilities for disk failures, which may not be quite
 true given potential for batch production issues, but should be okay
 for other sorts of correlations (assuming a sane crushmap that
 eliminates things like controllers and nodes as sources of
 correlation).
 
 One of the things that came up in the Failure probability with
 largish deployments thread and has raised its head again here is the
 idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
 be somehow more prone to data-loss than non-striped. I don't think
 anyone has so far provided an answer on this, so here's my thinking...
 
 The level of atomicity that matters when looking at durability and
 availability in Ceph is the Placement Group. For any non-trivial RBD
 it is likely that many RBDs will span all/most PGs, e.g., even a
 relatively small 50GiB volume would (with default 4MiB object size)
 span 12800 PGs - more than there are in many production clusters
 obeying the 100-200 PGs per drive rule of thumb. IMPORTANT: Losing any
 one PG will cause data-loss. The failure-probability effects of
 striping across multiple PGs are immaterial considering that loss of
 any single PG is likely to damage all your RBDs. This
 might be why the reliability calculator doesn't consider total number
 of disks.
 
 Related to all this is the durability of 2 versus 3 replicas (or e.g.
 M=1 for Erasure Coding). It's easy to get caught up in the worrying
 fallacy that losing any M OSDs will cause data-loss, but this isn't
 true - they have to be members of the same PG for data-loss to occur.
 So then it's tempting to think the chances of that happening are so
 slim as to not matter and why would we ever even need 3 replicas. I
 mean, what are the odds of exactly those 2 drives, out of the
 100,200... in my cluster, failing in the recovery window?! But therein
 lies the rub - you should be thinking about PGs. If a drive fails then
 the chance of a data-loss event resulting is dependent on the chances
 of losing further drives from the affected/degraded PGs.
 
 I've got a real cluster at hand, so let's use that as an example. We
 have 96 drives/OSDs - 8 nodes, 12

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
).
 
 If the pool is using at least two datacenters operated by two different 
 organizations, this calculation makes sense to me. However, if the cluster is 
 in a single datacenter, isn't it possible that some event independent of Ceph 
 has a greater probability of permanently destroying the data ? A month ago I 
 lost three machines in a Ceph cluster and realized on that occasion that the 
 crushmap was not configured properly and that PGs were lost as a result. 
 Fortunately I was able to recover the disks and plug them in another machine 
 to recover the lost PGs. I'm not a system administrator and the probability 
 of me failing to do the right thing is higher than normal: this is just an 
 example of a high probability event leading to data loss. In other words, I 
 wonder if this 0.0001% chance of losing a PG within the hour following a disk 
 failure matters or if it is dominated by other factors. What do you think ?
 
 Cheers
 
 On 26/08/2014 02:23, Blair Bethwaite wrote:
 Message: 25
 Date: Fri, 15 Aug 2014 15:06:49 +0200
 From: Loic Dachary l...@dachary.org
 To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
 Message-ID: 53ee05e9.1040...@dachary.org
 Content-Type: text/plain; charset=iso-8859-1
 ...
 Here is how I reason about it, roughly:

 If the probability of losing a disk is 0.1%, the probability of losing 
 two disks simultaneously (i.e. before the failure can be recovered) would 
 be 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four 
 disks becomes 0.0001%

 I watched this conversation and an older similar one (Failure
 probability with largish deployments) with interest as we are in the
 process of planning a pretty large Ceph cluster (~3.5 PB), so I have
 been trying to wrap my head around these issues.

 Loic's reasoning (above) seems sound as a naive approximation assuming
 independent probabilities for disk failures, which may not be quite
 true given potential for batch production issues, but should be okay
 for other sorts of correlations (assuming a sane crushmap that
 eliminates things like controllers and nodes as sources of
 correlation).

 One of the things that came up in the Failure probability with
 largish deployments thread and has raised its head again here is the
 idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
 be somehow more prone to data-loss than non-striped. I don't think
 anyone has so far provided an answer on this, so here's my thinking...

 The level of atomicity that matters when looking at durability and
 availability in Ceph is the Placement Group. For any non-trivial RBD
 it is likely that many RBDs will span all/most PGs, e.g., even a
 relatively small 50GiB volume would (with default 4MiB object size)
 span 12800 PGs - more than there are in many production clusters
 obeying the 100-200 PGs per drive rule of thumb. IMPORTANT: Losing any
 one PG will cause data-loss. The failure-probability effects of
 striping across multiple PGs are immaterial considering that loss of
 any single PG is likely to damage all your RBDs. This
 might be why the reliability calculator doesn't consider total number
 of disks.

 Related to all this is the durability of 2 versus 3 replicas (or e.g.
 M=1 for Erasure Coding). It's easy to get caught up in the worrying
 fallacy that losing any M OSDs will cause data-loss, but this isn't
 true - they have to be members of the same PG for data-loss to occur.
 So then it's tempting to think the chances of that happening are so
 slim as to not matter and why would we ever even need 3 replicas. I
 mean, what are the odds of exactly those 2 drives, out of the
  100,200... in my cluster, failing in the recovery window?! But therein
  lies the rub - you should be thinking about PGs. If a drive fails then
  the chance of a data-loss event resulting is dependent on the chances
 of losing further drives from the affected/degraded PGs.

 I've got a real cluster at hand, so let's use that as an example. We
 have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
 failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
 dies. How many PGs are now at risk:
 $ grep ^10\. pg.dump | awk '{print $15}' | grep 15 | wc
 109 109 861
 (NB: 10 is the pool id, pg.dump is a text file dump of ceph pg dump,
 $15 is the acting set column)

 109 PGs now living on the edge. No surprises in that number as we
 used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so
 on average any one OSD will be primary for 50 PGs and replica for
 another 50. But this doesn't tell me how exposed I am, for that I need
 to know how many neighbouring OSDs there are in these 109 PGs:
 $ grep ^10\. pg.dump | awk '{print $15}' | grep 15 | sed
 's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
  67  67 193
 (NB: grep-ing for OSD 15 and using sed to remove it and surrounding
 formatting to get just

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Craig Lewis
My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max
backfills = 1).   I believe that increases my risk of failure by 48^2 .
 Since your numbers are failure rate per hour per disk, I need to consider
the risk for the whole time for each disk.  So more formally, the risk
scales with rebuild time to the power of (replicas - 1).

So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much
higher risk than 1 / 10^8.
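
The scaling can be written out the same way; a small Python sketch using the 
corrected 1-in-100,000-per-hour figure from the message quoted below:

  p_hour = 1.0 / 100000     # chance another given disk fails in any one hour
  pgs_per_osd = 100
  replicas = 3
  def one_in(rebuild_hours):
      p = (p_hour * rebuild_hours) ** (replicas - 1) * pgs_per_osd
      return 1 / p
  print("1h rebuild : 1 in %.0f" % one_in(1))     # 1 in 100,000,000
  print("48h rebuild: 1 in %.0f" % one_in(48))    # about 1 in 43,000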


A risk of 1/43,000 means that I'm more likely to lose data due to human
error than disk failure.  Still, I can put a small bit of effort in to
optimize recovery speed, and lower this number.  Managing human error is
much harder.






On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org wrote:

 Using percentages instead of numbers led me to calculation errors. Here
 it is again using 1/100 instead of % for clarity ;-)

 Assuming that:

 * The pool is configured for three replicas (size = 3 which is the default)
 * It takes one hour for Ceph to recover from the loss of a single OSD
 * Any other disk has a 1/100,000 chance to fail within the hour following
 the failure of the first disk (assuming AFR
 https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is
 8%, divided by the number of hours during a year == (0.08 / 8760) ~=
 1/100,000
 * A given disk does not participate in more than 100 PG



Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-26 Thread Loic Dachary
Hi Craig,

I assume the reason for the 48 hours recovery time is to keep the cost of the 
cluster low ? I wrote 1h recovery time because it is roughly the time it 
would take to move 4TB over a 10Gb/s link. Could you upgrade your hardware to 
reduce the recovery time to less than two hours ? Or are there factors other 
than cost that prevent this ?
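
The 1h figure is just size over line rate; a trivial Python sketch, including 
the GbE case raised elsewhere in the thread:

  def transfer_hours(data_tb, link_gbps):
      return data_tb * 1e12 * 8 / (link_gbps * 1e9) / 3600.0
  print("4TB over 10Gb/s: %.1f h" % transfer_hours(4, 10))   # ~0.9 h
  print("4TB over  1Gb/s: %.1f h" % transfer_hours(4, 1))    # ~8.9 h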

Cheers

On 26/08/2014 19:37, Craig Lewis wrote:
 My OSD rebuild time is more like 48 hours (4TB disks, 60% full, osd max 
 backfills = 1).   I believe that increases my risk of failure by 48^2 .  
 Since your numbers are failure rate per hour per disk, I need to consider the 
 risk for the whole time for each disk.  So more formally, the risk scales 
 with rebuild time to the power of (replicas - 1).
 
 So I'm at 2304/100,000,000, or  approximately 1/43,000.  That's a much higher 
 risk than 1 / 10^8.
 
 
 A risk of 1/43,000 means that I'm more likely to lose data due to human error 
 than disk failure.  Still, I can put a small bit of effort in to optimize 
 recovery speed, and lower this number.  Managing human error is much harder.
 
 
 
 
 
 
 On Tue, Aug 26, 2014 at 7:12 AM, Loic Dachary l...@dachary.org wrote:
 
 Using percentages instead of numbers led me to calculation errors. Here 
 it is again using 1/100 instead of % for clarity ;-)
 
 Assuming that:
 
 * The pool is configured for three replicas (size = 3 which is the 
 default)
 * It takes one hour for Ceph to recover from the loss of a single OSD
 * Any other disk has a 1/100,000 chance to fail within the hour following 
 the failure of the first disk (assuming AFR 
 https://en.wikipedia.org/wiki/Annualized_failure_rate of every disk is 8%, 
 divided by the number of hours during a year == (0.08 / 8760) ~= 1/100,000
 * A given disk does not participate in more than 100 PG
 

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-25 Thread Blair Bethwaite
 Message: 25
 Date: Fri, 15 Aug 2014 15:06:49 +0200
 From: Loic Dachary l...@dachary.org
 To: Erik Logtenberg e...@logtenberg.eu, ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Best practice K/M-parameters EC pool
 Message-ID: 53ee05e9.1040...@dachary.org
 Content-Type: text/plain; charset=iso-8859-1
 ...
 Here is how I reason about it, roughly:

 If the probability of losing a disk is 0.1%, the probability of losing two 
 disks simultaneously (i.e. before the failure can be recovered) would be 
 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
 becomes 0.0001%

I watched this conversation and an older similar one (Failure
probability with largish deployments) with interest as we are in the
process of planning a pretty large Ceph cluster (~3.5 PB), so I have
been trying to wrap my head around these issues.

Loic's reasoning (above) seems sound as a naive approximation assuming
independent probabilities for disk failures, which may not be quite
true given potential for batch production issues, but should be okay
for other sorts of correlations (assuming a sane crushmap that
eliminates things like controllers and nodes as sources of
correlation).

One of the things that came up in the Failure probability with
largish deployments thread and has raised its head again here is the
idea that striped data (e.g., RADOS-GW objects and RBD volumes) might
be somehow more prone to data-loss than non-striped. I don't think
anyone has so far provided an answer on this, so here's my thinking...

The level of atomicity that matters when looking at durability and
availability in Ceph is the Placement Group. For any non-trivial RBD
it is likely that many RBDs will span all/most PGs, e.g., even a
relatively small 50GiB volume would (with default 4MiB object size)
span 12800 PGs - more than there are in many production clusters
obeying the 100-200 PGs per drive rule of thumb. IMPORTANT: Losing any
one PG will cause data-loss. The failure-probability effects of
striping across multiple PGs are immaterial considering that loss of
any single PG is likely to damage all your RBDs. This
might be why the reliability calculator doesn't consider total number
of disks.

Related to all this is the durability of 2 versus 3 replicas (or e.g.
M=1 for Erasure Coding). It's easy to get caught up in the worrying
fallacy that losing any M OSDs will cause data-loss, but this isn't
true - they have to be members of the same PG for data-loss to occur.
So then it's tempting to think the chances of that happening are so
slim as to not matter and why would we ever even need 3 replicas. I
mean, what are the odds of exactly those 2 drives, out of the
100,200... in my cluster, failing in the recovery window?! But therein
lies the rub - you should be thinking about PGs. If a drive fails then
the chance of a data-loss event resulting is dependent on the chances
of losing further drives from the affected/degraded PGs.

I've got a real cluster at hand, so let's use that as an example. We
have 96 drives/OSDs - 8 nodes, 12 OSDs per node, 2 replicas, top-down
failure domains: rack pairs (x2), nodes, OSDs... Let's say OSD 15
dies. How many PGs are now at risk:
$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | wc
109 109 861
(NB: 10 is the pool id, pg.dump is a text file dump of "ceph pg dump",
$15 is the acting set column)

109 PGs now living on the edge. No surprises in that number as we
used 100 * 96 / 2 = 4800 to arrive at the PG count for this pool, so
on average any one OSD will be primary for 50 PGs and replica for
another 50. But this doesn't tell me how exposed I am; for that I need
to know how many neighbouring OSDs there are in these 109 PGs:
$ grep "^10\." pg.dump | awk '{print $15}' | grep 15 | sed
's/\[15,\(.*\)\]/\1/' | sed 's/\[\(.*\),15\]/\1/' | sort | uniq | wc
 67  67 193
(NB: grep-ing for OSD 15 and using sed to remove it and surrounding
formatting to get just the neighbour id)

Yikes! So if any one of those 67 drives fails during recovery of OSD
15, then we've lost data. We should expect this neighbour count to be
determined by our crushmap, which in this case splits the cluster up
into 2 top-level failure domains, so I'd have guessed roughly 48
candidate neighbours (1 in 48 drives) on average for this cluster. But
actually looking at the numbers for each OSD it is higher than that
here - the lowest distinct neighbour count we have is 50. Note that we
haven't tuned any of the options in our crushmap, so I guess maybe Ceph
favours fewer repeat sets by default when coming up with PGs(?).

Anyway, here's the average and top 10 neighbour counts (hope this
scripting is right! ;-):

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
'{print $15}' | grep "\[${OSD},\|,${OSD}\]" | sed
"s/\[$OSD,\(.*\)\]/\1/" | sed "s/\[\(.*\),$OSD\]/\1/" | sort | uniq |
wc -l; done | awk '{ total += $2 } END { print total/NR }'
58.5208

$ for OSD in {0..95}; do echo -ne "$OSD\t"; grep "^10\." pg.dump | awk
'{print
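
(The second one-liner is cut off in the archive.) For reference, a rough
Python equivalent of the neighbour-count analysis above - a sketch only,
assuming a plain-text dump of "ceph pg dump" named pg.dump, a pool id of 10,
and the acting set in the 15th column as in the shell commands:

# Sketch: per-OSD count of distinct "neighbour" OSDs that share at
# least one PG with it, parsed from a text dump of "ceph pg dump".
from collections import defaultdict

neighbours = defaultdict(set)

with open("pg.dump") as f:
    for line in f:
        if not line.startswith("10."):         # pool id 10 only
            continue
        acting = line.split()[14].strip("[]")  # 15th column, e.g. "[15,23]"
        osds = [int(x) for x in acting.split(",")]
        for osd in osds:
            neighbours[osd].update(o for o in osds if o != osd)

counts = sorted(((len(peers), osd) for osd, peers in neighbours.items()),
                reverse=True)
print("average neighbours:", sum(c for c, _ in counts) / len(counts))
print("top 10:", counts[:10])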

Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary
Hi Erik,

On 15/08/2014 11:54, Erik Logtenberg wrote:
 Hi,
 
 With EC pools in Ceph you are free to choose any K and M parameters you
 like. The documentation explains what K and M do, so far so good.
 
 Now, there are certain combinations of K and M that appear to have more
 or less the same result. Do any of these combinations have pro's and
 con's that I should consider and/or are there best practices for
 choosing the right K/M-parameters?
 
 For instance, if I choose K = 3 and M = 2, then pg's in this pool will
 use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
 this configuration.
 
 Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
 use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
 so much different from the first configuration. Also there is the same
 40% overhead.

Although I don't have numbers in mind, I think the odds of losing two OSDs
simultaneously are a lot smaller than the odds of losing four OSDs
simultaneously. Or am I misunderstanding you when you write "statistically not
so much different from the first configuration"?

Cheers

 One rather obvious difference between the two configurations is that the
 latter requires a cluster with at least 10 OSD's to make sense. But
 let's say we have such a cluster, which of the two configurations would
 be recommended, and why?
 
 Thanks,
 
 Erik.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Wido den Hollander

On 08/15/2014 12:23 PM, Loic Dachary wrote:

Hi Erik,

On 15/08/2014 11:54, Erik Logtenberg wrote:

Hi,

With EC pools in Ceph you are free to choose any K and M parameters you
like. The documentation explains what K and M do, so far so good.

Now, there are certain combinations of K and M that appear to have more
or less the same result. Do any of these combinations have pro's and
con's that I should consider and/or are there best practices for
choosing the right K/M-parameters?



Loic might have a better answer, but I think that the more segments (K)
you have, the heavier the recovery. You have to contact more OSDs to
reconstruct the whole object, so that involves more disks doing seeks.


I heard somebody from Fujitsu say that he thought 8/3 was best for most
situations. That wasn't with Ceph though, but with a different system
which implemented Erasure Coding.



For instance, if I choose K = 3 and M = 2, then pg's in this pool will
use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
this configuration.

Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
so much different from the first configuration. Also there is the same
40% overhead.


Although I don't have numbers in mind, I think the odds of losing two OSDs
simultaneously are a lot smaller than the odds of losing four OSDs simultaneously.
Or am I misunderstanding you when you write "statistically not so much different
from the first configuration"?



Losing two smaller than losing four? Is that correct or did you mean
it the other way around?


I'd say that losing four OSDs simultaneously is less likely to happen
than two simultaneously.



Cheers


One rather obvious difference between the two configurations is that the
latter requires a cluster with at least 10 OSD's to make sense. But
let's say we have such a cluster, which of the two configurations would
be recommended, and why?

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Mark Nelson

On 08/15/2014 06:24 AM, Wido den Hollander wrote:

On 08/15/2014 12:23 PM, Loic Dachary wrote:

Hi Erik,

On 15/08/2014 11:54, Erik Logtenberg wrote:

Hi,

With EC pools in Ceph you are free to choose any K and M parameters you
like. The documentation explains what K and M do, so far so good.

Now, there are certain combinations of K and M that appear to have more
or less the same result. Do any of these combinations have pro's and
con's that I should consider and/or are there best practices for
choosing the right K/M-parameters?



Loic might have a better answer, but I think that the more segments (K)
you have, the heavier the recovery. You have to contact more OSDs to
reconstruct the whole object, so that involves more disks doing seeks.

I heard somebody from Fujitsu say that he thought 8/3 was best for most
situations. That wasn't with Ceph though, but with a different system
which implemented Erasure Coding.


Performance is definitely lower with more segments in Ceph.  I kind of 
gravitate toward 4/2 or 6/2, though that's just my own preference.





For instance, if I choose K = 3 and M = 2, then pg's in this pool will
use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
this configuration.

Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
so much different from the first configuration. Also there is the same
40% overhead.


Although I don't have numbers in mind, I think the odds of losing two
OSDs simultaneously are a lot smaller than the odds of losing four OSDs
simultaneously. Or am I misunderstanding you when you write
"statistically not so much different from the first configuration"?



Losing two smaller than losing four? Is that correct or did you mean
it the other way around?

I'd say that losing four OSDs simultaneously is less likely to happen
than two simultaneously.


This is true, though the more disks you spread your objects across, the
higher the likelihood that any given object will be affected by a lost OSD.
The extreme case is that every object is spread across every OSD
and losing any given OSD affects all objects.  I suppose the severity
depends on the size of your erasure coding stripe (k+m)
relative to the total number of OSDs.  I think this is perhaps what Erik
was getting at.
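
A rough way to see this effect numerically (a sketch only; it assumes shards
are spread uniformly across the cluster, which CRUSH only approximates, and
the 96-OSD figure is just an example):

# Sketch: expected fraction of objects that lose a shard when a single
# OSD dies, as a function of stripe width (k+m) and cluster size.
def affected_fraction(k, m, n_osds):
    return (k + m) / n_osds

for k, m in [(3, 2), (6, 4), (32, 32)]:
    print(f"k={k} m={m}: {affected_fraction(k, m, 96):.1%} of objects "
          f"touched by one OSD failure in a 96-OSD cluster")
# The wider the stripe, the larger the share of objects degraded by any
# one failure, even though each object also tolerates more lost shards.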





Cheers


One rather obvious difference between the two configurations is that the
latter requires a cluster with at least 10 OSD's to make sense. But
let's say we have such a cluster, which of the two configurations would
be recommended, and why?

Thanks,

Erik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg
 Now, there are certain combinations of K and M that appear to have more
 or less the same result. Do any of these combinations have pro's and
 con's that I should consider and/or are there best practices for
 choosing the right K/M-parameters?


 Loic might have a better answer, but I think that the more segments (K)
 you have, the heavier the recovery. You have to contact more OSDs to
 reconstruct the whole object, so that involves more disks doing seeks.

 I heard somebody from Fujitsu say that he thought 8/3 was best for most
 situations. That wasn't with Ceph though, but with a different system
 which implemented Erasure Coding.
 
 Performance is definitely lower with more segments in Ceph.  I kind of
 gravitate toward 4/2 or 6/2, though that's just my own preference.

This is indeed the kind of pro's and con's I was thinking about.
Performance-wise, I would expect differences, but I can think of both
positive and negative effects of bigger values for K.

For instance, yes, recovery takes more OSD's with bigger values of K, but
it seems to me that there are also fewer or smaller items to recover.
Also, read-performance generally appears to benefit from having a bigger
cluster (more parallelism), so I can imagine that bigger values of K
also provide an increase in read-performance.

Mark says more segments hurt performance though; are you referring just
to rebuild-performance or also to basic operational performance (read/write)?

 For instance, if I choose K = 3 and M = 2, then pg's in this pool will
 use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
 this configuration.

 Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
 use 10 OSD's and sustain the loss of 4 OSD's, which is statistically
 not
 so much different from the first configuration. Also there is the same
 40% overhead.

 Although I don't have numbers in mind, I think the odds of losing two
 OSDs simultaneously are a lot smaller than the odds of losing four OSDs
 simultaneously. Or am I misunderstanding you when you write
 "statistically not so much different from the first configuration"?


 Losing two smaller than losing four? Is that correct or did you mean
 it the other way around?

 I'd say that losing four OSDs simultaneously is less likely to happen
 than two simultaneously.
 
 This is true, though the more disks you spread your objects across, the
 higher the likelihood that any given object will be affected by a lost OSD.
 The extreme case is that every object is spread across every OSD and
 losing any given OSD affects all objects.  I suppose the severity
 depends on the size of your erasure coding stripe (k+m)
 relative to the total number of OSDs.  I think this is perhaps what Erik
 was getting at.

I haven't done the actual calculations, but given some % chance of disk
failure, I would assume that losing x out of y disks has roughly the
same chance as losing 2*x out of 2*y disks over the same period.

That's also why you generally want to limit RAID5 arrays to maybe 6
disks or so and move to RAID6 for bigger arrays. For arrays bigger than
20 disks you would usually split those into separate arrays, just to
keep the (parity disks / total disks) fraction high enough.

With regard to data safety I would guess that 3+2 and 6+4 are roughly
equal, although the behaviour of 6+4 is probably easier to predict
because bigger numbers make your calculations less dependent on
individual deviations in reliability.

Do you guys feel this argument is valid?
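
One way to test that intuition numerically is the naive independence model
used elsewhere in this thread - a sketch only, where p is an assumed per-OSD
probability of failing within one recovery window (about 1e-5 for one hour at
an 8% AFR):

# Sketch: probability that a single object is lost, i.e. that more than
# m of its k+m shards fail within the same recovery window, assuming
# independent failures.
from math import comb

def p_object_loss(k, m, p):
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(m + 1, n + 1))

p = 1e-5
for k, m in [(3, 2), (6, 4)]:
    print(f"k={k} m={m}: overhead {m/(k+m):.0%}, "
          f"loss probability per window ~ {p_object_loss(k, m, p):.1e}")
# Under this model 6+4 comes out many orders of magnitude more durable
# than 3+2 at the same 40% overhead, so "x out of y" and "2x out of 2y"
# are not equivalent - at least not when failures are independent.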

Erik.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary


On 15/08/2014 13:24, Wido den Hollander wrote:
 On 08/15/2014 12:23 PM, Loic Dachary wrote:
 Hi Erik,

 On 15/08/2014 11:54, Erik Logtenberg wrote:
 Hi,

 With EC pools in Ceph you are free to choose any K and M parameters you
 like. The documentation explains what K and M do, so far so good.

 Now, there are certain combinations of K and M that appear to have more
 or less the same result. Do any of these combinations have pro's and
 con's that I should consider and/or are there best practices for
 choosing the right K/M-parameters?

 
 Loic might have a better answer, but I think that the more segments (K) you
 have, the heavier the recovery. You have to contact more OSDs to reconstruct the
 whole object, so that involves more disks doing seeks.
 
 I heard somebody from Fujitsu say that he thought 8/3 was best for most
 situations. That wasn't with Ceph though, but with a different system which
 implemented Erasure Coding.
 
 For instance, if I choose K = 3 and M = 2, then pg's in this pool will
 use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
 this configuration.

 Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
 use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not
 so much different from the first configuration. Also there is the same
 40% overhead.

 Although I don't have numbers in mind, I think the odds of losing two OSDs
 simultaneously are a lot smaller than the odds of losing four OSDs
 simultaneously. Or am I misunderstanding you when you write "statistically
 not so much different from the first configuration"?

 
 Losing two smaller than losing four? Is that correct or did you mean it the
 other way around?


Right, sorry for the confusion, I meant the other way around :-)

 
 I'd say that losing four OSDs simultaneously is less likely to happen than
 two simultaneously.
 
 Cheers

 One rather obvious difference between the two configurations is that the
 latter requires a cluster with at least 10 OSD's to make sense. But
 let's say we have such a cluster, which of the two configurations would
 be recommended, and why?

 Thanks,

 Erik.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 
 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary


On 15/08/2014 14:36, Erik Logtenberg wrote:
 Now, there are certain combinations of K and M that appear to have more
 or less the same result. Do any of these combinations have pro's and
 con's that I should consider and/or are there best practices for
 choosing the right K/M-parameters?


 Loic might have a better answer, but I think that the more segments (K)
 you have, the heavier the recovery. You have to contact more OSDs to
 reconstruct the whole object, so that involves more disks doing seeks.

 I heard somebody from Fujitsu say that he thought 8/3 was best for most
 situations. That wasn't with Ceph though, but with a different system
 which implemented Erasure Coding.

 Performance is definitely lower with more segments in Ceph.  I kind of
 gravitate toward 4/2 or 6/2, though that's just my own preference.
 
 This is indeed the kind of pro's and con's I was thinking about.
 Performance-wise, I would expect differences, but I can think of both
 positive and negative effects of bigger values for K.
 
 For instance, yes, recovery takes more OSD's with bigger values of K, but
 it seems to me that there are also fewer or smaller items to recover.
 Also, read-performance generally appears to benefit from having a bigger
 cluster (more parallelism), so I can imagine that bigger values of K
 also provide an increase in read-performance.
 
 Mark says more segments hurt performance though; are you referring just
 to rebuild-performance or also to basic operational performance (read/write)?
 
 For instance, if I choose K = 3 and M = 2, then pg's in this pool will
 use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in
 this configuration.

 Now, if I were to choose K = 6 and M = 4, I would end up with pg's that
 use 10 OSD's and sustain the loss of 4 OSD's, which is statistically
 not
 so much different from the first configuration. Also there is the same
 40% overhead.

 Although I don't have numbers in mind, I think the odds of losing two
 OSDs simultaneously are a lot smaller than the odds of losing four OSDs
 simultaneously. Or am I misunderstanding you when you write
 "statistically not so much different from the first configuration"?


 Losing two smaller than losing four? Is that correct or did you mean
 it the other way around?

 I'd say that losing four OSDs simultaneously is less likely to happen
 than two simultaneously.

 This is true, though the more disks you spread your objects across, the
 higher the likelihood that any given object will be affected by a lost OSD.
 The extreme case is that every object is spread across every OSD and
 losing any given OSD affects all objects.  I suppose the severity
 depends on the size of your erasure coding stripe (k+m)
 relative to the total number of OSDs.  I think this is perhaps what Erik
 was getting at.
 
 I haven't done the actual calculations, but given some % chance of disk
 failure, I would assume that losing x out of y disks has roughly the
 same chance as losing 2*x out of 2*y disks over the same period.
 
 That's also why you generally want to limit RAID5 arrays to maybe 6
 disks or so and move to RAID6 for bigger arrays. For arrays bigger than
 20 disks you would usually split those into separate arrays, just to
 keep the (parity disks / total disks) fraction high enough.
 
 With regard to data safety I would guess that 3+2 and 6+4 are roughly
 equal, although the behaviour of 6+4 is probably easier to predict
 because bigger numbers make your calculations less dependent on
 individual deviations in reliability.
 
 Do you guys feel this argument is valid?

Here is how I reason about it, roughly:

If the probability of losing a disk is 0.1%, the probability of losing two 
disks simultaneously (i.e. before the failure can be recovered) would be 
0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
becomes 0.0001% 

Accurately calculating the reliability of the system as a whole is a lot more 
complex (see 
https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ 
for more information).
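
For reference, the same reasoning written out with the probabilities treated
as fractions (a minimal sketch; note that 0.1% is 0.001, so the products come
out smaller than the rough percentages above):

# Sketch: probability that d specific disks are all lost within the same
# recovery window, assuming independent failures. Probabilities multiply
# as fractions: 0.001 * 0.001 = 1e-6, i.e. 0.0001%.
p_single = 0.001   # 0.1% chance of losing a given disk within the window

for d in range(1, 5):
    p = p_single ** d
    print("%d specific disk(s) lost within the same window: %.0e (%.0e %%)"
          % (d, p, p * 100))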

Cheers

 
 Erik.
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Erik Logtenberg

 I haven't done the actual calculations, but given some % chance of disk
 failure, I would assume that losing x out of y disks has roughly the
 same chance as losing 2*x out of 2*y disks over the same period.

 That's also why you generally want to limit RAID5 arrays to maybe 6
 disks or so and move to RAID6 for bigger arrays. For arrays bigger than
 20 disks you would usually split those into separate arrays, just to
 keep the (parity disks / total disks) fraction high enough.

 With regard to data safety I would guess that 3+2 and 6+4 are roughly
 equal, although the behaviour of 6+4 is probably easier to predict
 because bigger numbers make your calculations less dependent on
 individual deviations in reliability.

 Do you guys feel this argument is valid?
 
 Here is how I reason about it, roughly:
 
 If the probability of losing a disk is 0.1%, the probability of losing two 
 disks simultaneously (i.e. before the failure can be recovered) would be 
 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
 becomes 0.0001% 
 
 Accurately calculating the reliability of the system as a whole is a lot more 
 complex (see 
 https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ 
 for more information).
 
 Cheers

Okay, I see that in your calculation, you leave the total number of
disks completely out of the equation. The link you provided is very
useful indeed and does some actual calculations. Interestingly, the
example in the details page [1] uses k=32 and m=32 for a total of 64 blocks.
Those are very much bigger values than Mark Nelson mentioned earlier. Is
that example merely meant to demonstrate the theoretical advantages, or
would you actually recommend using those numbers in practice?
Let's assume that we have at least 64 OSD's available; would you
recommend k=32 and m=32?

[1]
https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practice K/M-parameters EC pool

2014-08-15 Thread Loic Dachary


On 15/08/2014 15:42, Erik Logtenberg wrote:

 I haven't done the actual calculations, but given some % chance of disk
 failure, I would assume that losing x out of y disks has roughly the
 same chance as losing 2*x out of 2*y disks over the same period.

 That's also why you generally want to limit RAID5 arrays to maybe 6
 disks or so and move to RAID6 for bigger arrays. For arrays bigger than
 20 disks you would usually split those into separate arrays, just to
 keep the (parity disks / total disks) fraction high enough.

 With regard to data safety I would guess that 3+2 and 6+4 are roughly
 equal, although the behaviour of 6+4 is probably easier to predict
 because bigger numbers make your calculations less dependent on
 individual deviations in reliability.

 Do you guys feel this argument is valid?

 Here is how I reason about it, roughly:

 If the probability of losing a disk is 0.1%, the probability of losing two 
 disks simultaneously (i.e. before the failure can be recovered) would be 
 0.1*0.1 = 0.01% and three disks becomes 0.1*0.1*0.1 = 0.001% and four disks 
 becomes 0.0001% 

 Accurately calculating the reliability of the system as a whole is a lot 
 more complex (see 
 https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/
  for more information).

 Cheers
 
 Okay, I see that in your calculation, you leave the total number of
 disks completely out of the equation. 

Yes. If you have a small number of disks I'm not sure how to calculate the
durability. For instance, if I have a 50-disk cluster within a rack, the
durability is dominated by the probability that the rack is set on fire, and
increasing m from 3 to 5 is most certainly pointless ;-)

 The link you provided is very
 useful indeed and does some actual calculations. Interestingly, the
 example in the details page [1] uses k=32 and m=32 for a total of 64 blocks.
 Those are very much bigger values than Mark Nelson mentioned earlier. Is
 that example merely meant to demonstrate the theoretical advantages, or
 would you actually recommend using those numbers in practice?
 Let's assume that we have at least 64 OSD's available; would you
 recommend k=32 and m=32?

It is theoretical; I'm not aware of any Ceph use case requiring that kind of
setting. There may be a use case though - it's not absurd, just not common. I
would be happy to hear about it.

Cheers

 
 [1]
 https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

-- 
Loïc Dachary, Artisan Logiciel Libre



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com