Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Paul Emmerich
If you are so worried about the storage efficiency: why not use erasure
coding?
EC performs really well with Luminous in our experience.
Yes, it generates more backend IOPS, somewhat more CPU load, and higher
latency. But it's often worth a try.

Simple example for everyone considering 2/1 replicas: consider 2/2 erasure
coding.

* Data durability and availability of 3/2 replicas
* Storage efficiency of 2/1 replicas
* 33% more write IOPS than 3/2 replicas (each write touches k+m = 4 OSDs
instead of 3 replicas)
* 100% more read IOPS than any replica setup (reads hit k = 2 OSDs instead
of just the primary; 400% more with fast_read, which reads all k+m shards
to cut tail latency)

Of course, 2/2 erasure coding might seem stupid. We typically use 4/2, 5/2,
or 5/3.

So if you are worried about reducing storage overhead: try it out and see
for yourself how it performs
for your use case.
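
As a starting point, here is a minimal sketch of how one could test this
(profile and pool names are placeholders; adjust k/m, PG counts, and the
failure domain to your cluster):

    # define an erasure code profile, e.g. k=4, m=2
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
    # create a pool using that profile
    ceph osd pool create ecpool 128 128 erasure ec-4-2
    # optionally trade extra read IOPS for lower tail latency
    ceph osd pool set ecpool fast_read 1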

I've rescued several clusters that were configured with 2/1 replicas and
broke down in various ways... it's not pretty, and it can be annoying and
time-consuming to fix. Think tracking down a broken disk where the OSD
doesn't start up properly and trying to get the last copy of a PG off it
with ceph-objectstore-tool...
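
(Such a rescue looks roughly like this; the paths and pgid here are made up
for illustration, and both OSDs must be stopped first:

    # export the PG from the dead OSD's data store
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 2.1f --op export --file /tmp/pg.2.1f.export
    # import it into a healthy, stopped OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-7 \
        --op import --file /tmp/pg.2.1f.export

...and that is the easy case, where the OSD's store is still readable.)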



Paul


2018-05-25 9:48 GMT+02:00 Janne Johansson :

> [full quote of Janne's reply snipped; see his message further down the
> thread]


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Donny Davis
Nobody cares about their data until they don't have it anymore. Using
replica 3 is the same logic as RAID 6: it's likely that if one drive has
crapped out, more will meet their maker soon. If you care about your data,
then do what you can to keep it around. If it's a lab like mine, who cares;
it's all ephemeral to me. The decision is about your use case and workload.

If it were my production data, I would spend the money.

On Fri, May 25, 2018 at 3:48 AM, Janne Johansson 
wrote:

> [full quote of Janne's reply snipped; see his message further down the
> thread]


Re: [ceph-users] Ceph replication factor of 2

2018-05-25 Thread Janne Johansson
On Fri, 25 May 2018 at 00:20, Jack wrote:

> On 05/24/2018 11:40 PM, Stefan Kooman wrote:
> >> What are your thoughts, would you run 2x replication factor in
> >> Production and in what scenarios?
>> Me neither, mostly because I have yet to read a technical point of view
>> from someone who has read and understands the code.
>>
>> I do not buy Janne's "trust me, I am an engineer" -- who, btw, confirmed
>> that the "replica 3" stuff is subject to probability and is a function of
>> cluster size, and thus is not a generic "always-true" rule.
>

I did not ask for trust in _my_ experience or authority, but in that of the
people who first posted "everyone should probably use 3 replicas", which you
doubted. I agree with them, but did not intend to claim that my post had
extra value because it was written by me.

Also, the last part of my post was very much intended to add "not everything
about 3x is true for everyone" -- but if you value your data, it would be
very prudent to listen to experienced people who took risks and lost data
before.


Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Jack
On 05/24/2018 11:40 PM, Stefan Kooman wrote:
>> What are your thoughts, would you run 2x replication factor in
>> Production and in what scenarios?
Me neither, mostly because I have yet to read a technical point of view
from someone who has read and understands the code.

I do not buy Janne's "trust me, I am an engineer" -- who, btw, confirmed
that the "replica 3" stuff is subject to probability and is a function of
cluster size, and thus is not a generic "always-true" rule.





Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Stefan Kooman
Quoting Anthony Verevkin (anth...@verevkin.ca):
> [...] you would still occasionally have longer periods of degraded
> operation. To name a couple: a full node going down, or an operator
> deliberately wiping an OSD to rebuild it. min_size=1 in this case would
> leave you running with no redundancy at all. A DR scenario with
> pool-to-pool mirroring probably means that you cannot just replace the
> lost or incomplete PGs in your main site from your DR, because the DR
> site is likely to have a different PG layout, so a full resync from DR
> would be required if a disk were lost during such unprotected times.

... "min_size=1 in this case would leave you running with no redundancy
at all.". Exactly. And that would be the reason not to do it. DR is
asynchronous. What if the PG that gets lost has ACK'ed a WRITE but has
not been synchronised? Doing a "full resync" would bring you back in
time.

The DR site is not for free either, so I doubt that you actually really
win a lot here. I would opt for three datacenters: size=3, min_size=2
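
(For reference, those are per-pool settings; "mypool" below is just a
placeholder:

    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2

With min_size=2, a PG stops serving I/O once fewer than two copies are up,
instead of carrying on with no redundancy.)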

> 
> What are your thoughts, would you run 2x replication factor in
> Production and in what scenarios?

Not for me.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Alexandre DERUMIER
Hi,

>> My thoughts on the subject are that even though checksums do allow
>> finding which replica is corrupt without having to figure out which 2
>> out of 3 copies are the same, this is not the only reason min_size=2
>> was required.

AFAIK, comparing copies (like checking which 2 out of 3 copies are the
same) has never been implemented. pg repair, for example, still copies
the primary PG to the replicas (even if the primary is the corrupt one).


An old topic about this:
http://ceph-users.ceph.narkive.com/zS2yZ2FL/how-safe-is-ceph-pg-repair-these-days
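
(In other words, something like the following trusts whatever the primary
holds; the pool name and pgid are only examples:

    # find PGs flagged inconsistent by scrubbing
    rados list-inconsistent-pg mypool
    # "repair" then overwrites the replicas from the primary
    ceph pg repair 2.1f

...so it is worth checking which copy is actually good first.)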

----- Original Message -----
From: "Janne Johansson" <icepic...@gmail.com>
To: c...@jack.fr.eu.org
Cc: "ceph-users" <ceph-users@lists.ceph.com>
Sent: Thursday, 24 May 2018 08:33:32
Subject: Re: [ceph-users] Ceph replication factor of 2

On Thu, 24 May 2018 at 00:20, Jack <c...@jack.fr.eu.org> wrote:


[full quote of Janne's reply snipped; see his message further down the
thread]



Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Daniel Baumann
Hi,

I couldn't agree more, but just to re-emphasize what others already said:

  the point of replica 3 is not to have extra safety for
  (human|software|server) failures, but to have enough data around to
  allow rebalancing the cluster when disks fail.

After a certain number of disks in a cluster, you're going to get disk
failures all the time. Unless you pay extra attention (and waste lots and
lots of time/money) to carefully arranging/choosing disks from different
vendors' production lines/dates, simultaneous disk failures can happen
within minutes.


An example from our past:

On our (at that time small) cluster of 72 disks spread over 6 storage
nodes, half of the disks were Seagate Enterprise Capacity, the other half
Western Digital Red Pro. For each manufacturer, we bought the disks from
two different production batches. So we had:

  * 18 disks WD, production batch A
  * 18 disks WD, production batch B
  * 18 disks Seagate, production batch C
  * 18 disks Seagate, production batch D

One day, 6 disks failed simultaneously, spread over two storage nodes.
Had we been running replica 2, we couldn't have recovered and would have
lost data. Instead, because of replica 3, we didn't lose any data, and
Ceph automatically rebalanced all data before further disks failed.


So: if the data stored on the cluster is valuable (because it costs much
time and effort to 're-collect' it, or you can't accept the time it takes
to restore from backup, or worse, to re-create it from scratch), you have
to assume that whatever manufacturer/production batch of HDs you're using,
they *can* all fail at the same time, because you could have hit a faulty
production run.

The only way out here is replica >= 3.

(Of course, the whole MTBF discussion and why RAID doesn't scale apply as
well.)

Regards,
Daniel


Re: [ceph-users] Ceph replication factor of 2

2018-05-24 Thread Janne Johansson
On Thu, 24 May 2018 at 00:20, Jack wrote:

> Hi,
>
> I have to say, this is a common yet worthless argument.
> If I have 3000 OSDs, using 2 or 3 replicas will not change much: the
> probability of losing 2 devices is still "high".
> On the other hand, if I have a small cluster of less than a hundred OSDs,
> that same probability becomes "low".
>

It's about losing the 2 or 3 OSDs that any particular PG is on that
matters, not whether there are 1000 other OSDs in the next rack.
Losing data is rather binary; it's not a 0.0 -> 1.0 scale. Either a
piece of data is lost because its storage units are not there,
or it's not. Murphy's law will make it so that this lost piece of data is
rather important to you. And Murphy will of course pick the
2-3 OSDs that are the worst case for you.
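
To put rough numbers on it (all of them assumed for illustration, none
from this thread): say each OSD shares PGs with ~100 peers, disks have a
3% annual failure rate, and backfill after a failure takes 8 hours. The
chance that one specific peer dies inside that window is about
0.03 * 8/8760 ~= 0.003%; with 2 replicas, any of the ~100 peers dying in
the window kills a PG, so ~0.3% per failure event. A 3000-OSD cluster
sees ~90 failures a year, which compounds to a sizable yearly chance of
losing some PG. With 3 replicas a *second* overlapping peer must also die
inside the same window, which multiplies in another tiny factor and drops
the yearly risk by orders of magnitude. Cluster size mostly sets how
often you roll the dice; the replica count sets how badly each roll can
go.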


>
> I do not buy the "if someone is doing maintenance and a device fails"
> argument either: this is a no-limit goal. What if X servers burn at the
> same time? What if an admin makes a mistake and drops 5 OSDs? What if
> some network ToR switches or routers blow away?
> Should we do one replica per OSD?
>
From my viewpoint, maintenance must happen. Unplanned maintenance will
happen even if I wish it not to.
So the 2-vs-3 question is about what situation you end up in when one
replica is under (planned or not) maintenance.
Is it a "any surprise makes me lose data now" mode, or is it "many
surprises need to occur"?


>
> I would like people, especially Ceph's devs and other people who know
> how it works deeply (read the code!), to give us their advice.
>

How about listening to people who have lost data during 20+ year careers
in storage?
They will know a lot more about how the very improbable or "impossible"
still happened to them at the most unfortunate moment, regardless of what
the code readers say.

This is all about weighing risks. If the risk for you is "ok, then I have
to redownload that lost ubuntu-ISO again", it's fine to keep data in only
one place.

If the company goes out of business, or at least faces a 2-day total stop
while some sleep-deprived admin tries bare-metal restores for the first
time in her life, then the price of SATA disks to cover 3 replicas will be
literally nothing compared to that.

To me it sounds like you are chasing some kind of validation of an answer
you already have while asking the questions, so if you want to go
2-replicas, then just do it. But you don't get to complain to ceph or
ceph-users when you also figure out that the Mean-Time-Between-Failure
ratings on the stickers of the disks are bogus, and what you really needed
was "mean time between surprises", and that's always less than MTBF.

-- 
May the most significant bit of your life be positive.


Re: [ceph-users] Ceph replication factor of 2

2018-05-23 Thread Jack
Hi,

About BlueStore: sure, there are checksums, but are they fully used?
Rumor has it that on a replicated pool, during recovery, they are not.


> My thoughts on the subject are that even though checksums do allow
> finding which replica is corrupt without having to figure out which 2
> out of 3 copies are the same, this is not the only reason min_size=2 was
> required. [...] min_size=1 in this case would leave you running with no
> redundancy at all. [...]

I have to say, this is a common yet worthless argument.
If I have 3000 OSDs, using 2 or 3 replicas will not change much: the
probability of losing 2 devices is still "high".

On the other hand, if I have a small cluster of less than a hundred OSDs,
that same probability becomes "low".

I do not buy the "if someone is doing maintenance and a device fails"
argument either: this is a no-limit goal. What if X servers burn at the
same time? What if an admin makes a mistake and drops 5 OSDs? What if some
network ToR switches or routers blow away?
Should we do one replica per OSD?


Thus, I would like to emphasise the technical sanity of using 2 replicas,
versus the organisational sanity of doing so.

Organisational matters are specific to everybody; technical ones are
shared by all clusters.

I would like people, especially Ceph's devs and other people who know how
it works deeply (read the code!), to give us their advice.

Regards,


[ceph-users] Ceph replication factor of 2

2018-05-23 Thread Anthony Verevkin
This week at the OpenStack Summit in Vancouver, I could hear people
entertaining the idea of running Ceph with a replication factor of 2.

Karl Vietmeier of Intel suggested that we use 2x replication because
BlueStore comes with checksums.
https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/21370/supporting-highly-transactional-and-low-latency-workloads-on-ceph

Later, there was a question from the audience during the Ceph DR/mirroring talk 
on whether we could use 2x replication if we also mirror to DR.
https://www.openstack.org/summit/vancouver-2018/summit-schedule/events/20749/how-to-survive-an-openstack-cloud-meltdown-with-ceph

So the interest is definitely there: not giving up a third of your disk
space and performance is promising. But on the other hand, it comes with
higher risks.

I wonder if we as a community could come to some consensus, now that the
established practice of requiring size=3, min_size=2 is being challenged.


My thoughts on the subject are that even though checksums do allow finding
which replica is corrupt without having to figure out which 2 out of 3
copies are the same, this is not the only reason min_size=2 was required.
Even if you are running all-SSD, which is more reliable than HDD, and are
keeping the disk size small so you can backfill quickly after a single disk
failure, you would still occasionally have longer periods of degraded
operation. To name a couple: a full node going down, or an operator
deliberately wiping an OSD to rebuild it. min_size=1 in this case would
leave you running with no redundancy at all.

A DR scenario with pool-to-pool mirroring probably means that you cannot
just replace the lost or incomplete PGs in your main site from your DR,
because the DR site is likely to have a different PG layout, so a full
resync from DR would be required if a disk were lost during such
unprotected times.

What are your thoughts, would you run 2x replication factor in Production and 
in what scenarios?

Regards,
Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com