Re: [ceph-users] how possible is that ceph cluster crash

2016-11-23 Thread Samuel Just
Seems like that would be helpful.  I'm not really familiar with
ceph-disk though.
-Sam

On Wed, Nov 23, 2016 at 5:24 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> Hi Sam,
>
> Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype}
> variable be a good idea? It could either strip it out or
> fail to start the OSD unless an override flag is specified somewhere.
>
> Looking at ceph-disk code, I would imagine around here would be the right 
> place to put the check
> https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642
>
> I don't mind trying to get this done if it's felt to be worthwhile.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>> Samuel Just
>> Sent: 19 November 2016 00:31
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>>
>> Many reasons:
>>
>> 1) You will eventually get a DC wide power event anyway at which point 
>> probably most of the OSDs will have hopelessly corrupted
>> internal xfs structures (yes, I have seen this happen to a poor soul with a 
>> DC with redundant power).
>> 2) Even in the case of a single rack/node power failure, the biggest danger
>> isn't that the OSDs don't start.  It's that they *do start*, but have
>> forgotten or arbitrarily corrupted a random subset of the transactions they
>> told other osds and clients they had committed.  The exact impact would be
>> random, but for sure, any guarantees Ceph normally provides would be out the
>> window.  RBD devices could have random byte ranges zapped back in time (not
>> great if they're the offsets assigned to your database or fs journal...) for
>> instance.
>> 3) Deliberately powercycling a node counts as a power failure if you don't 
>> stop services and sync etc first.
>>
>> In other words, don't mess with the definition of "committing a transaction" 
>> if you value your data.
>> -Sam "just say no" Just
>>
>> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> > Yes, because these things happen
>> >
>> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
>> >
>> > We had customers who had kit in this DC.
>> >
>> > To use your analogy, it's like crossing the road at traffic lights but
>> > not checking cars have stopped. You might be OK 99% of the time, but
>> > sooner or later it will bite you in the arse and it won't be pretty.
>> >
>> > 
>> > From: "Brian ::" <b...@iptel.co>
>> > Sent: 18 Nov 2016 11:52 p.m.
>> > To: sj...@redhat.com
>> > Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
>> > Subject: Re: [ceph-users] how possible is that ceph cluster crash
>> >
>> >>
>> >>
>> >> This is like your mother telling you not to cross the road when you were
>> >> 4 years of age but not telling you it was because you could be
>> >> flattened by a car :)
>> >>
>> >> Can you expand on your answer? If you are in a DC with AB power,
>> >> redundant UPS, dual feed from the electric company, onsite
>> >> generators, dual PSU servers, is it still a bad idea?
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> Never *ever* use nobarrier with ceph under *any* circumstances.  I
>> >>> cannot stress this enough.
>> >>> -Sam
>> >>>
>> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi Nick and other Cephers,
>> >>>>
>> >>>> Thanks for your reply.
>> >>>>
>> >>>>> 2) Config Errors
>> >>>>> This can be an easy one to assume you are safe from. But I would say
>> >>>>> most outages and data loss incidents I have seen on the mailing
>> >>>>> lists have been due to poor hardware choices or configuring options
>> >>>>> such as size=2, min_size=1 or enabling stuff like nobarrier.

Re: [ceph-users] how possible is that ceph cluster crash

2016-11-23 Thread Nick Fisk
Hi Sam,

Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype}
variable be a good idea? It could either strip it out or
fail to start the OSD unless an override flag is specified somewhere.

Looking at ceph-disk code, I would imagine around here would be the right place 
to put the check
https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642

I don't mind trying to get this done if it's felt to be worthwhile.
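
Something along these lines is what I have in mind - a rough sketch only, not
actual ceph-disk code (the function name and the override flag are
hypothetical, for illustration):

    # Hypothetical sketch, not taken from ceph-disk. Assumes the resolved
    # mount-options string (e.g. the value of osd_mount_options_xfs) and an
    # override flag are available at the call site.
    def check_mount_options(options, allow_nobarrier=False):
        """Refuse to mount an OSD whose options disable write barriers."""
        opts = [o.strip() for o in options.split(',')]
        if 'nobarrier' in opts and not allow_nobarrier:
            raise RuntimeError('refusing to mount OSD with nobarrier; '
                               'set the override flag if you really mean it')
        return options

Stripping the option out silently would also work, but failing loudly seems
safer, as it forces the operator to acknowledge the risk.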

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Samuel Just
> Sent: 19 November 2016 00:31
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] how possible is that ceph cluster crash
> 
> Many reasons:
> 
> 1) You will eventually get a DC wide power event anyway at which point 
> probably most of the OSDs will have hopelessly corrupted
> internal xfs structures (yes, I have seen this happen to a poor soul with a 
> DC with redundant power).
> 2) Even in the case of a single rack/node power failure, the biggest danger
> isn't that the OSDs don't start.  It's that they *do start*, but have
> forgotten or arbitrarily corrupted a random subset of the transactions they
> told other osds and clients they had committed.  The exact impact would be
> random, but for sure, any guarantees Ceph normally provides would be out the
> window.  RBD devices could have random byte ranges zapped back in time (not
> great if they're the offsets assigned to your database or fs journal...) for
> instance.
> 3) Deliberately powercycling a node counts as a power failure if you don't 
> stop services and sync etc first.
> 
> In other words, don't mess with the definition of "committing a transaction" 
> if you value your data.
> -Sam "just say no" Just
> 
> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> > Yes, because these things happen
> >
> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
> >
> > We had customers who had kit in this DC.
> >
> > To use your analogy, it's like crossing the road at traffic lights but
> > not checking cars have stopped. You might be OK 99% of the time, but
> > sooner or later it will bite you in the arse and it won't be pretty.
> >
> > ____________
> > From: "Brian ::" <b...@iptel.co>
> > Sent: 18 Nov 2016 11:52 p.m.
> > To: sj...@redhat.com
> > Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
> > Subject: Re: [ceph-users] how possible is that ceph cluster crash
> >
> >>
> >>
> >> This is like your mother telling you not to cross the road when you were
> >> 4 years of age but not telling you it was because you could be
> >> flattened by a car :)
> >>
> >> Can you expand on your answer? If you are in a DC with AB power,
> >> redundant UPS, dual feed from the electric company, onsite
> >> generators, dual PSU servers, is it still a bad idea?
> >>
> >>
> >>
> >>
> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
> >>>
> >>> Never *ever* use nobarrier with ceph under *any* circumstances.  I
> >>> cannot stress this enough.
> >>> -Sam
> >>>
> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
> >>> wrote:
> >>>>
> >>>> Hi Nick and other Cephers,
> >>>>
> >>>> Thanks for your reply.
> >>>>
> >>>>> 2) Config Errors
> >>>>> This can be an easy one to assume you are safe from. But I would say
> >>>>> most outages and data loss incidents I have seen on the mailing
> >>>>> lists have been due to poor hardware choices or configuring options
> >>>>> such as size=2, min_size=1 or enabling stuff like nobarrier.
> >>>>
> >>>>
> >>>> I am wondering about the pros and cons of the nobarrier option used by Ceph.
> >>>>
> >>>> It is well known that nobarrier is dangerous when a power outage
> >>>> happens, but if we already have replicas in different racks or
> >>>> PDUs, will Ceph reduce the risk of data loss with this option?

Re: [ceph-users] how possible is that ceph cluster crash

2016-11-19 Thread Brian ::
HI Lionel,

Mega Ouch - I've recently seen the act of measuring power consumption
in a data centre (they clamp a probe onto the cable for an amp reading,
seemingly) take out a cabinet which had *redundant* power feeds - so
anything is possible, I guess.

Regards
Brian


On Sat, Nov 19, 2016 at 11:20 AM, Lionel Bouton wrote:
> On 19/11/2016 at 00:52, Brian :: wrote:
>> This is like your mother telling you not to cross the road when you were 4
>> years of age but not telling you it was because you could be flattened
>> by a car :)
>>
>> Can you expand on your answer? If you are in a DC with AB power,
>> redundant UPS, dual feed from the electric company, onsite generators,
>> dual PSU servers, is it still a bad idea?
>
> Yes it is.
>
> In such a datacenter, where we have a Ceph cluster, there was a complete
> shutdown because of a design error: the probes used by the system
> responsible for starting and stopping the generators were installed
> before the breakers on the feeds. After a blackout during which the
> generators kicked in, the breakers opened due to a surge when power was
> restored. The generators were stopped because power was restored, and
> the UPS systems failed 3 minutes later. Closing the breakers couldn't be
> done in time (you don't approach them without heavy protective gear, and
> putting on the suit takes more time than simply closing the breaker).
>
> There's no such thing as an uninterruptible power supply.
>
> Best regards,
>
> Lionel


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-19 Thread Lionel Bouton
On 19/11/2016 at 00:52, Brian :: wrote:
> This is like your mother telling you not to cross the road when you were 4
> years of age but not telling you it was because you could be flattened
> by a car :)
>
> Can you expand on your answer? If you are in a DC with AB power,
> redundant UPS, dual feed from the electric company, onsite generators,
> dual PSU servers, is it still a bad idea?

Yes it is.

In such a datacenter, where we have a Ceph cluster, there was a complete
shutdown because of a design error: the probes used by the system
responsible for starting and stopping the generators were installed
before the breakers on the feeds. After a blackout during which the
generators kicked in, the breakers opened due to a surge when power was
restored. The generators were stopped because power was restored, and
the UPS systems failed 3 minutes later. Closing the breakers couldn't be
done in time (you don't approach them without heavy protective gear, and
putting on the suit takes more time than simply closing the breaker).

There's no such thing as an uninterruptible power supply.

Best regards,

Lionel


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-18 Thread Samuel Just
Many reasons:

1) You will eventually get a DC wide power event anyway at which point
probably most of the OSDs will have hopelessly corrupted internal xfs
structures (yes, I have seen this happen to a poor soul with a DC with
redundant power).
2) Even in the case of a single rack/node power failure, the biggest
danger isn't that the OSDs don't start.  It's that they *do start*,
but have forgotten or arbitrarily corrupted a random subset of the
transactions they told other osds and clients they had committed.  The exact
impact would be random, but for sure, any guarantees Ceph normally
provides would be out the window.  RBD devices could have random byte
ranges zapped back in time (not great if they're the offsets assigned
to your database or fs journal...) for instance.
3) Deliberately powercycling a node counts as a power failure if you
don't stop services and sync etc first.
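
Concretely, "stop services and sync first" for a single OSD node means
something like the following minimal sketch (a hypothetical drill assuming a
systemd-managed cluster; adapt to your deployment):

    import subprocess

    def run(cmd):
        # Echo and execute an administrative command, failing loudly on error.
        print('+ ' + ' '.join(cmd))
        subprocess.check_call(cmd)

    run(['ceph', 'osd', 'set', 'noout'])       # planned outage: don't trigger recovery
    run(['systemctl', 'stop', 'ceph.target'])  # stop daemons cleanly so they flush
    run(['sync'])                              # push remaining dirty buffers to disk
    # ...powercycle the node, then once it is back up:
    # run(['ceph', 'osd', 'unset', 'noout'])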

In other words, don't mess with the definition of "committing a
transaction" if you value your data.
-Sam "just say no" Just

On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> Yes, because these things happen
>
> http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
>
> We had customers who had kit in this DC.
>
> To use your analogy, it's like crossing the road at traffic lights but not
> checking cars have stopped. You might be OK 99% of the time, but sooner or
> later it will bite you in the arse and it won't be pretty.
>
> 
> From: "Brian ::" <b...@iptel.co>
> Sent: 18 Nov 2016 11:52 p.m.
> To: sj...@redhat.com
> Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>
>>
>>
>> This is like your mother telling you not to cross the road when you were 4
>> years of age but not telling you it was because you could be flattened
>> by a car :)
>>
>> Can you expand on your answer? If you are in a DC with AB power,
>> redundant UPS, dual feed from the electric company, onsite generators,
>> dual PSU servers, is it still a bad idea?
>>
>>
>>
>>
>> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Never *ever* use nobarrier with ceph under *any* circumstances.  I
>>> cannot stress this enough.
>>> -Sam
>>>
>>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
>>> wrote:
>>>>
>>>> Hi Nick and other Cephers,
>>>>
>>>> Thanks for your reply.
>>>>
>>>>> 2) Config Errors
>>>>> This can be an easy one to assume you are safe from. But I would say most
>>>>> outages and data loss incidents I have seen on the mailing
>>>>> lists have been due to poor hardware choices or configuring options such
>>>>> as size=2, min_size=1 or enabling stuff like nobarrier.
>>>>
>>>>
>>>> I am wondering about the pros and cons of the nobarrier option used by Ceph.
>>>>
>>>> It is well known that nobarrier is dangerous when a power outage happens,
>>>> but if we already have replicas in different racks or PDUs, will Ceph
>>>> reduce the risk of data loss with this option?
>>>>
>>>> I have seen many performance tuning articles recommending the nobarrier
>>>> option for xfs, but not many of them mention the trade-off of nobarrier.
>>>>
>>>> Is it really unacceptable to use nobarrier in a production environment? I
>>>> would be most grateful if you are willing to share any experiences with
>>>> nobarrier and xfs.
>>>>
>>>> Sincerely,
>>>> Craig Chi (Product Developer)
>>>> Synology Inc. Taipei, Taiwan. Ext. 361
>>>>
>>>> On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>>> Of
>>>>> Pedro Benites
>>>>> Sent: 16 November 2016 17:51
>>>>> To: ceph-users@lists.ceph.com
>>>>> Subject: [ceph-users] how possible is that ceph cluster crash
>>>>>
>>>>> Hi,

Re: [ceph-users] how possible is that ceph cluster crash

2016-11-18 Thread Nick Fisk

Yes, because these things happen

http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/

We had customers who had kit in this DC.

To use your analogy, it's like crossing the road at traffic lights but
not checking cars have stopped. You might be OK 99% of the time, but
sooner or later it will bite you in the arse and it won't be pretty.



From: "Brian ::" <b...@iptel.co>
Sent: 18 Nov 2016 11:52 p.m.
To: sj...@redhat.com
Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
Subject: Re: [ceph-users] how possible is that ceph cluster crash



This is like your mother telling you not to cross the road when you were 4
years of age but not telling you it was because you could be flattened
by a car :)

Can you expand on your answer? If you are in a DC with AB power,
redundant UPS, dual feed from the electric company, onsite generators,
dual PSU servers, is it still a bad idea?




On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:

Never *ever* use nobarrier with ceph under *any* circumstances.  I
cannot stress this enough.
-Sam

On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com> wrote:

Hi Nick and other Cephers,

Thanks for your reply.


2) Config Errors
This can be an easy one to assume you are safe from. But I would say most
outages and data loss incidents I have seen on the mailing
lists have been due to poor hardware choices or configuring options such as
size=2, min_size=1 or enabling stuff like nobarrier.


I am wondering about the pros and cons of the nobarrier option used by Ceph.

It is well known that nobarrier is dangerous when a power outage happens, but
if we already have replicas in different racks or PDUs, will Ceph reduce the
risk of data loss with this option?

I have seen many performance tuning articles recommending the nobarrier option
for xfs, but not many of them mention the trade-off of nobarrier.

Is it really unacceptable to use nobarrier in a production environment? I
would be most grateful if you are willing to share any experiences with
nobarrier and xfs.

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan. Ext. 361

On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:


-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Pedro Benites
Sent: 16 November 2016 17:51
To: ceph-users@lists.ceph.com
Subject: [ceph-users] how possible is that ceph cluster crash

Hi,

I have a ceph cluster with 50 TB, with 15 osds. It has been working fine for
one year and I would like to grow it and migrate all my old storage, about
100 TB, to ceph, but I have a doubt. How likely is it that the cluster fails
and everything goes very bad?


Everything is possible; I think there are 3 main risks:

1) Hardware failure
I would say Ceph is probably one of the safest options in regards to
hardware failures, certainly if you start using 4TB+ disks.

2) Config Errors
This can be an easy one to assume you are safe from. But I would say most
outages and data loss incidents I have seen on the mailing
lists have been due to poor hardware choices or configuring options such as
size=2, min_size=1 or enabling stuff like nobarrier.

3) Ceph Bugs
Probably the rarest, but potentially the most scary as you have less
control. They do happen and it's something to be aware of.

How reliable is ceph? What is the risk of losing my data? Is it necessary to
back up my data?


Yes, always back up your data, no matter what solution you use. Just as RAID
!= backup, Ceph != backup either.



Regards.
Pedro.






Sent from Synology MailPlus









Re: [ceph-users] how possible is that ceph cluster crash

2016-11-18 Thread Brian ::
This is like your mother telling you not to cross the road when you were 4
years of age but not telling you it was because you could be flattened
by a car :)

Can you expand on your answer? If you are in a DC with AB power,
redundant UPS, dual feed from the electric company, onsite generators,
dual PSU servers, is it still a bad idea?




On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
> Never *ever* use nobarrier with ceph under *any* circumstances.  I
> cannot stress this enough.
> -Sam
>
> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com> wrote:
>> Hi Nick and other Cephers,
>>
>> Thanks for your reply.
>>
>>> 2) Config Errors
>>> This can be an easy one to assume you are safe from. But I would say most
>>> outages and data loss incidents I have seen on the mailing
>>> lists have been due to poor hardware choices or configuring options such as
>>> size=2, min_size=1 or enabling stuff like nobarrier.
>>
>> I am wondering about the pros and cons of the nobarrier option used by Ceph.
>>
>> It is well known that nobarrier is dangerous when a power outage happens, but
>> if we already have replicas in different racks or PDUs, will Ceph reduce the
>> risk of data loss with this option?
>>
>> I have seen many performance tuning articles recommending the nobarrier
>> option for xfs, but not many of them mention the trade-off of nobarrier.
>>
>> Is it really unacceptable to use nobarrier in a production environment? I
>> would be most grateful if you are willing to share any experiences with
>> nobarrier and xfs.
>>
>> Sincerely,
>> Craig Chi (Product Developer)
>> Synology Inc. Taipei, Taiwan. Ext. 361
>>
>> On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:
>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Pedro Benites
>>> Sent: 16 November 2016 17:51
>>> To: ceph-users@lists.ceph.com
>>> Subject: [ceph-users] how possible is that ceph cluster crash
>>>
>>> Hi,
>>>
>>> I have a ceph cluster with 50 TB, with 15 osds. It has been working fine
>>> for one year and I would like to grow it and migrate all my old storage,
>>> about 100 TB, to ceph, but I have a doubt. How likely is it that the
>>> cluster fails and everything goes very bad?
>>
>> Everything is possible; I think there are 3 main risks:
>>
>> 1) Hardware failure
>> I would say Ceph is probably one of the safest options in regards to
>> hardware failures, certainly if you start using 4TB+ disks.
>>
>> 2) Config Errors
>> This can be an easy one to assume you are safe from. But I would say most
>> outages and data loss incidents I have seen on the mailing
>> lists have been due to poor hardware choices or configuring options such as
>> size=2, min_size=1 or enabling stuff like nobarrier.
>>
>> 3) Ceph Bugs
>> Probably the rarest, but potentially the most scary as you have less
>> control. They do happen and it's something to be aware of.
>>
>>> How reliable is ceph? What is the risk of losing my data? Is it necessary
>>> to back up my data?
>>
>> Yes, always back up your data, no matter what solution you use. Just as
>> RAID != backup, Ceph != backup either.
>>
>>>
>>> Regards.
>>> Pedro.
>>
>>
>>
>>
>>
>> Sent from Synology MailPlus
>>


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-18 Thread Craig Chi
Hi Nick and other Cephers,

Thanks for your reply.
> 2) Config Errors
> This can be an easy one to assume you are safe from. But I would say most
> outages and data loss incidents I have seen on the mailing lists have been
> due to poor hardware choices or configuring options such as size=2,
> min_size=1 or enabling stuff like nobarrier.

I am wondering about the pros and cons of the nobarrier option used by Ceph.

It is well known that nobarrier is dangerous when a power outage happens, but
if we already have replicas in different racks or PDUs, will Ceph reduce the
risk of data loss with this option?

I have seen many performance tuning articles recommending the nobarrier option
for xfs, but not many of them mention the trade-off of nobarrier.

Is it really unacceptable to use nobarrier in a production environment? I
would be most grateful if you are willing to share any experiences with
nobarrier and xfs.
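
For what it is worth, here is a small sketch of how a node could be audited
for the option by scanning /proc/mounts (it assumes the default
/var/lib/ceph/osd mount layout; adjust the prefix for your setup):

    # Sketch: warn about any OSD filesystem currently mounted with nobarrier.
    with open('/proc/mounts') as mounts:
        for line in mounts:
            device, mountpoint, fstype, options = line.split()[:4]
            if (mountpoint.startswith('/var/lib/ceph/osd')
                    and 'nobarrier' in options.split(',')):
                print('WARNING: %s on %s (%s) is mounted with nobarrier'
                      % (device, mountpoint, fstype))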

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan. Ext. 361

On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Pedro Benites
> > Sent: 16 November 2016 17:51
> > To: ceph-users@lists.ceph.com
> > Subject: [ceph-users] how possible is that ceph cluster crash
> >
> > Hi,
> >
> > I have a ceph cluster with 50 TB, with 15 osds. It has been working fine
> > for one year and I would like to grow it and migrate all my old storage,
> > about 100 TB, to ceph, but I have a doubt. How likely is it that the
> > cluster fails and everything goes very bad?
>
> Everything is possible; I think there are 3 main risks:
>
> 1) Hardware failure
> I would say Ceph is probably one of the safest options in regards to
> hardware failures, certainly if you start using 4TB+ disks.
>
> 2) Config Errors
> This can be an easy one to assume you are safe from. But I would say most
> outages and data loss incidents I have seen on the mailing lists have been
> due to poor hardware choices or configuring options such as size=2,
> min_size=1 or enabling stuff like nobarrier.
>
> 3) Ceph Bugs
> Probably the rarest, but potentially the most scary as you have less
> control. They do happen and it's something to be aware of.
>
> > How reliable is ceph? What is the risk of losing my data? Is it necessary
> > to back up my data?
>
> Yes, always back up your data, no matter what solution you use. Just as
> RAID != backup, Ceph != backup either.
>
> > Regards.
> > Pedro.






Re: [ceph-users] how possible is that ceph cluster crash

2016-11-16 Thread Goncalo Borges
Hello Pedro...

These are extremely generic questions, and therefore hard to answer.  Nick did
a good job of defining the risks.

In our case, we have been running a Ceph/CephFS system in production for over
a year, and before that, we spent a year trying to understand Ceph as well.

Ceph is incredibly good at dealing with hardware failures, so it is a
powerful tool if you are using commodity hardware. If your disks fail, or
even if a fraction of your hosts fail, it is able to cope and recover
properly (to a given extent) if you have the proper crush rules in place
(the default ones do a good job on that) and free space available. To be on
the safe side:
- decouple mons from osd servers
- check the RAM requirements for your osd servers (they depend on the number
of osds in each server)
- have at least 3 mons in a production system
- use 3x replication
There is a good info page on hardware requirements in the ceph wiki.

However, the devil is in the details. Ceph is a complex system still under
constant development. Wrong configurations might lead to performance problems.
If your network is not reliable, that might lead to flapping osds, which in
turn might lead to problems in your pgs. When your osds start to become full
(a single full osd freezes all I/O to the cluster) many problems may start to
appear. Finally, there are bugs. Their number is not huge and there is a
really good effort from the developers and from the community to address them
in a fast and reliable way. However, sometimes it is difficult to diagnose
what could be wrong because of the many layers involved. It is not infrequent
that we have to go and look at the source code to figure out (when possible)
what may be happening. So, I would say that there is a learning curve that
myself and others are still going through.

Regards
Gonçalo






From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Pedro Benites 
[pbeni...@litholaser.com]
Sent: 17 November 2016 04:50
To: ceph-users@lists.ceph.com
Subject: [ceph-users] how possible is that ceph cluster crash

Hi,

I have a ceph cluster with 50 TB, with 15 osds. It has been working fine for
one year and I would like to grow it and migrate all my old storage, about
100 TB, to ceph, but I have a doubt. How likely is it that the cluster fails
and everything goes very bad? How reliable is ceph? What is the risk of
losing my data? Is it necessary to back up my data?

Regards.
Pedro.


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-16 Thread Nick Fisk


> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Pedro Benites
> Sent: 16 November 2016 17:51
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] how possible is that ceph cluster crash
> 
> Hi,
> 
> I have a ceph cluster with 50 TB, with 15 osds. It has been working fine for
> one year and I would like to grow it and migrate all my old storage, about
> 100 TB, to ceph, but I have a doubt. How likely is it that the cluster fails
> and everything goes very bad?

Everything is possible; I think there are 3 main risks:

1) Hardware failure
I would say Ceph is probably one of the safest options in regards to hardware 
failures, certainly if you start using 4TB+ disks.

2) Config Errors
This can be an easy one to assume you are safe from. But I would say most
outages and data loss incidents I have seen on the mailing lists have been
due to poor hardware choices or configuring options such as size=2,
min_size=1 or enabling stuff like nobarrier. (A quick audit sketch for the
pool settings follows below.)

3) Ceph Bugs
Probably the rarest, but potentially the most scary as you have less control. 
They do happen and it's something to be aware of.
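
As a rough sketch of such an audit (assuming the ceph CLI is on the path, its
usual "key: value" output from "ceph osd pool get", and pool names taken from
"ceph osd pool ls"):

    import subprocess

    def ceph(*args):
        # Run a ceph CLI command and return its stdout as text.
        return subprocess.check_output(('ceph',) + args).decode()

    for pool in ceph('osd', 'pool', 'ls').split():
        # 'ceph osd pool get <pool> size' prints e.g. 'size: 3'
        size = int(ceph('osd', 'pool', 'get', pool, 'size').split(':')[1])
        min_size = int(ceph('osd', 'pool', 'get', pool, 'min_size').split(':')[1])
        if size < 3 or min_size < 2:
            print('WARNING: pool %s has size=%d, min_size=%d'
                  % (pool, size, min_size))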

> How reliable is ceph? What is the risk of losing my data? Is it necessary
> to back up my data?

Yes, always back up your data, no matter what solution you use. Just as RAID
!= backup, Ceph != backup either.

> 
> Regards.
> Pedro.



[ceph-users] how possible is that ceph cluster crash

2016-11-16 Thread Pedro Benites

Hi,

I have a ceph cluster with 50 TB, with 15 osds. It has been working fine for
one year and I would like to grow it and migrate all my old storage, about
100 TB, to ceph, but I have a doubt. How likely is it that the cluster fails
and everything goes very bad? How reliable is ceph? What is the risk of
losing my data? Is it necessary to back up my data?


Regards.
Pedro.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com