Re: [ceph-users] how possible is that ceph cluster crash
Seems like that would be helpful. I'm not really familiar with ceph-disk though.
-Sam

On Wed, Nov 23, 2016 at 5:24 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> Hi Sam,
>
> Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype}
> variable be a good idea? It could either strip it out or
> fail to start the OSD unless an override flag is specified somewhere.
>
> Looking at the ceph-disk code, I would imagine around here would be the right
> place to put the check:
> https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642
>
> I don't mind trying to get this done if it's felt to be worthwhile.
>
> Nick
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Samuel Just
>> Sent: 19 November 2016 00:31
>> To: Nick Fisk <n...@fisk.me.uk>
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>>
>> Many reasons:
>>
>> 1) You will eventually get a DC-wide power event anyway, at which point
>> probably most of the OSDs will have hopelessly corrupted internal xfs
>> structures (yes, I have seen this happen to a poor soul with a DC with
>> redundant power).
>> 2) Even in the case of a single rack/node power failure, the biggest danger
>> isn't that the OSDs don't start. It's that they *do start*, but forgot or
>> arbitrarily corrupted a random subset of transactions they told other osds
>> and clients that they committed. The exact impact would be random, but for
>> sure, any guarantees Ceph normally provides would be out the window. RBD
>> devices could have random byte ranges zapped back in time (not great if
>> they're the offsets assigned to your database or fs journal...) for
>> instance.
>> 3) Deliberately powercycling a node counts as a power failure if you don't
>> stop services and sync etc. first.
>>
>> In other words, don't mess with the definition of "committing a
>> transaction" if you value your data.
>> -Sam "just say no" Just
>>
>> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>> > Yes, because these things happen
>> >
>> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
>> >
>> > We had customers who had kit in this DC.
>> >
>> > To use your analogy, it's like crossing the road at traffic lights but
>> > not checking cars have stopped. You might be OK 99% of the time, but
>> > sooner or later it will bite you in the arse and it won't be pretty.
>> >
>> > From: "Brian ::" <b...@iptel.co>
>> > Sent: 18 Nov 2016 11:52 p.m.
>> > To: sj...@redhat.com
>> > Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
>> > Subject: Re: [ceph-users] how possible is that ceph cluster crash
>> >
>> >> This is like your mother telling you not to cross the road when you were
>> >> 4 years of age but not telling you it was because you could be
>> >> flattened by a car :)
>> >>
>> >> Can you expand on your answer? If you are in a DC with AB power,
>> >> redundant UPS, dual feed from the electric company, onsite
>> >> generators, dual PSU servers, is it still a bad idea?
>> >>
>> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>> >>>
>> >>> Never *ever* use nobarrier with ceph under *any* circumstances. I
>> >>> cannot stress this enough.
>> >>> -Sam
>> >>>
>> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi Nick and other Cephers,
>> >>>>
>> >>>> Thanks for your reply.
>> >>>>
>> >>>>> 2) Config Errors
>> >>>>> This can be an easy one to say you are safe from. But I would say
>> >>>>> most outages and data loss incidents I have seen on the mailing
>> >>>>> lists have been due to poor hardware choice or configuring options
>> >>>>> such as size=2, min_size=1 or enabling stuff like nobarriers.
Re: [ceph-users] how possible is that ceph cluster crash
Hi Sam,

Would a check in ceph-disk for "nobarrier" in the osd_mount_options_{fstype} variable be a good idea? It could either strip it out or fail to start the OSD unless an override flag is specified somewhere.

Looking at the ceph-disk code, I would imagine around here would be the right place to put the check:
https://github.com/ceph/ceph/blob/master/src/ceph-disk/ceph_disk/main.py#L2642

I don't mind trying to get this done if it's felt to be worthwhile.

Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Samuel Just
> Sent: 19 November 2016 00:31
> To: Nick Fisk <n...@fisk.me.uk>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>
> Many reasons:
>
> 1) You will eventually get a DC-wide power event anyway, at which point
> probably most of the OSDs will have hopelessly corrupted internal xfs
> structures (yes, I have seen this happen to a poor soul with a DC with
> redundant power).
> 2) Even in the case of a single rack/node power failure, the biggest danger
> isn't that the OSDs don't start. It's that they *do start*, but forgot or
> arbitrarily corrupted a random subset of transactions they told other osds
> and clients that they committed. The exact impact would be random, but for
> sure, any guarantees Ceph normally provides would be out the window. RBD
> devices could have random byte ranges zapped back in time (not great if
> they're the offsets assigned to your database or fs journal...) for
> instance.
> 3) Deliberately powercycling a node counts as a power failure if you don't
> stop services and sync etc. first.
>
> In other words, don't mess with the definition of "committing a
> transaction" if you value your data.
> -Sam "just say no" Just
>
> On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> > Yes, because these things happen
> >
> > http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
> >
> > We had customers who had kit in this DC.
> >
> > To use your analogy, it's like crossing the road at traffic lights but
> > not checking cars have stopped. You might be OK 99% of the time, but
> > sooner or later it will bite you in the arse and it won't be pretty.
> >
> > ____________
> > From: "Brian ::" <b...@iptel.co>
> > Sent: 18 Nov 2016 11:52 p.m.
> > To: sj...@redhat.com
> > Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
> > Subject: Re: [ceph-users] how possible is that ceph cluster crash
> >
> >> This is like your mother telling you not to cross the road when you were
> >> 4 years of age but not telling you it was because you could be
> >> flattened by a car :)
> >>
> >> Can you expand on your answer? If you are in a DC with AB power,
> >> redundant UPS, dual feed from the electric company, onsite
> >> generators, dual PSU servers, is it still a bad idea?
> >>
> >> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
> >>>
> >>> Never *ever* use nobarrier with ceph under *any* circumstances. I
> >>> cannot stress this enough.
> >>> -Sam
> >>>
> >>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
> >>> wrote:
> >>>>
> >>>> Hi Nick and other Cephers,
> >>>>
> >>>> Thanks for your reply.
> >>>>
> >>>>> 2) Config Errors
> >>>>> This can be an easy one to say you are safe from.
> >>>>> But I would say most outages and data loss incidents I have seen on
> >>>>> the mailing lists have been due to poor hardware choice or
> >>>>> configuring options such as size=2, min_size=1 or enabling stuff
> >>>>> like nobarriers.
> >>>>
> >>>> I am wondering about the pros and cons of the nobarrier option used
> >>>> by Ceph.
> >>>>
> >>>> It is well known that nobarrier is dangerous when a power outage
> >>>> happens, but if we already have replicas in different racks or
> >>>> PDUs, will Ceph reduce the risk of data loss with this option?
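Nick's proposed guard could look roughly like the sketch below. This is illustrative only, not actual ceph-disk code: the function name, the exception, and the `allow_nobarrier` override flag are all assumptions about how such a check might be wired in.

```python
# Illustrative sketch of the check proposed above -- not actual ceph-disk
# code; the names and the override flag are hypothetical.

class UnsafeMountOptions(Exception):
    """Raised when an OSD mount option is known to endanger data."""

def sanitize_mount_options(options, allow_nobarrier=False):
    """Inspect an fstab-style option string (e.g. from
    osd_mount_options_xfs) and refuse 'nobarrier' unless an explicit
    override flag was given."""
    opts = [o.strip() for o in options.split(',') if o.strip()]
    if 'nobarrier' in opts and not allow_nobarrier:
        raise UnsafeMountOptions(
            "refusing to mount OSD with 'nobarrier': committed writes "
            "can be silently lost on power failure")
    return ','.join(opts)

print(sanitize_mount_options("rw,noatime"))          # unchanged
print(sanitize_mount_options("rw,noatime,nobarrier",
                             allow_nobarrier=True))  # permitted, at the
                                                     # operator's own risk
```

The other behaviour Nick mentions, silently stripping the option, would just be an `opts.remove('nobarrier')` before the join, though failing loudly seems safer.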
Re: [ceph-users] how possible is that ceph cluster crash
Hi Lionel,

Mega ouch - I've recently seen the act of measuring power consumption in a data centre (they clamp a probe onto the cable for an amp reading, seemingly) take out a cabinet which had *redundant* power feeds - so anything is possible, I guess.

Regards
Brian

On Sat, Nov 19, 2016 at 11:20 AM, Lionel Bouton wrote:
> On 19/11/2016 at 00:52, Brian :: wrote:
>> This is like your mother telling you not to cross the road when you were 4
>> years of age but not telling you it was because you could be flattened
>> by a car :)
>>
>> Can you expand on your answer? If you are in a DC with AB power,
>> redundant UPS, dual feed from the electric company, onsite generators,
>> dual PSU servers, is it still a bad idea?
>
> Yes it is.
>
> In such a datacenter where we have a Ceph cluster, there was a complete
> shutdown because of a design error: the probes used by the system
> responsible for starting and stopping the generators were installed
> upstream of the breakers on the feeds. After a blackout where the
> generators kicked in, the breakers opened due to a surge when power was
> restored. The generators were stopped because power was restored, and
> the UPS systems failed 3 minutes later. Closing the breakers couldn't be
> done in time (you don't approach them without being heavily protected;
> putting on the suit to protect you takes more time than simply closing
> the breaker).
>
> There's no such thing as uninterruptible power supply.
>
> Best regards,
>
> Lionel

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] how possible is that ceph cluster crash
On 19/11/2016 at 00:52, Brian :: wrote:
> This is like your mother telling you not to cross the road when you were 4
> years of age but not telling you it was because you could be flattened
> by a car :)
>
> Can you expand on your answer? If you are in a DC with AB power,
> redundant UPS, dual feed from the electric company, onsite generators,
> dual PSU servers, is it still a bad idea?

Yes it is.

In such a datacenter where we have a Ceph cluster, there was a complete shutdown because of a design error: the probes used by the system responsible for starting and stopping the generators were installed upstream of the breakers on the feeds. After a blackout where the generators kicked in, the breakers opened due to a surge when power was restored. The generators were stopped because power was restored, and the UPS systems failed 3 minutes later. Closing the breakers couldn't be done in time (you don't approach them without being heavily protected; putting on the suit to protect you takes more time than simply closing the breaker).

There's no such thing as uninterruptible power supply.

Best regards,

Lionel
Re: [ceph-users] how possible is that ceph cluster crash
Many reasons:

1) You will eventually get a DC-wide power event anyway, at which point probably most of the OSDs will have hopelessly corrupted internal xfs structures (yes, I have seen this happen to a poor soul with a DC with redundant power).
2) Even in the case of a single rack/node power failure, the biggest danger isn't that the OSDs don't start. It's that they *do start*, but forgot or arbitrarily corrupted a random subset of transactions they told other osds and clients that they committed. The exact impact would be random, but for sure, any guarantees Ceph normally provides would be out the window. RBD devices could have random byte ranges zapped back in time (not great if they're the offsets assigned to your database or fs journal...) for instance.
3) Deliberately powercycling a node counts as a power failure if you don't stop services and sync etc. first.

In other words, don't mess with the definition of "committing a transaction" if you value your data.

-Sam "just say no" Just

On Fri, Nov 18, 2016 at 4:04 PM, Nick Fisk <n...@fisk.me.uk> wrote:
> Yes, because these things happen
>
> http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/
>
> We had customers who had kit in this DC.
>
> To use your analogy, it's like crossing the road at traffic lights but not
> checking cars have stopped. You might be OK 99% of the time, but sooner or
> later it will bite you in the arse and it won't be pretty.
>
> ____________
> From: "Brian ::" <b...@iptel.co>
> Sent: 18 Nov 2016 11:52 p.m.
> To: sj...@redhat.com
> Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
> Subject: Re: [ceph-users] how possible is that ceph cluster crash
>
>> This is like your mother telling you not to cross the road when you were 4
>> years of age but not telling you it was because you could be flattened
>> by a car :)
>>
>> Can you expand on your answer? If you are in a DC with AB power,
>> redundant UPS, dual feed from the electric company, onsite generators,
>> dual PSU servers, is it still a bad idea?
>>
>> On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
>>>
>>> Never *ever* use nobarrier with ceph under *any* circumstances. I
>>> cannot stress this enough.
>>> -Sam
>>>
>>> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com>
>>> wrote:
>>>>
>>>> Hi Nick and other Cephers,
>>>>
>>>> Thanks for your reply.
>>>>
>>>>> 2) Config Errors
>>>>> This can be an easy one to say you are safe from. But I would say most
>>>>> outages and data loss incidents I have seen on the mailing lists have
>>>>> been due to poor hardware choice or configuring options such as
>>>>> size=2, min_size=1 or enabling stuff like nobarriers.
>>>>
>>>> I am wondering about the pros and cons of the nobarrier option used by
>>>> Ceph.
>>>>
>>>> It is well known that nobarrier is dangerous when a power outage
>>>> happens, but if we already have replicas in different racks or PDUs,
>>>> will Ceph reduce the risk of data loss with this option?
>>>>
>>>> I have seen many performance tuning articles recommending the nobarrier
>>>> option for xfs, but not many of them mention the trade-off of
>>>> nobarrier.
>>>>
>>>> Is it really unacceptable to use nobarrier in a production environment?
>>>> I will be very grateful if you guys are willing to share any
>>>> experiences about nobarrier and xfs.
>>>>
>>>> Sincerely,
>>>> Craig Chi (Product Developer)
>>>> Synology Inc. Taipei, Taiwan. Ext. 361
>>>>
>>>> On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>>> Of Pedro Benites
>>>>> Sent: 16 November 2016 17:51
>>>>> To: ceph-users@lists.ceph.com
>>>>> Subject: [ceph-users] how possible is that ceph cluster crash
>>>>>
>>>>> Hi,
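Sam's point about "committing a transaction" can be made concrete. Durability hinges on fsync: a store may only acknowledge a write after fsync returns, and fsync in turn relies on the filesystem issuing barriers (cache flushes) to the drive - exactly what nobarrier disables. A minimal sketch of the commit pattern follows; it is illustrative only, not Ceph's journal code.

```python
import os
import tempfile

def commit_record(path, payload):
    """Append a record and make it durable before reporting success.
    With barriers enabled, os.fsync() returns only after the data has
    been flushed through the drive's volatile write cache; with
    nobarrier, that flush can be skipped, so a record we already
    acknowledged may vanish on power loss."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # the commit point: only now may we ack the client
    finally:
        os.close(fd)

path = os.path.join(tempfile.gettempdir(), "journal.bin")
commit_record(path, b"txn-42\n")
```

Everything Ceph promises to its peers and clients sits on top of that one guarantee, which is why Sam treats weakening it as non-negotiable.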
Re: [ceph-users] how possible is that ceph cluster crash
Yes, because these things happen:

http://www.theregister.co.uk/2016/11/15/memset_power_cut_service_interruption/

We had customers who had kit in this DC.

To use your analogy, it's like crossing the road at traffic lights but not checking cars have stopped. You might be OK 99% of the time, but sooner or later it will bite you in the arse and it won't be pretty.

From: "Brian ::" <b...@iptel.co>
Sent: 18 Nov 2016 11:52 p.m.
To: sj...@redhat.com
Cc: Craig Chi; ceph-users@lists.ceph.com; Nick Fisk
Subject: Re: [ceph-users] how possible is that ceph cluster crash

This is like your mother telling you not to cross the road when you were 4 years of age but not telling you it was because you could be flattened by a car :)

Can you expand on your answer? If you are in a DC with AB power, redundant UPS, dual feed from the electric company, onsite generators, dual PSU servers, is it still a bad idea?

On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:

Never *ever* use nobarrier with ceph under *any* circumstances. I cannot stress this enough.
-Sam

On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com> wrote:

Hi Nick and other Cephers,

Thanks for your reply.

2) Config Errors
This can be an easy one to say you are safe from. But I would say most outages and data loss incidents I have seen on the mailing lists have been due to poor hardware choice or configuring options such as size=2, min_size=1 or enabling stuff like nobarriers.

I am wondering about the pros and cons of the nobarrier option used by Ceph.

It is well known that nobarrier is dangerous when a power outage happens, but if we already have replicas in different racks or PDUs, will Ceph reduce the risk of data loss with this option?
I have seen many performance tuning articles recommending the nobarrier option for xfs, but not many of them mention the trade-off of nobarrier.

Is it really unacceptable to use nobarrier in a production environment? I will be very grateful if you guys are willing to share any experiences about nobarrier and xfs.

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan. Ext. 361

On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Pedro Benites
Sent: 16 November 2016 17:51
To: ceph-users@lists.ceph.com
Subject: [ceph-users] how possible is that ceph cluster crash

Hi,

I have a ceph cluster with 50 TB, with 15 osds; it has been working fine for one year and I would like to grow it and migrate all my old storage, about 100 TB, to ceph, but I have a doubt. How possible is it that the cluster fails and everything goes very bad?

Everything is possible; I think there are 3 main risks:

1) Hardware failure
I would say Ceph is probably one of the safest options in regards to hardware failures, certainly if you start using 4TB+ disks.

2) Config Errors
This can be an easy one to say you are safe from. But I would say most outages and data loss incidents I have seen on the mailing lists have been due to poor hardware choice or configuring options such as size=2, min_size=1 or enabling stuff like nobarriers.

3) Ceph Bugs
Probably the rarest, but potentially the most scary as you have less control. They do happen and it's something to be aware of.

How reliable is ceph?
What is the risk of losing my data? Is it necessary to back up my data?

Yes, always back up your data, no matter what solution you use. Just as RAID != backup, neither is Ceph.

Regards.
Pedro.
Re: [ceph-users] how possible is that ceph cluster crash
This is like your mother telling you not to cross the road when you were 4 years of age but not telling you it was because you could be flattened by a car :)

Can you expand on your answer? If you are in a DC with AB power, redundant UPS, dual feed from the electric company, onsite generators, dual PSU servers, is it still a bad idea?

On Fri, Nov 18, 2016 at 6:52 PM, Samuel Just <sj...@redhat.com> wrote:
> Never *ever* use nobarrier with ceph under *any* circumstances. I
> cannot stress this enough.
> -Sam
>
> On Fri, Nov 18, 2016 at 10:39 AM, Craig Chi <craig...@synology.com> wrote:
>> Hi Nick and other Cephers,
>>
>> Thanks for your reply.
>>
>>> 2) Config Errors
>>> This can be an easy one to say you are safe from. But I would say most
>>> outages and data loss incidents I have seen on the mailing lists have
>>> been due to poor hardware choice or configuring options such as
>>> size=2, min_size=1 or enabling stuff like nobarriers.
>>
>> I am wondering about the pros and cons of the nobarrier option used by Ceph.
>>
>> It is well known that nobarrier is dangerous when a power outage happens,
>> but if we already have replicas in different racks or PDUs, will Ceph
>> reduce the risk of data loss with this option?
>>
>> I have seen many performance tuning articles recommending the nobarrier
>> option for xfs, but not many of them mention the trade-off of nobarrier.
>>
>> Is it really unacceptable to use nobarrier in a production environment? I
>> will be very grateful if you guys are willing to share any experiences
>> about nobarrier and xfs.
>>
>> Sincerely,
>> Craig Chi (Product Developer)
>> Synology Inc. Taipei, Taiwan. Ext. 361
>>
>> On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:
>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Pedro Benites
>>> Sent: 16 November 2016 17:51
>>> To: ceph-users@lists.ceph.com
>>> Subject: [ceph-users] how possible is that ceph cluster crash
>>>
>>> Hi,
>>>
>>> I have a ceph cluster with 50 TB, with 15 osds; it has been working fine
>>> for one year and I would like to grow it and migrate all my old storage,
>>> about 100 TB, to ceph, but I have a doubt. How possible is it that the
>>> cluster fails and everything goes very bad?
>>
>> Everything is possible; I think there are 3 main risks:
>>
>> 1) Hardware failure
>> I would say Ceph is probably one of the safest options in regards to
>> hardware failures, certainly if you start using 4TB+ disks.
>>
>> 2) Config Errors
>> This can be an easy one to say you are safe from. But I would say most
>> outages and data loss incidents I have seen on the mailing lists have
>> been due to poor hardware choice or configuring options such as
>> size=2, min_size=1 or enabling stuff like nobarriers.
>>
>> 3) Ceph Bugs
>> Probably the rarest, but potentially the most scary as you have less
>> control. They do happen and it's something to be aware of.
>>
>>> How reliable is ceph?
>>> What is the risk of losing my data? Is it necessary to back up my data?
>>
>> Yes, always back up your data, no matter what solution you use. Just as
>> RAID != backup, neither is Ceph.
>>
>>> Regards.
>>> Pedro.
Re: [ceph-users] how possible is that ceph cluster crash
Hi Nick and other Cephers,

Thanks for your reply.

> 2) Config Errors
> This can be an easy one to say you are safe from. But I would say most
> outages and data loss incidents I have seen on the mailing lists have been
> due to poor hardware choice or configuring options such as size=2,
> min_size=1 or enabling stuff like nobarriers.

I am wondering about the pros and cons of the nobarrier option used by Ceph.

It is well known that nobarrier is dangerous when a power outage happens, but if we already have replicas in different racks or PDUs, will Ceph reduce the risk of data loss with this option?

I have seen many performance tuning articles recommending the nobarrier option for xfs, but not many of them mention the trade-off of nobarrier.

Is it really unacceptable to use nobarrier in a production environment? I will be very grateful if you guys are willing to share any experiences about nobarrier and xfs.

Sincerely,
Craig Chi (Product Developer)
Synology Inc. Taipei, Taiwan. Ext. 361

On 2016-11-17 05:04, Nick Fisk <n...@fisk.me.uk> wrote:

>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Pedro Benites
>> Sent: 16 November 2016 17:51
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users] how possible is that ceph cluster crash
>>
>> Hi,
>>
>> I have a ceph cluster with 50 TB, with 15 osds; it has been working fine
>> for one year and I would like to grow it and migrate all my old storage,
>> about 100 TB, to ceph, but I have a doubt. How possible is it that the
>> cluster fails and everything goes very bad?
>
> Everything is possible; I think there are 3 main risks:
>
> 1) Hardware failure
> I would say Ceph is probably one of the safest options in regards to
> hardware failures, certainly if you start using 4TB+ disks.
>
> 2) Config Errors
> This can be an easy one to say you are safe from.
> But I would say most outages and data loss incidents I have seen on the
> mailing lists have been due to poor hardware choice or configuring options
> such as size=2, min_size=1 or enabling stuff like nobarriers.
>
> 3) Ceph Bugs
> Probably the rarest, but potentially the most scary as you have less
> control. They do happen and it's something to be aware of.
>
>> How reliable is ceph?
>> What is the risk of losing my data? Is it necessary to back up my data?
>
> Yes, always back up your data, no matter what solution you use. Just as
> RAID != backup, neither is Ceph.
>
>> Regards.
>> Pedro.
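Whatever the verdict on Craig's trade-off question, it is worth at least knowing where nobarrier is currently in effect. A small sketch that scans /proc/mounts-format text for it; the device paths in the sample are made up for the example:

```python
def find_nobarrier_mounts(mounts_text):
    """Return (device, mountpoint) pairs whose mount options include
    'nobarrier', given text in /proc/mounts format."""
    risky = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, fstype, options = fields[:4]
        if 'nobarrier' in options.split(','):
            risky.append((device, mountpoint))
    return risky

# Hypothetical sample in /proc/mounts format:
sample = ("/dev/sdb1 /var/lib/ceph/osd/ceph-0 xfs rw,noatime,nobarrier 0 0\n"
          "/dev/sdc1 /var/lib/ceph/osd/ceph-1 xfs rw,noatime 0 0\n")
print(find_nobarrier_mounts(sample))
# -> [('/dev/sdb1', '/var/lib/ceph/osd/ceph-0')]

# On a live system you would feed it the real file:
# with open('/proc/mounts') as f:
#     print(find_nobarrier_mounts(f.read()))
```

Run across all OSD hosts, this gives a quick inventory of where the option has crept in from old tuning guides.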
Re: [ceph-users] how possible is that ceph cluster crash
Olá Pedro,

These are extremely generic questions and, therefore, hard to answer. Nick did a good job of defining the risks.

In our case, we have been running a Ceph/CephFS system in production for over a year, and before that we spent a year trying to understand Ceph. Ceph is incredibly good at dealing with hardware failures, so it is a powerful tool if you are using commodity hardware. If your disks fail, or even if a fraction of your hosts fail, it is able to cope and recover properly (up to a given extent) if you have the proper crush rules in place (the default ones do a good job on that) and free space available.

To be on the safe side:
- decouple mons from osd servers
- check the RAM requirements for your osd servers (they depend on the number of osds in each server)
- have, at least, 3 mons in a production system
- use a 3x replica

There is a good info page on hardware requirements in the ceph wikis.

However, the devil is in the details. Ceph is a complex system still in permanent development. Wrong configurations might lead to performance problems. If your network is not reliable, that might lead to flapping osds, which in turn might lead to problems in your pgs. When your osds start to become full (a single full osd freezes all I/O to the cluster), many problems may start to appear.

Finally, there are bugs. Their number is not huge, and there is a real effort from the developers and from the community to address them in a fast and reliable way. However, sometimes it is difficult to diagnose what could be wrong because of the many layers involved. It is not infrequent that we have to go and look at the source code to figure out (when possible) what may be happening.

So, I would say that there is a learning curve that myself and others are still going through.
Abraço
Gonçalo

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Pedro Benites [pbeni...@litholaser.com]
Sent: 17 November 2016 04:50
To: ceph-users@lists.ceph.com
Subject: [ceph-users] how possible is that ceph cluster crash

Hi,

I have a ceph cluster with 50 TB, with 15 osds; it has been working fine for one year and I would like to grow it and migrate all my old storage, about 100 TB, to ceph, but I have a doubt. How possible is it that the cluster fails and everything goes very bad? How reliable is ceph? What is the risk of losing my data? Is it necessary to back up my data?

Regards.
Pedro.
Re: [ceph-users] how possible is that ceph cluster crash
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Pedro Benites
> Sent: 16 November 2016 17:51
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] how possible is that ceph cluster crash
>
> Hi,
>
> I have a ceph cluster with 50 TB, with 15 osds; it has been working fine
> for one year and I would like to grow it and migrate all my old storage,
> about 100 TB, to ceph, but I have a doubt. How possible is it that the
> cluster fails and everything goes very bad?

Everything is possible; I think there are 3 main risks:

1) Hardware failure
I would say Ceph is probably one of the safest options in regards to hardware failures, certainly if you start using 4TB+ disks.

2) Config Errors
This can be an easy one to say you are safe from. But I would say most outages and data loss incidents I have seen on the mailing lists have been due to poor hardware choice or configuring options such as size=2, min_size=1 or enabling stuff like nobarriers.

3) Ceph Bugs
Probably the rarest, but potentially the most scary as you have less control. They do happen and it's something to be aware of.

> How reliable is ceph?
> What is the risk of losing my data? Is it necessary to back up my data?

Yes, always back up your data, no matter what solution you use. Just as RAID != backup, neither is Ceph.

> Regards.
> Pedro.
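The size=2/min_size=1 pitfall Nick lists under "Config Errors" can be audited mechanically. A hedged sketch follows; the pool values in it are illustrative, and on a real cluster you would pull them from `ceph osd pool ls detail` (or `ceph osd pool get <pool> size`) rather than hard-code them:

```python
def audit_pool(size, min_size):
    """Flag replication settings repeatedly implicated in mailing-list
    data-loss reports: size=2/min_size=1 means one lagging replica plus
    one failure can silently lose acknowledged writes."""
    warnings = []
    if size < 3:
        warnings.append("size=%d: fewer than 3 replicas" % size)
    if min_size < 2:
        warnings.append("min_size=%d: I/O is accepted with a single "
                        "surviving copy" % min_size)
    return warnings

# Illustrative pool settings (name -> (size, min_size)):
pools = {"rbd": (2, 1), "cephfs_data": (3, 2)}
for name, (size, min_size) in sorted(pools.items()):
    for warning in audit_pool(size, min_size):
        print("pool %s: %s" % (name, warning))
```

A check like this belongs in whatever config-management or monitoring you already run, so the risky setting is caught before an outage rather than diagnosed after one.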
[ceph-users] how possible is that ceph cluster crash
Hi,

I have a ceph cluster with 50 TB, with 15 osds; it has been working fine for one year and I would like to grow it and migrate all my old storage, about 100 TB, to ceph, but I have a doubt. How possible is it that the cluster fails and everything goes very bad? How reliable is ceph? What is the risk of losing my data? Is it necessary to back up my data?

Regards.
Pedro.