Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread David Turner


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Nick Fisk
Bluestore will make 2x replicas "safer" to use in theory. Until Bluestore is
in use in the wild, I don't think anyone can give any guarantees.
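For what it's worth, here is a minimal sketch of how one might later verify
that the OSDs really are on Bluestore and that checksumming is active; the OSD
id and the crc32c default are assumptions on my part, not something tested in
this thread:

  # Which object store backend does an OSD report? (assumes an osd.0 exists)
  ceph osd metadata 0 | grep osd_objectstore

  # On that OSD's host, ask the running daemon which checksum algorithm
  # Bluestore uses; crc32c is the expected default, "none" would disable it
  ceph daemon osd.0 config get bluestore_csum_type

With checksums in place, scrub can tell which of the two copies is the corrupt
one instead of guessing.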

 



Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread info
I'm thinking of delaying this project until the Luminous release in order to
have Bluestore support.

So are you telling me that checksum capability will be present in Bluestore,
and that using NVMe with 2x replicas for production data could therefore be
considered possible?



From: "nick" <n...@fisk.me.uk> 
To: "Vy Nguyen Tan" <vynt.kensh...@gmail.com>, i...@witeq.com 
Cc: "ceph-users" <ceph-users@lists.ceph.com> 
Sent: Thursday, June 8, 2017 3:19:20 PM 
Subject: RE: [ceph-users] 2x replica with NVMe 



There are two main concerns with using 2x replicas, recovery speed and coming 
across inconsistent objects. 



With spinning disks their size to access speed means recovery can take a long 
time and increases the chance that additional failures may happen during the 
recovery process. NVME will recover a lot faster and so this risk is greatly 
reduced and means that using 2x replicas may be possible. 



However, with Filestore there are no checksums and so there is no way to 
determine in the event of inconsistent objects, which one is corrupt. So even 
with NVME, I would not feel 100% confident using 2x replicas. With Bluestore 
this problem will go away. 




From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vy 
Nguyen Tan 
Sent: 08 June 2017 13:47 
To: i...@witeq.com 
Cc: ceph-users <ceph-users@lists.ceph.com> 
Subject: Re: [ceph-users] 2x replica with NVMe 





Hi, 





I think that the replica 2x on HDD/SSD are the same. You should read quote from 
Wido bellow: 





" " Hi, 



As a Ceph consultant I get numerous calls throughout the year to help people 
with getting their broken Ceph clusters back online . 

The causes of downtime vary vastly, but one of the biggest causes is that 
people use replication 2x. size = 2, min_size = 1. 

In 2016 the amount of cases I have where data was lost due to these settings 
grew exponentially. 

Usually a disk failed, recovery kicks in and while recovery is happening a 
second disk fails. Causing PGs to become incomplete. 

There have been to many times where I had to use xfs_repair on broken disks and 
use ceph -objectstore-tool to export/import PGs. 

I really don't like these cases, mainly because they can be prevented easily by 
using size = 3 and min_size = 2 for all pools. 

With size = 2 you go into the danger zone as soon as a single disk/daemon 
fails. With size = 3 you always have two additional copies left thus keeping 
your data safe(r). 

If you are running CephFS, at least consider running the 'metadata' pool with 
size = 3 to keep the MDS happy. 

Please, let this be a big warning to everybody who is running with size = 2. 
The downtime and problems caused by missing objects/replicas are usually big 
and it takes days to recover from those. But very often data is lost and/or 
corrupted which causes even more problems. 

I can't stress this enough. Running with size = 2 in production is a SERIOUS 
hazard and should not be done imho. 

To anyone out there running with size = 2, please reconsider this! 

Thanks, 

Wido"" 





On Thu, Jun 8, 2017 at 5:32 PM, < i...@witeq.com > wrote: 




Hi all, 





i'm going to build an all-flash ceph cluster, looking around the existing 
documentation i see lots of guides and and use case scenarios from various 
vendor testing Ceph with replica 2x. 





Now, i'm an old school Ceph user, I always considered 2x replica really 
dangerous for production data, especially when both OSDs can't decide which 
replica is the good one. 


Why all NVMe storage vendor and partners use only 2x replica? 


They claim it's safe because NVMe is better in handling errors, but i usually 
don't trust marketing claims :) 


Is it true? Can someone confirm that NVMe is different compared to HDD and 
therefore replica 2 can be considered safe to be put in production? 





Many Thanks 


Giordano 



___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Nick Fisk
There are two main concerns with using 2x replicas: recovery speed and the
risk of coming across inconsistent objects.

 

With spinning disks, the ratio of capacity to access speed means recovery can
take a long time, which increases the chance that additional failures happen
during the recovery process. NVMe will recover a lot faster, so this risk is
greatly reduced, which means that using 2x replicas may be possible.
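As a rough illustration of keeping that recovery window short and visible, you
can watch the degraded object count and temporarily allow more parallel
recovery; the values below are assumptions for fast media, not settings that
were tested in this thread:

  # Watch how long the cluster stays degraded after a failure
  ceph -s
  ceph health detail

  # Temporarily allow more parallel backfill/recovery on fast OSDs
  # (illustrative values only; revert or tune to taste afterwards)
  ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 4'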

 

However, Filestore has no checksums, so in the event of inconsistent objects
there is no way to determine which copy is corrupt. So even with NVMe, I would
not feel 100% confident using 2x replicas. With Bluestore this problem will go
away.
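To make that concrete, this is roughly what you end up doing when scrub flags
such an object today; the PG id 1.2a below is just a placeholder:

  # Scrub errors surface in cluster health
  ceph health detail

  # See which objects in the flagged PG disagree (1.2a is a placeholder)
  rados list-inconsistent-obj 1.2a --format=json-pretty

  # Repair on Filestore with 2x replicas simply trusts the primary copy,
  # which may itself be the corrupt one; that is exactly the guesswork
  # described above
  ceph pg repair 1.2a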

 



Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Vy Nguyen Tan
Hi,

I think that 2x replication on HDD and SSD is the same. You should read the
quote from Wido below:

""Hi,

As a Ceph consultant I get numerous calls throughout the year to help people
 with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these
settings grew exponentially.

Usually a disk failed, recovery kicks in and while recovery is happening a
second disk fails. Causing PGs to become incomplete.

There have been to many times where I had to use xfs_repair on broken disks
and use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented
easily by using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon
fails. With size = 3 you always have two additional copies left thus
keeping your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool
with size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size =
2. The downtime and problems caused by missing objects/replicas are usually
big and it takes days to recover from those. But very often data is lost
and/or corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a
SERIOUS hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""

On Thu, Jun 8, 2017 at 5:32 PM, <i...@witeq.com> wrote:

> Hi all,
>
> I'm going to build an all-flash Ceph cluster. Looking around the existing
> documentation, I see lots of guides and use-case scenarios from various
> vendors testing Ceph with 2x replication.
>
> Now, I'm an old-school Ceph user; I have always considered 2x replication
> really dangerous for production data, especially when both OSDs can't decide
> which replica is the good one.
> Why do all NVMe storage vendors and partners use only 2x replication?
> They claim it's safe because NVMe is better at handling errors, but I
> usually don't trust marketing claims :)
> Is it true? Can someone confirm that NVMe is different compared to HDD, and
> that 2x replication can therefore be considered safe to put in production?
>
> Many thanks
> Giordano
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com