Re: [Openstack-operators] Openstack and Ceph
Another question is what type of SSDs you are using. There is a big difference not only between SSD vendors but also between sizes of the same model, since the internals make a big difference in how the OS interacts with them. This link is still very useful today: https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

On Fri, Feb 17, 2017 at 12:54 PM, Alex Hübner wrote:
> Are these nodes connected to dedicated or shared (in the sense that other workloads are running) network switches? How fast (1G, 10G or faster) are the interfaces? Also, how much RAM are you using? There's a rule of thumb that says you should dedicate at least 1 GB of RAM for each 1 TB of raw disk space. How are the clients consuming the storage? Are they virtual machines? Are you using iSCSI to connect them? Are these clients the same ones you're testing against your regular SAN storage, and are they positioned in a similar fashion (i.e., over a steady network channel)? What Ceph version are you using?
>
> Finally, replicas are normally faster than erasure coding, so you're good there. It's *never* a good idea to enable RAID cache, even when it apparently improves IOPS (the magic of Ceph relies on the cluster, its network and the number of nodes; don't approach the nodes as if they were isolated storage servers). Also, RAID0 should only be used as a last resort for cases where the disk controller doesn't offer JBOD mode.
>
> []'s
> Hubner
>
> On Fri, Feb 17, 2017 at 7:19 AM, Vahric Muhtaryan wrote:
>> Hello All,
>>
>> First, thanks for your answers. Looks like everybody is a Ceph lover :)
>>
>> I believe you have already made some tests and have some results, because until now we used traditional storage like IBM V7000, XIV or NetApp, and we have been very happy to get good IOPS and to provide the same performance to all instances.
>>
>> We saw that each OSD eats a lot of CPU, and when multiple clients try to get the same performance from Ceph it looks like it's not possible; Ceph shares everything across clients and we cannot reach the hardware's raw IOPS capacity with Ceph. For example, each SSD can do 90K IOPS, we have three per node and six nodes, so we should get better results than what we have now!
>>
>> Could you please share your hardware configs and IOPS tests, and advise whether our expectations are correct or not?
>>
>> We are using Kraken, almost all debug options are set to 0/0, and we modified op_tracker and some other ops-based configs too.
>>
>> Our Hardware
>>
>> 6 x Node
>> Each Node Has:
>> 2 socket Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz, 16 cores total, HT enabled
>> 3 SSD + 12 HDD (SSDs are in journal mode, 4 HDDs to each SSD)
>> Each disk configured as RAID0 (we did not see any performance difference with the RAID card's JBOD mode, so we continued with RAID0)
>> The RAID card's write-back cache is also used, because it adds extra IOPS!
>>
>> Our Test
>>
>> It's 100% random write.
>> The Ceph pool is configured with 3 replicas. (We did not use 2 because at failover time the whole system got stuck and we couldn't work out great tuning for it; some reading said that under high load OSDs can go down and come up again, and we should care about this too!)
>>
>> Test Command: fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=1G --numjobs=8 --readwrite=randwrite --group_reporting
>>
>> Achieved IOPS: 35K (single client)
>> We tested up to 10 clients, across which Ceph fairly shared this capacity, almost 4K each.
>>
>> Thanks
>> Regards
>> Vahric Muhtaryan

___
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
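The gap between raw SSD IOPS and the 35K achieved in the thread can be roughly bounded before any tuning. A back-of-the-envelope sketch, assuming filestore journaling (which writes each OSD write twice: journal plus data store) and 3x replication; note it deliberately ignores the HDD data path, which in the poster's SSD-journal/HDD-data layout is the real sustained bottleneck, so treat the result as a ceiling, not a prediction:

```python
# Rough upper bound on client write IOPS for a replicated filestore cluster.
# Assumes: every client write becomes `replicas` OSD writes, and filestore
# journaling doubles each OSD write. Real clusters lose more to CPU,
# network round-trips and lock contention.

def client_write_iops_ceiling(nodes, ssds_per_node, iops_per_ssd,
                              replicas=3, journal_write_factor=2):
    raw = nodes * ssds_per_node * iops_per_ssd
    return raw // (replicas * journal_write_factor)

# Figures from the thread: 6 nodes, 3 SSDs each, ~90K write IOPS per SSD.
print(client_write_iops_ceiling(6, 3, 90_000))  # -> 270000
```

Even this optimistic ceiling is far below the naive 1.62M raw sum, which is one reason raw-SSD numbers never survive contact with a replicated cluster.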
Re: [Openstack-operators] Openstack and Ceph
Are these nodes connected to dedicated or shared (in the sense that other workloads are running) network switches? How fast (1G, 10G or faster) are the interfaces? Also, how much RAM are you using? There's a rule of thumb that says you should dedicate at least 1 GB of RAM for each 1 TB of raw disk space. How are the clients consuming the storage? Are they virtual machines? Are you using iSCSI to connect them? Are these clients the same ones you're testing against your regular SAN storage, and are they positioned in a similar fashion (i.e., over a steady network channel)? What Ceph version are you using?

Finally, replicas are normally faster than erasure coding, so you're good there. It's *never* a good idea to enable RAID cache, even when it apparently improves IOPS (the magic of Ceph relies on the cluster, its network and the number of nodes; don't approach the nodes as if they were isolated storage servers). Also, RAID0 should only be used as a last resort for cases where the disk controller doesn't offer JBOD mode.

[]'s
Hubner

On Fri, Feb 17, 2017 at 7:19 AM, Vahric Muhtaryan wrote:
> Hello All,
>
> First, thanks for your answers. Looks like everybody is a Ceph lover :)
>
> I believe you have already made some tests and have some results, because until now we used traditional storage like IBM V7000, XIV or NetApp, and we have been very happy to get good IOPS and to provide the same performance to all instances.
>
> We saw that each OSD eats a lot of CPU, and when multiple clients try to get the same performance from Ceph it looks like it's not possible; Ceph shares everything across clients and we cannot reach the hardware's raw IOPS capacity with Ceph. For example, each SSD can do 90K IOPS, we have three per node and six nodes, so we should get better results than what we have now!
>
> Could you please share your hardware configs and IOPS tests, and advise whether our expectations are correct or not?
>
> We are using Kraken, almost all debug options are set to 0/0, and we modified op_tracker and some other ops-based configs too.
>
> Our Hardware
>
> 6 x Node
> Each Node Has:
> 2 socket Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz, 16 cores total, HT enabled
> 3 SSD + 12 HDD (SSDs are in journal mode, 4 HDDs to each SSD)
> Each disk configured as RAID0 (we did not see any performance difference with the RAID card's JBOD mode, so we continued with RAID0)
> The RAID card's write-back cache is also used, because it adds extra IOPS!
>
> Our Test
>
> It's 100% random write.
> The Ceph pool is configured with 3 replicas. (We did not use 2 because at failover time the whole system got stuck and we couldn't work out great tuning for it; some reading said that under high load OSDs can go down and come up again, and we should care about this too!)
>
> Test Command: fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=1G --numjobs=8 --readwrite=randwrite --group_reporting
>
> Achieved IOPS: 35K (single client)
> We tested up to 10 clients, across which Ceph fairly shared this capacity, almost 4K each.
>
> Thanks
> Regards
> Vahric Muhtaryan
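The 1 GB of RAM per 1 TB of raw disk rule of thumb above is easy to turn into a per-node number. A minimal sketch; the drive sizes are hypothetical, since the thread never states them:

```python
def min_ram_gb(hdds, hdd_tb, ssds=0, ssd_tb=0.0):
    """1 GB of RAM per TB of raw disk, per the rule of thumb in the thread."""
    return hdds * hdd_tb + ssds * ssd_tb

# Hypothetical: 12 HDDs per node as in the poster's layout, size assumed 4 TB.
print(min_ram_gb(12, 4))  # -> 48
```

In practice OSDs also need headroom during recovery and backfill, so the rule is a floor, not a target.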
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
Hi Adam,

I agree somewhat; capacity management and growth at scale is something of a pain. Ceph gives you a hugely powerful and flexible way to manage data placement through CRUSH, but there is very little quality info about, or examples of, non-naive crushmap configurations.

I think I understand what you are getting at with regard to failure domains, e.g., a large cluster of 1000+ drives may require a single storage pool (e.g., for Nova) across most or all of that storage. The chance of overlapping drive failures (overlapping meaning before recovery has completed) in multiple nodes is higher the more drives there are in the pool, unless you design your crushmap to limit the size of any replica-domain (i.e., the leaf CRUSH bucket that a single copy of an object may end up in). And in the RBD use case, if you are unlucky and lose even just a tiny fraction of objects, due to random placement there is a good chance you have lost a handful of objects from most or all RBD volumes in the cluster, which could make for many unhappy users with potentially unrecoverable filesystems in those RBDs.

The guys at UnitedStack did a nice presentation that touched on this a while back (http://www.slideshare.net/kioecn/build-an-highperformance-and-highdurable-block-storage-service-based-on-ceph), but I'm not sure I follow their durability model just from these slides, and if you're going to play with this you really do want a tool to calculate/simulate the impact of the changes.

Interesting discussion - maybe loop in ceph-users?

Cheers,

On 14 October 2016 at 19:53, Adam Kijak <adam.ki...@corp.ovh.com> wrote:
>> From: Clint Byrum <cl...@fewbar.com>
>> Sent: Wednesday, October 12, 2016 10:46 PM
>> To: openstack-operators
>> Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
>>
>> Excerpts from Adam Kijak's message of 2016-10-12 12:23:41 +:
>> > > From: Xav Paice <xavpa...@gmail.com>
>> > > Sent: Monday, October 10, 2016 8:41 PM
>> > > To: openstack-operators@lists.openstack.org
>> > > Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
>> > >
>> > > I'm really keen to hear more about those limitations.
>> >
>> > Basically it's all related to the failure domain ("blast radius") and risk management. A bigger Ceph cluster means more users.
>>
>> Are these risks well documented? Since Ceph is specifically designed _not_ to have the kind of large blast radius that one might see with, say, a centralized SAN, I'm curious to hear what events trigger cluster-wide blasts.
>
> In theory yes, Ceph is designed to be fault tolerant, but from our experience it's not always like that. I think it's not well documented, but I know this case: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg32804.html
>
>> > Growing the Ceph cluster temporarily slows it down, so many users will be affected.
>>
>> One might say that a Ceph cluster that can't be grown without the users noticing is an over-subscribed Ceph cluster. My understanding is that one is always advised to provision a certain amount of cluster capacity for growing and replicating to replaced drives.
>
> I agree that provisioning a fixed-size cluster would solve some problems, but planning the capacity is not always easy. Predicting the size and making it cost effective (an empty big Ceph cluster costs a lot at the beginning) is quite difficult. Also, adding a new Ceph cluster will always be more transparent to users than manipulating an existing one (especially when growing pool PGs).

--
Cheers,
~Blairo
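One way to bound the replica-domain Blair describes is a CRUSH rule that pins each copy under a different rack and fans out only below it. A sketch in the Kraken-era crushmap syntax; the bucket names are hypothetical, and you would decompile your own map with `crushtool -d` before editing anything like this:

```
rule rack_limited {
    ruleset 1
    type replicated
    min_size 3
    max_size 3
    step take default
    # pick 3 distinct racks, then exactly one host (and one OSD on it)
    # inside each, so no two replicas of a PG ever share a rack
    step choose firstn 3 type rack
    step chooseleaf firstn 1 type host
    step emit
}
```

Simulating the resulting mappings (e.g., with `crushtool --test --show-mappings`) before injecting the map is strongly advisable, which is exactly the calculate/simulate tooling gap Blair points out.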
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
> From: Clint Byrum <cl...@fewbar.com>
> Sent: Wednesday, October 12, 2016 10:46 PM
> To: openstack-operators
> Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
>
> Excerpts from Adam Kijak's message of 2016-10-12 12:23:41 +:
> > > From: Xav Paice <xavpa...@gmail.com>
> > > Sent: Monday, October 10, 2016 8:41 PM
> > > To: openstack-operators@lists.openstack.org
> > > Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
> > >
> > > I'm really keen to hear more about those limitations.
> >
> > Basically it's all related to the failure domain ("blast radius") and risk management. A bigger Ceph cluster means more users.
>
> Are these risks well documented? Since Ceph is specifically designed _not_ to have the kind of large blast radius that one might see with, say, a centralized SAN, I'm curious to hear what events trigger cluster-wide blasts.

In theory yes, Ceph is designed to be fault tolerant, but from our experience it's not always like that. I think it's not well documented, but I know this case: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg32804.html

> > Growing the Ceph cluster temporarily slows it down, so many users will be affected.
>
> One might say that a Ceph cluster that can't be grown without the users noticing is an over-subscribed Ceph cluster. My understanding is that one is always advised to provision a certain amount of cluster capacity for growing and replicating to replaced drives.

I agree that provisioning a fixed-size cluster would solve some problems, but planning the capacity is not always easy. Predicting the size and making it cost effective (an empty big Ceph cluster costs a lot at the beginning) is quite difficult. Also, adding a new Ceph cluster will always be more transparent to users than manipulating an existing one (especially when growing pool PGs).
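The "growing pool PGs" pain mentioned above is one reason to size pg_num with headroom up front. The usual heuristic (target roughly 100 PGs per OSD, divided by the replica count, rounded up to a power of two) can be sketched; the OSD counts below are hypothetical:

```python
def suggested_pg_num(osds, replicas=3, target_pgs_per_osd=100):
    """Heuristic pg_num sizing, per the common pgcalc-style guidance."""
    raw = osds * target_pgs_per_osd / replicas
    # round up to the next power of two
    pg = 1
    while pg < raw:
        pg *= 2
    return pg

# Hypothetical cluster: 72 OSDs, 3x replication.
print(suggested_pg_num(72))  # -> 4096
```

Sizing this once, generously, avoids the disruptive pg_num increases on a live pool that make growing an existing cluster visible to users.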
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
> From: Warren Wang <war...@wangspeed.com>
> Sent: Wednesday, October 12, 2016 10:02 PM
> To: Adam Kijak
> Cc: Abel Lopez; openstack-operators
> Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
>
> If fault domain is a concern, you can always split the cloud up into 3 regions, each having a dedicated Ceph cluster. It isn't necessarily going to mean more hardware, just logical splits. This is kind of assuming that the network doesn't share the same fault domain though.

This is not an option because having Region1-1, Region1-2, ..., Region1-10 would not be very convenient for users.

> Alternatively, you can split the hardware for the Ceph boxes into multiple clusters, and use multi-backend Cinder to talk to the same set of hypervisors to use multiple Ceph clusters. We're doing that to migrate from one Ceph cluster to another. You can even mount a volume from each cluster into a single instance.

Multiple Ceph clusters on Cinder is not a problem, I agree. Unfortunately we use Ceph for Nova (disks of instances are on Ceph directly).

> Keep in mind that you don't really want to shrink a Ceph cluster too much. What's "too big"? You should keep growing so that the fault domains aren't too small (3 physical racks min), or you guarantee that the entire cluster stops if you lose network.
>
> Just my 2 cents,

Thanks!
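Warren's multi-cluster approach maps onto Cinder's multi-backend support. A sketch of the cinder.conf involved; the backend names, paths, and secret UUIDs are placeholders, not taken from the thread:

```ini
[DEFAULT]
enabled_backends = ceph-old,ceph-new

[ceph-old]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = ceph-old
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph-old.conf
rbd_user = cinder
rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337

[ceph-new]
volume_driver = cinder.volume.drivers.rbd.RBDDriver
volume_backend_name = ceph-new
rbd_pool = volumes
rbd_ceph_conf = /etc/ceph/ceph-new.conf
rbd_user = cinder
rbd_secret_uuid = 5b2a5bc1-0000-4000-8000-000000000001
```

Each backend is then exposed via a volume type keyed on `volume_backend_name`, which is what lets one set of hypervisors attach volumes from both clusters during a migration.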
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
Excerpts from Adam Kijak's message of 2016-10-12 12:23:41 +:
> > > From: Xav Paice <xavpa...@gmail.com>
> > > Sent: Monday, October 10, 2016 8:41 PM
> > > To: openstack-operators@lists.openstack.org
> > > Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
> > >
> > > On Mon, 2016-10-10 at 13:29 +, Adam Kijak wrote:
> > > > Hello,
> > > >
> > > > We use a Ceph cluster for Nova (Glance and Cinder as well) and over time, more and more data is stored there. We can't keep the cluster so big because of Ceph's limitations. Sooner or later it needs to be closed for adding new instances, images and volumes. Not to mention it's a big failure domain.
> > >
> > > I'm really keen to hear more about those limitations.
> >
> > Basically it's all related to the failure domain ("blast radius") and risk management. A bigger Ceph cluster means more users.

Are these risks well documented? Since Ceph is specifically designed _not_ to have the kind of large blast radius that one might see with, say, a centralized SAN, I'm curious to hear what events trigger cluster-wide blasts.

> > Growing the Ceph cluster temporarily slows it down, so many users will be affected.

One might say that a Ceph cluster that can't be grown without the users noticing is an over-subscribed Ceph cluster. My understanding is that one is always advised to provision a certain amount of cluster capacity for growing and replicating to replaced drives.

> > There are bugs in Ceph which can cause data corruption. It's rare, but when it happens it can affect many (maybe all) users of the Ceph cluster.

:(
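The "many (maybe all) users affected" risk has a simple statistical core: with random placement, losing even a single PG touches almost every large RBD image in the pool. A sketch of the estimate, with hypothetical cluster and image sizes:

```python
def p_image_touched(pgs_in_pool, image_objects, pgs_lost=1):
    """Probability that a given RBD image had at least one object in a
    lost PG, assuming objects are spread uniformly over the pool's PGs."""
    p_safe_per_object = 1 - pgs_lost / pgs_in_pool
    return 1 - p_safe_per_object ** image_objects

# Hypothetical: pool with 4096 PGs; a 100 GB image in 4 MB objects
# is 25600 objects.
print(p_image_touched(4096, 25600))
```

With those numbers the probability comes out above 99%, which is why a "tiny fraction of objects" lost can still mean an unhappy user behind nearly every volume.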
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
If fault domain is a concern, you can always split the cloud up into 3 regions, each having a dedicated Ceph cluster. It isn't necessarily going to mean more hardware, just logical splits. This is kind of assuming that the network doesn't share the same fault domain though.

Alternatively, you can split the hardware for the Ceph boxes into multiple clusters, and use multi-backend Cinder to talk to the same set of hypervisors to use multiple Ceph clusters. We're doing that to migrate from one Ceph cluster to another. You can even mount a volume from each cluster into a single instance.

Keep in mind that you don't really want to shrink a Ceph cluster too much. What's "too big"? You should keep growing so that the fault domains aren't too small (3 physical racks min), or you guarantee that the entire cluster stops if you lose network.

Just my 2 cents,
Warren

On Wed, Oct 12, 2016 at 8:35 AM, Adam Kijak <adam.ki...@corp.ovh.com> wrote:
> > From: Abel Lopez <alopg...@gmail.com>
> > Sent: Monday, October 10, 2016 9:57 PM
> > To: Adam Kijak
> > Cc: openstack-operators
> > Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
> >
> > Have you thought about dedicated pools for cinder/nova and a separate pool for glance, and any other uses you might have? You need to set up secrets on KVM, but you can have Cinder creating volumes from Glance images quickly in different pools.
>
> We already have separate pools for images, volumes and instances. Separate pools don't really split the failure domain though. Also, AFAIK you can't set up multiple pools for instances in nova.conf, right?
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
> From: Abel Lopez <alopg...@gmail.com>
> Sent: Monday, October 10, 2016 9:57 PM
> To: Adam Kijak
> Cc: openstack-operators
> Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
>
> Have you thought about dedicated pools for cinder/nova and a separate pool for glance, and any other uses you might have? You need to set up secrets on KVM, but you can have Cinder creating volumes from Glance images quickly in different pools.

We already have separate pools for images, volumes and instances. Separate pools don't really split the failure domain though. Also, AFAIK you can't set up multiple pools for instances in nova.conf, right?
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
> From: Xav Paice <xavpa...@gmail.com>
> Sent: Monday, October 10, 2016 8:41 PM
> To: openstack-operators@lists.openstack.org
> Subject: Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
>
> On Mon, 2016-10-10 at 13:29 +, Adam Kijak wrote:
> > Hello,
> >
> > We use a Ceph cluster for Nova (Glance and Cinder as well) and over time, more and more data is stored there. We can't keep the cluster so big because of Ceph's limitations. Sooner or later it needs to be closed for adding new instances, images and volumes. Not to mention it's a big failure domain.
>
> I'm really keen to hear more about those limitations.

Basically it's all related to the failure domain ("blast radius") and risk management. A bigger Ceph cluster means more users. Growing the Ceph cluster temporarily slows it down, so many users will be affected. There are bugs in Ceph which can cause data corruption. It's rare, but when it happens it can affect many (maybe all) users of the Ceph cluster.

> > How do you handle this issue? What is your strategy to divide Ceph clusters between compute nodes? How do you solve VM snapshot placement and migration issues then (snapshots will be left on the older Ceph)?
>
> Having played with Ceph and compute on the same hosts, I'm a big fan of separating them and having dedicated Ceph hosts, and dedicated compute hosts. That allows me a lot more flexibility with hardware configuration and maintenance, easier troubleshooting for resource contention, and also allows scaling at different rates.

Exactly, I consider it the best practice as well.
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
Have you thought about dedicated pools for cinder/nova and a separate pool for glance, and any other uses you might have? You need to set up secrets on KVM, but you can have Cinder creating volumes from Glance images quickly in different pools.

> On Oct 10, 2016, at 6:29 AM, Adam Kijak wrote:
>
> Hello,
>
> We use a Ceph cluster for Nova (Glance and Cinder as well) and over time, more and more data is stored there. We can't keep the cluster so big because of Ceph's limitations. Sooner or later it needs to be closed for adding new instances, images and volumes. Not to mention it's a big failure domain.
>
> How do you handle this issue? What is your strategy to divide Ceph clusters between compute nodes? How do you solve VM snapshot placement and migration issues then (snapshots will be left on the older Ceph)?
>
> We've been thinking about features like: dynamic Ceph configuration (not static like in nova.conf) in Nova, pinning instances to a Ceph cluster, etc. What do you think about that?
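The pool-per-service layout Abel suggests ends with Nova pointed at its own RBD pool in nova.conf's libvirt section. A sketch of the relevant options; pool names and the libvirt secret UUID are placeholders, not values from the thread:

```ini
[libvirt]
images_type = rbd
images_rbd_pool = vms
images_rbd_ceph_conf = /etc/ceph/ceph.conf
rbd_user = cinder
rbd_secret_uuid = 457eb676-33da-42ec-9a8c-9293d545c337
```

The `rbd_secret_uuid` is the libvirt secret holding the Ceph client key, which is the "secrets on kvm" step Abel mentions; note this section accepts exactly one `images_rbd_pool`, which is the single-pool limitation Adam raises in his reply.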
Re: [Openstack-operators] [openstack-operators][ceph][nova] How do you handle Nova on Ceph?
On Mon, 2016-10-10 at 13:29 +, Adam Kijak wrote:
> Hello,
>
> We use a Ceph cluster for Nova (Glance and Cinder as well) and over time, more and more data is stored there. We can't keep the cluster so big because of Ceph's limitations. Sooner or later it needs to be closed for adding new instances, images and volumes. Not to mention it's a big failure domain.

I'm really keen to hear more about those limitations.

> How do you handle this issue? What is your strategy to divide Ceph clusters between compute nodes? How do you solve VM snapshot placement and migration issues then (snapshots will be left on the older Ceph)?

Having played with Ceph and compute on the same hosts, I'm a big fan of separating them and having dedicated Ceph hosts, and dedicated compute hosts. That allows me a lot more flexibility with hardware configuration and maintenance, easier troubleshooting for resource contention, and also allows scaling at different rates.

> We've been thinking about features like: dynamic Ceph configuration (not static like in nova.conf) in Nova, pinning instances to a Ceph cluster, etc. What do you think about that?