Re: [ceph-users] Deployment with Xen

2018-02-15 Thread David Turner
Glad I could help. Proxmox is a prebuilt KVM/QEMU hypervisor with Ceph
integration that may be worth looking into. Booting from RBDs is definitely
possible. There should be some resources in the Ceph documentation, or
people on the ML who already know how to do it.
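
Fwiw, here is a minimal sketch of what booting a KVM guest straight from
an RBD image could look like, driven from Python via subprocess. The pool
and image names and the memory size are made up for illustration, and it
assumes qemu was built with rbd support and that /etc/ceph/ceph.conf and a
keyring are in place:

    import subprocess

    # Hypothetical pool "vms" and image "guest-disk"; qemu speaks the
    # rbd: protocol through librbd, so the guest boots directly from
    # the cluster -- no /dev/rbd* mapping on the host is needed.
    image_spec = "rbd:vms/guest-disk:id=admin"

    subprocess.run([
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", "2048",
        "-drive", f"format=raw,file={image_spec},if=virtio",
    ], check=True)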


Re: [ceph-users] Deployment with Xen

2018-02-15 Thread Egoitz Aurrekoetxea
Good morning David!!

First of all, I want to hugely thank you for the mail you sent yesterday.
You don't get that kind of advice from an expert in the field every day.
I printed the mail and read it slowly to make sure I understood it
properly.

Basically, I wanted to confirm there's no single point of failure, and to
hear your opinion or ideas on hypervisors.

I'm now trying KVM. Although Qemu is able to create that kind of disk
image, I'm not totally sure whether it can boot from them, which is
something very useful for us. In Xen I managed to access the cluster
space through krbd, but as you said yesterday, that's not the most
optimized config, because it should go through librbd to be totally
optimal. I'm not really sure it can boot from those disks, although I
have read that it can. It seems the CentOS packages have the rbd support
built in, so I can in fact create disks... but I'm just not able to boot
from them.

Well, I assume that last topic will finally be clarified on the KVM/Qemu
list.

I just wanted to send all my gratitude for your help :)

Thanks mate,

Cheers,


Re: [ceph-users] Deployment with Xen

2018-02-14 Thread David Turner
First off, to answer your questions about mons, you need to understand
that they work in a Paxos quorum.  What that means is that there needs to
be a majority of mons that agree that they are in charge.  This is why an
even number of mons is a bad idea: they can potentially split themselves
in half.  For this case, let's say you have 3 mons.  2 of them need to be
up and communicating in order to agree that they can respond to clients.
If the third mon is online, but networking trouble is keeping it from
communicating with the other 2 mons, it will realize that it isn't part
of the quorum and will refuse to respond to anyone that asks it questions.
I think there might be some logic for allowing 1 mon to manage the
cluster, but that works best if the other mons shut down properly,
informing the surviving mon that they are going offline so it isn't up to
a vote who is in charge.
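
To make the counting argument concrete, here is a trivial sketch (not
Ceph code, just the arithmetic):

    def quorum_majority(num_mons: int) -> int:
        # A quorum needs strictly more than half of the mons.
        return num_mons // 2 + 1

    for n in (3, 4, 5):
        m = quorum_majority(n)
        print(f"{n} mons: majority {m}, tolerates {n - m} failure(s)")

    # 3 mons: majority 2, tolerates 1 failure(s)
    # 4 mons: majority 3, tolerates 1 failure(s)  <- no better than 3
    # 5 mons: majority 3, tolerates 2 failure(s)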

Lifecycle of a client and a mon.  When a client first communicates with a
Ceph cluster it uses the mon_host setting in its ceph.conf file to know who
the mons are.  It goes through the list until it gets one that will
authoritatively respond for the cluster and give it the osd map.  Now that
it has an osd map it can start communicating with all of the osds in the
cluster, reading, writing, mounting, etc.  This is usually where a client
stops talking to mons.  As a client is talking with osds, the osds will
respond back with updated maps if there are any.  This change was made in
the Hammer release of Ceph.  Before that, all map updates were handled by
the mons, and it was such a burden on them that in practice a cluster
couldn't grow much beyond about 1,000 osds; the mons simply couldn't keep
up with managing the maps for any more.  Since Hammer, osds update each
other's osd maps as they communicate with each other.  If anything is
confused as to which map to use, it still asks the mon and the mon will
tell it the right one.
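
As an illustration of that bootstrap, this is roughly what it looks like
through the librados Python binding. The monitor addresses are
placeholders and the snippet is a sketch rather than a tested program,
but Rados.mon_command() and the "osd dump" command it wraps are real:

    import json
    import rados

    # The client only needs mon_host to find the cluster; after the
    # handshake it talks to the OSDs directly using the map it got.
    cluster = rados.Rados(
        conffile="/etc/ceph/ceph.conf",
        conf={"mon_host": "10.0.0.1,10.0.0.2,10.0.0.3"},  # placeholders
    )
    cluster.connect()

    # Ask a mon for the current OSD map (same as `ceph osd dump`).
    ret, outbuf, outs = cluster.mon_command(
        json.dumps({"prefix": "osd dump", "format": "json"}), b"")
    osdmap = json.loads(outbuf)
    print("osdmap epoch:", osdmap["epoch"])

    cluster.shutdown()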

If a mon goes down, the rest of the mon_host list will be used to know
who to contact.  A client might fail against the down mon, but it will
retry and get to one that is online.  Mons are the keepers of cephx auth
keys and map versions, but other than that they really don't impact
performance much.  Everything else is handled by the algorithms in the
osd map, which tell a client where all objects and osds are in the
cluster, and the majority of map updates will come from the communication
with the osds.
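
If you want to watch that failover from the mon side, the quorum can be
inspected the same way; this sketch wraps the same "quorum_status"
command the `ceph quorum_status` CLI uses:

    import json
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()

    # Any mon in quorum will answer; a mon that dropped out simply
    # stops appearing in quorum_names.
    ret, outbuf, outs = cluster.mon_command(
        json.dumps({"prefix": "quorum_status", "format": "json"}), b"")
    status = json.loads(outbuf)
    print("in quorum:", status["quorum_names"])
    print("leader:", status["quorum_leader_name"])

    cluster.shutdown()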

Back to VMs and librbd vs krbd (which is /dev/rbd* devices).  The kernel
driver does not have feature parity with Ceph.  Even the latest kernel does
not support all Ceph RBD features and you will have to disable them in your
cluster.  This disables things like object map which is how Ceph keeps
track of which objects do and don't exist in an RBD.  Without object map,
Ceph has to assume that every object that can exist in an RBD does exist.
With object map, if you delete an RBD, Ceph issues a delete for only the
objects that actually exist; without it, Ceph has to attempt to delete
every object regardless of whether it exists.  Checking the used space of
an RBD with object map is instant; checking it without object map can
take several minutes on RBDs that are only 100GB in size (this is even
worse if you are using snapshots, as it has to check for every object
that can possibly exist on the RBD itself as well as on the snapshots).
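
On the librbd side, here is a hedged sketch with the Python rbd binding:
it creates an image with the object-map (and fast-diff) features enabled
and then walks the allocated extents with diff_iterate(), which is
essentially what a used-space check boils down to. The pool and image
names are invented; the feature constants and diff_iterate() are real
parts of the binding, but treat the whole thing as a sketch:

    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")  # assumes a pool named "rbd"

    # object-map requires exclusive-lock; fast-diff requires object-map.
    features = (rbd.RBD_FEATURE_LAYERING
                | rbd.RBD_FEATURE_EXCLUSIVE_LOCK
                | rbd.RBD_FEATURE_OBJECT_MAP
                | rbd.RBD_FEATURE_FAST_DIFF)

    rbd.RBD().create(ioctx, "demo-image", 10 * 1024**3, features=features)

    used = 0
    def account(offset, length, exists):
        global used
        if exists:
            used += length

    with rbd.Image(ioctx, "demo-image") as image:
        # With object-map/fast-diff this returns almost instantly;
        # without those features every possible object must be probed.
        image.diff_iterate(0, image.size(), None, account)

    print("allocated bytes:", used)
    ioctx.close()
    cluster.shutdown()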

librbd has feature parity with Ceph, since it is updated and versioned in
lockstep with Ceph on every release; krbd is still trying to implement
RBD features released over a year ago.  I prefer to use the Ceph
libraries as often as possible, then the fuse drivers (except rbd-fuse,
because it is slower than dirt), and only if I have no other choice will
I use the kernel drivers.  When it comes to choosing a hypervisor for
hosting VMs on RBDs, there is no question in my mind that I would only
look at options that use librbd.


Re: [ceph-users] Deployment with Xen

2018-02-13 Thread Egoitz Aurrekoetxea
Hi David!!

Thanks a lot for your answer. But what happens when you have, say, two or
more monitors and one of them becomes unresponsive? Is another one used
after a timeout? And what happens when a client needs to query a monitor
(to learn where some data lives) and it doesn't answer? Is a monitor that
has become unresponsive discarded for the following queries about where
data exists in the cluster?

So, putting it another way: in terms of performance, you wouldn't use any
kind of solution that doesn't go through librbd? Is the performance poor
or bad when using mounted /dev/rbdX devices? Or do you say that in terms
of data integrity?

I was planning to use Xen with Ceph, but after your advice... Would you
definitely go with KVM?

Thanks a lot again.
Cheers,

Egoitz



Re: [ceph-users] Deployment with Xen

2018-02-13 Thread David Turner
Monitors are not required for accessing data from the Ceph cluster.
Clients will ask a monitor for a current OSD map and then use that OSD map
to communicate with the OSDs directly for all reads and writes.  The map
includes the crush map which has all of the information a client needs to
know where every object is in the cluster.  Having 3 mons is a good
number for small deployments; 5 mons gives better redundancy in the
monitor quorum.  Always avoid an even number of mons.

librbd is definitely the way to go for accessing RBDs from a hypervisor,
as opposed to fuse or krbd.  For a quick and easy hypervisor using Ceph,
I like Proxmox.  It natively has the ability to use KVM with Ceph without
you having to configure it yourself.  It also comes with a nice GUI to
see the console screen for your VMs, and has a fairly simple guide for
clustering hypervisors together to provide HA support for your VMs.  For
larger-scale VM deployments, OpenStack is probably the way I would go.
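
For a small taste of what going through librbd (rather than krbd) means
in practice, here is a hedged sketch with the Python bindings: it creates
an image and does I/O against it without ever mapping a /dev/rbd* device.
The pool and image names are placeholders:

    import rados
    import rbd

    # Connect with the cluster's own config; no kernel module involved.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")  # placeholder pool name

    rbd.RBD().create(ioctx, "vm-disk-0", 4 * 1024**3)  # 4 GiB image

    with rbd.Image(ioctx, "vm-disk-0") as image:
        image.write(b"hello from librbd", 0)  # write at offset 0
        print(image.read(0, 17))              # read it back

    ioctx.close()
    cluster.shutdown()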



[ceph-users] Deployment with Xen

2018-02-13 Thread Egoitz Aurrekoetxea
Good afternoon,

As I'm new to Ceph, I was wondering what the most proper way to use it
with the Xen hypervisor would be (with a plain Linux installation, CentOS
for instance). I have read that the least proper way is to just mount the
/dev/rbdX device at a mount point and expose that space to the
hypervisor, but I find it pretty easy and it seems stable. It doesn't
seem to perform badly either... Is it better to use, for instance, librbd
with KVM? Does it perform better?

By the way, it seems the monitor node is used in order to access the
space in the OSD cluster. I have also read that Ceph was designed with no
single point of failure in mind, but... is it possible to configure
several monitor nodes and then, after a very short timeout or similar,
access the file system through the other nodes? What would be the most
proper way of configuring this so a machine doesn't lose its storage if
the monitor fails? Could you please point me in the right direction?
Perhaps with several monitors or...

By the way, if you think it would be better to use another hypervisor or
config (with librados or whatever) with Ceph, could you please suggest
that too? Help the newbie :p :) :)

Best regards,

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com