Re: [vdsm] [RFC] GlusterFS domain specific changes

2012-09-10 Thread M. Mohan Kumar
On Fri, 7 Sep 2012 17:07:28 -0400 (EDT), Ayal Baron aba...@redhat.com wrote:
 
 
 - Original Message -
  As of now the BD xlator supports only working with linear logical
  volumes; they are thick provisioned. The gluster cli command
  "gluster volume create" with the option device=lv allows working with
  logical volumes as files.
  
  As a proof of concept I have code (not posted to the external list)
  where the option device=thin to the "gluster volume create" command
  allows working with thin provisioned targets. But it does not take
  care of resizing the thin pool when it reaches its low-level
  threshold. Supporting thin targets is on our TODO list. We have a
  dependency on the lvm2 library to provide APIs to create thin targets.
 
 I'm definitely missing some background here.
 1. Can the LV span multiple bricks in Gluster?
    i. If 'yes', then
       a. do you use Gluster's replication and distribution schemes to gain
          performance and redundancy?
       b. what performance gain is there over normal Gluster with files?
    ii. If 'no', then you're only exposing single-host local storage LVM? (in
        which case I don't see why Gluster is used at all, and where).


No, as of now the BD xlator works only with one brick. There are some issues
in supporting GlusterFS features such as replication and striping from the BD
xlator. We are still evaluating the BD xlator for such scenarios.

Advantages of the BD xlator:
 (*) Ease of use and unified management for both file- and block-based
 storage.
 (*) Making block devices available to nodes which don't have direct
 access to the SAN, and supporting migration to nodes which don't have
 SAN access.
 (*) With FS interfaces, it becomes easier to support T10 extensions
 like XCOPY and WRITE SAME (currently not supported; a future plan).
 (*) Use of dm-thin logical volumes to provide VM images that are
 inherently thin provisioned. It allows multi-level snapshots. Once we
 support thin-provisioned logical volumes with 'unmap', they are almost
 equivalent to sparse files. This is also a future plan (a rough sketch
 of the idea follows below).
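
A minimal sketch of the dm-thin idea above, using stock LVM commands (the
device=thin option would presumably drive something equivalent on the server
side; the VG/LV names here are made up for illustration):

  $ lvcreate --size 100G --thinpool pool0 vg1           # thin pool backing the VM images
  $ lvcreate --virtualsize 50G --thin vg1/pool0 -n vm1  # thin LV, space allocated only on write
  $ blkdiscard /dev/vg1/vm1                             # discard/unmap returns blocks to the pool,
                                                        # approximating a sparse file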

   
 
 From a different angle, the only benefit I can think of in exposing a fs 
 interface over LVM is for consumers who do not wish to know the details of 
 the underlying storage but want the performance gain of using block storage.
 vdsm is already intimately familiar with LVM and block devices, so adding the 
 FS layer scheme on top doesn't strike me as adding any value. In addition, 
 you require the consumer to know a lot about your interface because it's not
 truly an FS interface, e.g. the consumer is not allowed to create directories,
 files are not sparse, not to mention that if you're indeed using LVM then I
 don't think you're considering the VG MD and extent size limitations:
 1. LVM currently has severe limitations wrt the number of objects it can manage
 (the limitation is actually the size of the VG metadata, but the distinction
 is not important just yet).  This means that creating a metadata LV in
 addition to each data LV is very costly (at around 1000 LVs you'd hit a
 problem).  vdsm currently creates 2 files per snapshot (the data and a small
 file with metadata describing it), meaning that you'd reach this limit really
 fast.
 2. LVM max LV size is extent size * 65K, which means that if I choose a 4K
 extent size then my max LV size would be 256MB. This obviously won't do for
 VM disks, so you'd choose a much larger extent size.  However, a larger extent
 size means that each metadata file vdsm creates wastes a lot of storage
 space.  So even if LVM could scale, your storage utilization plummets and your
 $/MB ratio increases.
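
A quick sanity check of the numbers above, assuming the 65K-extent limit as
stated (the extent sizes are illustrative):

  $ echo "$((4 * 65536 / 1024)) MiB max LV"     # 4 KiB extents   -> 256 MiB
  $ echo "$((128 * 65536 / 1024)) GiB max LV"   # 128 MiB extents -> 8192 GiB (8 TiB),
                                                # but each tiny metadata LV then costs >= 128 MiB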
 The way around this is of course not to have a metadata file per volume but
 one file containing all the metadata, but then I'm fully aware of the
 limitations of the environment, and treating my objects as files gains me
 nothing (while requiring a new hybrid domain, a lot more code, etc.).


A GlusterFS + BD xlator domain will be similar to a block-based storage
domain. IIUC, with block-based storage VDSM does not create as many
LVs (files) as it does with POSIX-based storage.

The BD xlator provides a filesystem-like interface to create/manipulate
LVs, whereas in a block-based storage domain commands like lvcreate and
lvextend are used to manipulate them; i.e. the BD xlator provides an FS
interface for a block-based storage domain (see the mapping sketched below).
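
A rough sketch of that mapping, reusing the file-level commands shown later
in this thread (the LVM equivalents on the right are my reading of the
description, not the actual implementation):

  $ touch /mnt/bdvol/lv1                    # ~ lvcreate -n lv1 vg1
  $ truncate -s 5G /mnt/bdvol/lv1           # ~ lvresize -L 5G vg1/lv1
  $ ln /mnt/bdvol/lv1 /mnt/bdvol/lv2        # ~ full clone (create LV of same size + copy)
  $ ln -s /mnt/bdvol/lv1 /mnt/bdvol/lv1.sn  # ~ snapshot (lvcreate -s)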

In the future, when we have proper support for reflink [1], cp --reflink can
be used for creating a linked clone. There was also a discussion in the past
on a copyfile [2] interface, which could be used to create a full clone of LVs.

[1] http://marc.info/?l=linux-fsdevel&m=125296717319013&w=2
[2] http://www.spinics.net/lists/linux-nfs/msg26203.html

 Also note that without thin provisioning we lose our ability to create
 snapshots.
 
Could you please explain that?

Re: [vdsm] [RFC] GlusterFS domain specific changes

2012-09-07 Thread Itamar Heim

On 09/07/2012 08:21 AM, M. Mohan Kumar wrote:

On Thu, 6 Sep 2012 18:59:19 -0400 (EDT), Ayal Baron aba...@redhat.com wrote:



- Original Message -

- Original Message -

From: M. Mohan Kumar mo...@in.ibm.com
To: vdsm-devel@lists.fedorahosted.org
Sent: Wednesday, July 25, 2012 1:26:15 PM
Subject: [vdsm] [RFC] GlusterFS domain specific changes


We are developing a GlusterFS server translator to export block devices
as regular files to the client. Using block devices to serve VM images
gives performance improvements, since it avoids some file system
bottlenecks in the host kernel. The goal is to use one block device
(i.e. a file at the client side) per VM image and feed this file to QEMU
to get the performance improvements. QEMU will talk to the glusterfs
server directly using libgfapi.

Currently we support only exporting Volume Groups and Logical Volumes.
Logical volumes are exported as regular files to the client.


Are you actually using LVM behind the scenes?
If so, why bother with exposing the LVs as files and not raw block devices?


Ayal,

The idea is to provide an FS interface for managing block devices. One
can mount the Block Device Gluster volume and create an LV and size it
just by
  $ touch lv1
  $ truncate -s5G lv1

And other file commands can be used to clone and snapshot LVs:
  $ ln lv1 lv2        # clones
  $ ln -s lv1 lv1.sn  # creates a snapshot

By enabling this feature, GlusterFS can directly export storage in a
SAN. We are planning to add a feature to export LUNs as regular files as
well in the future.




In GlusterFS terminology, a volume capable of exporting block devices is
created by specifying the 'Volume Group' (i.e. a VG in Logical Volume
management). The Block Device translator (BD xlator) exports this volume
group as a directory and the LVs under it as regular files. In the gluster
mount point, creating a file results in creating a logical volume,
removing a file results in removing a logical volume, etc.

When a GlusterFS volume enabled with the BD xlator is used, directory
creation in that gluster mount path is not supported, because a directory
maps to a Volume Group in the BD xlator. But this could be an issue in a
VDSM environment: when a new VDSM volume is created for a GlusterFS
domain, VDSM mounts the storage domain, creates directories under it, and
creates files for the VM image and other uses (like metadata).



Is it possible to modify this behavior in VDSM to use a flat structure
instead of creating directories with VM images and other files underneath
them? I.e. for a GlusterFS domain with the BD xlator, VDSM would not
create any directory and would create all required files directly under
the mount point directory itself.


From your description I think that GlusterFS for block devices is
actually more similar to what happens with the regular block domains.
You would probably need to mount the share somewhere in the system and
then use symlinks to point to the volumes.

Create a regular block domain and look inside
/rhev/data-center/mnt/blockSD; you'll probably get the idea of what I mean.

That said, we'd need to come up with a way of extending the LVs on the
gluster server when required (for thin provisioning).


Why? If it's exposed as a file, that probably means it supports sparseness.
I.e. if this becomes a new type of block domain it should only support
'preallocated' images.



To start using the LVs we will always do a truncate to the required
size, which resizes the LV. I didn't get what you mean about
thin provisioning, but I have rough code using dm-thin targets showing that
the BD xlator can be extended to use dm-thin targets for thin provisioning.


So even though this is block storage, it will be extended as needed? How
does that work exactly?

Say I have a VM with a 100GB disk.
Thin provisioning means we only allocate 1GB to it, then as the guest
uses that storage, we allocate more as needed (lvextend, pause guest,
lvrefresh, resume guest).





Re: [vdsm] [RFC] GlusterFS domain specific changes

2012-09-07 Thread M. Mohan Kumar
On Fri, 07 Sep 2012 09:35:10 +0300, Itamar Heim ih...@redhat.com wrote:
 On 09/07/2012 08:21 AM, M. Mohan Kumar wrote:
  On Thu, 6 Sep 2012 18:59:19 -0400 (EDT), Ayal Baron aba...@redhat.com 
  wrote:
 
 

 
  To start using the LVs we will always do a truncate to the required
  size, which resizes the LV. I didn't get what you mean about
  thin provisioning, but I have rough code using dm-thin targets showing
  that the BD xlator can be extended to use dm-thin targets for thin
  provisioning.
 
 So even though this is block storage, it will be extended as needed? How
 does that work exactly?
 Say I have a VM with a 100GB disk.
 Thin provisioning means we only allocate 1GB to it, then as the guest
 uses that storage, we allocate more as needed (lvextend, pause guest,
 lvrefresh, resume guest).
 
 

When we use device=lv, it means we use only thick-provisioned logical
volumes. If such a logical volume runs out of space in the guest, one can
resize it from the client by using truncate (which results in an lvresize
at the server side) and then run filesystem tools in the guest to use the
added space.

But with the device=thin type, all LVs are thinly provisioned and
allocating space to them is taken care of by the device-mapper thin target
automatically. The thin pool should have enough space to accommodate the
sizing requirements. A rough sketch of both flows follows.
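
A minimal sketch of the two cases described above (paths, device names and
the VG/pool names are illustrative; the client-side truncate-to-resize
behavior is as described in this thread):

  # device=lv: grow a thick LV from the client, then grow the filesystem in the guest
  $ truncate -s 20G /mnt/bdvol/lv1        # client side: results in lvresize on the server
  guest$ resize2fs /dev/vdb               # guest side: let the filesystem use the added space

  # device=thin: allocation happens automatically; watch pool usage on the server
  server$ lvs -o lv_name,data_percent vg1/pool0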



Re: [vdsm] [RFC] GlusterFS domain specific changes

2012-09-07 Thread M. Mohan Kumar
On Fri, 07 Sep 2012 14:23:08 +0800, Shu Ming shum...@linux.vnet.ibm.com wrote:
 On 2012-9-7 13:21, M. Mohan Kumar wrote:
  On Thu, 6 Sep 2012 18:59:19 -0400 (EDT), Ayal Baron aba...@redhat.com 
  wrote:
 
  - Original Message -
  - Original Message -
  From: M. Mohan Kumar mo...@in.ibm.com
  To: vdsm-devel@lists.fedorahosted.org
  Sent: Wednesday, July 25, 2012 1:26:15 PM
  Subject: [vdsm] [RFC] GlusterFS domain specific changes
 
 
  We are developing a GlusterFS server translator to export block devices
  as regular files to the client. Using block devices to serve VM images
  gives performance improvements, since it avoids some file system
  bottlenecks in the host kernel. The goal is to use one block device
  (i.e. a file at the client side) per VM image and feed this file to
  QEMU to get the performance improvements. QEMU will talk to the
  glusterfs server directly using libgfapi.
 
  Currently we support only exporting Volume Groups and Logical Volumes.
  Logical volumes are exported as regular files to the client.
  Are you actually using LVM behind the scenes?
  If so, why bother with exposing the LVs as files and not raw block devices?
 
  Ayal,
 
  The idea is to provide an FS interface for managing block devices. One
  can mount the Block Device Gluster volume and create an LV and size it
  just by
    $ touch lv1
    $ truncate -s5G lv1
 
  And other file commands can be used to clone and snapshot LVs:
    $ ln lv1 lv2        # clones
    $ ln -s lv1 lv1.sn  # creates a snapshot
 Do we have a special reason to use ln?
 Why not use cp as the command to do the snapshot instead of ln?

cp involves opening the source file in read-only mode, opening/creating
the destination file in write mode, and issuing a series of reads on the
source file and writes to the destination file until the end of the
source file.

But we can't apply this to logical volume copy (or clone): when we create
a logical volume we have to specify its size, and that is not possible
with the above approach, i.e. open/create does not take a size parameter,
so we can't create the destination LV with the required size.

But if I use the link interface to copy LVs: VFS/FUSE/GlusterFS provides a
link() interface that takes a source file and a destination file name. In
the BD xlator link() code, I get the size of the source LV, create the
destination LV with that size, and copy the contents (a rough server-side
sketch follows below).

This problem could be solved if we had a syscall copyfile(source, dest,
size). There have been discussions in the past on a copyfile() interface
which could be used for this kind of copy:
http://www.spinics.net/lists/linux-nfs/msg26203.html
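
A hedged sketch of what the server side of "ln lv1 lv2" would amount to,
per the description above (the VG name is made up and these are plain LVM
commands, not the actual xlator code):

  $ size=$(lvs --noheadings --units b -o lv_size vg1/lv1 | tr -d ' B')
  $ lvcreate -L "${size}b" -n lv2 vg1                    # destination LV of the same size
  $ dd if=/dev/vg1/lv1 of=/dev/vg1/lv2 bs=4M conv=fsync  # copy the contents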

 
  By enabling this feature, GlusterFS can directly export storage in a
  SAN. We are planning to add a feature to export LUNs as regular files
  as well in the future.
 
 IMO, the major feature of GlusterFS is to export distributed local disks
 to the clients. If we have a SAN in the backend, that means the storage
 block devices should be exported to clients naturally. Why do we need
 GlusterFS to export the block devices in the SAN?
 

By enabling this feature we are allowing GlusterFS to work with local
storage, NAS storage and SAN storage, i.e. it allows machines that are not
directly connected to the SAN to access block devices from it.

Also, providing block devices as VM disk images has some advantages:
 * it does not incur host-side filesystem overhead
 * if storage arrays provide storage offload features such as FlashCopy,
   they can be exploited (these offloads are usually at LUN level)



Re: [vdsm] [RFC] GlusterFS domain specific changes

2012-09-06 Thread M. Mohan Kumar
On Thu, 6 Sep 2012 18:59:19 -0400 (EDT), Ayal Baron aba...@redhat.com wrote:
 
 
 - Original Message -
  - Original Message -
   From: M. Mohan Kumar mo...@in.ibm.com
   To: vdsm-devel@lists.fedorahosted.org
   Sent: Wednesday, July 25, 2012 1:26:15 PM
   Subject: [vdsm] [RFC] GlusterFS domain specific changes
   
   
   We are developing a GlusterFS server translator to export block devices
   as regular files to the client. Using block devices to serve VM images
   gives performance improvements, since it avoids some file system
   bottlenecks in the host kernel. The goal is to use one block device
   (i.e. a file at the client side) per VM image and feed this file to
   QEMU to get the performance improvements. QEMU will talk to the
   glusterfs server directly using libgfapi.
   
   Currently we support only exporting Volume Groups and Logical Volumes.
   Logical volumes are exported as regular files to the client.
 
 Are you actually using LVM behind the scenes?
 If so, why bother with exposing the LVs as files and not raw block devices?

Ayal,

The idea is to provide an FS interface for managing block devices. One
can mount the Block Device Gluster volume and create an LV and size it
just by
 $ touch lv1
 $ truncate -s5G lv1

And other file commands can be used to clone and snapshot LVs:
 $ ln lv1 lv2        # clones
 $ ln -s lv1 lv1.sn  # creates a snapshot

By enabling this feature, GlusterFS can directly export storage in a
SAN. We are planning to add a feature to export LUNs as regular files as
well in the future.

   
   
   In GlusterFS terminology, a volume capable of exporting block devices
   is created by specifying the 'Volume Group' (i.e. a VG in Logical
   Volume management). The Block Device translator (BD xlator) exports
   this volume group as a directory and the LVs under it as regular
   files. In the gluster mount point, creating a file results in creating
   a logical volume, removing a file results in removing a logical
   volume, etc.
   
   When a GlusterFS volume enabled with the BD xlator is used, directory
   creation in that gluster mount path is not supported, because a
   directory maps to a Volume Group in the BD xlator. But this could be
   an issue in a VDSM environment: when a new VDSM volume is created for
   a GlusterFS domain, VDSM mounts the storage domain, creates
   directories under it, and creates files for the VM image and other
   uses (like metadata).
 
   Is it possible to modify this behavior in VDSM to use a flat structure
   instead of creating directories with VM images and other files
   underneath them? I.e. for a GlusterFS domain with the BD xlator, VDSM
   would not create any directory and would create all required files
   directly under the mount point directory itself.
  
  From your description I think that GlusterFS for block devices is
  actually more similar to what happens with the regular block domains.
  You would probably need to mount the share somewhere in the system and
  then use symlinks to point to the volumes.
  
  Create a regular block domain and look inside
  /rhev/data-center/mnt/blockSD; you'll probably get the idea of what I
  mean.
  
  That said, we'd need to come up with a way of extending the LVs on the
  gluster server when required (for thin provisioning).
 
 Why? If it's exposed as a file, that probably means it supports sparseness.
 I.e. if this becomes a new type of block domain it should only support
 'preallocated' images.
 

To start using the LVs we will always do a truncate to the required
size, which resizes the LV. I didn't get what you mean about
thin provisioning, but I have rough code using dm-thin targets showing that
the BD xlator can be extended to use dm-thin targets for thin provisioning.



Re: [vdsm] [RFC] GlusterFS domain specific changes

2012-08-02 Thread M. Mohan Kumar

Hello, any suggestions/feedback on this?

On Wed, 25 Jul 2012 16:56:15 +0530, M. Mohan Kumar mo...@in.ibm.com wrote:
 
 We are developing a GlusterFS server translator to export block devices
 as regular files to the client. Using block devices to serve VM images
 gives performance improvements, since it avoids some file system
 bottlenecks in the host kernel. The goal is to use one block device
 (i.e. a file at the client side) per VM image and feed this file to QEMU
 to get the performance improvements. QEMU will talk to the glusterfs
 server directly using libgfapi.
 
 Currently we support only exporting Volume Groups and Logical Volumes.
 Logical volumes are exported as regular files to the client. In
 GlusterFS terminology, a volume capable of exporting block devices is
 created by specifying the 'Volume Group' (i.e. a VG in Logical Volume
 management). The Block Device translator (BD xlator) exports this volume
 group as a directory and the LVs under it as regular files. In the
 gluster mount point, creating a file results in creating a logical
 volume, removing a file results in removing a logical volume, etc.
 
 When a GlusterFS volume enabled with the BD xlator is used, directory
 creation in that gluster mount path is not supported, because a
 directory maps to a Volume Group in the BD xlator. But this could be an
 issue in a VDSM environment: when a new VDSM volume is created for a
 GlusterFS domain, VDSM mounts the storage domain, creates directories
 under it, and creates files for the VM image and other uses (like
 metadata).
 
 Is it possible to modify this behavior in VDSM to use a flat structure
 instead of creating directories with VM images and other files
 underneath them? I.e. for a GlusterFS domain with the BD xlator, VDSM
 would not create any directory and would create all required files
 directly under the mount point directory itself.
 
 Note:
 Patches to enable exporting block devices as regular files are available
 in the Gluster Gerrit system:
 http://review.gluster.com/3551
 



[vdsm] [RFC] GlusterFS domain specific changes

2012-07-25 Thread M. Mohan Kumar

We are developing a GlusterFS server translator to export block devices
as regular files to the client. Using block devices to serve VM images
gives performance improvements, since it avoids some file system
bottlenecks in the host kernel. The goal is to use one block device
(i.e. a file at the client side) per VM image and feed this file to QEMU
to get the performance improvements. QEMU will talk to the glusterfs
server directly using libgfapi.

Currently we support only exporting Volume Groups and Logical Volumes.
Logical volumes are exported as regular files to the client. In
GlusterFS terminology, a volume capable of exporting block devices is
created by specifying the 'Volume Group' (i.e. a VG in Logical Volume
management). The Block Device translator (BD xlator) exports this volume
group as a directory and the LVs under it as regular files. In the
gluster mount point, creating a file results in creating a logical
volume, removing a file results in removing a logical volume, etc. (A
short end-to-end sketch follows below.)
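
A minimal end-to-end sketch of the workflow just described; the exact
"device=lv" volume-create syntax is taken from earlier in this thread and may
differ from the patches under review, and the host/VG names are made up:

  $ gluster volume create bdvol device=lv storage1:/vg1   # export VG 'vg1' via the BD xlator
  $ gluster volume start bdvol
  $ mount -t glusterfs storage1:/bdvol /mnt/bdvol          # the VG appears as a directory
  $ touch /mnt/bdvol/vm1.img                               # creates LV 'vm1.img' in vg1
  $ truncate -s 20G /mnt/bdvol/vm1.img                     # sizes (lvresizes) it
  $ rm /mnt/bdvol/vm1.img                                  # removes the LV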

When a GlusterFS volume enabled with the BD xlator is used, directory
creation in that gluster mount path is not supported, because a directory
maps to a Volume Group in the BD xlator. But this could be an issue in a
VDSM environment: when a new VDSM volume is created for a GlusterFS
domain, VDSM mounts the storage domain, creates directories under it, and
creates files for the VM image and other uses (like metadata).

Is it possible to modify this behavior in VDSM to use a flat structure
instead of creating directories with VM images and other files underneath
them? I.e. for a GlusterFS domain with the BD xlator, VDSM would not
create any directory and would create all required files directly under
the mount point directory itself.

Note:
Patches to enable exporting block devices as regular files are available
in the Gluster Gerrit system:
http://review.gluster.com/3551
