Re: [vdsm] [RFC] GlusterFS domain specific changes
On 09/07/2012 08:21 AM, M. Mohan Kumar wrote:

On Thu, 6 Sep 2012 18:59:19 -0400 (EDT), Ayal Baron <aba...@redhat.com> wrote:

From: M. Mohan Kumar <mo...@in.ibm.com>
To: vdsm-devel@lists.fedorahosted.org
Sent: Wednesday, July 25, 2012 1:26:15 PM
Subject: [vdsm] [RFC] GlusterFS domain specific changes

[M. Mohan Kumar, original RFC]
We are developing a GlusterFS server translator to export block devices as regular files to the client. Using block devices to serve VM images gives performance improvements, since it avoids some file system bottlenecks in the host kernel. The goal is to use one block device (i.e. a file on the client side) per VM image and feed this file to QEMU to get the performance improvements. QEMU will talk to the GlusterFS server directly using libgfapi.

Currently we support exporting only volume groups and logical volumes. Logical volumes are exported as regular files to the client.

[Ayal Baron]
Are you actually using LVM behind the scenes? If so, why bother with exposing the LVs as files and not raw block devices?

[M. Mohan Kumar]
Ayal,

The idea is to provide a FS interface for managing block devices. One can mount the block device Gluster volume, then create an LV and size it just by:

$ touch lv1
$ truncate -s5G lv1

Other file commands can be used to clone and snapshot LVs:

$ ln lv1 lv2 # clones
$ ln -s lv1 lv1.sn # creates snapshot

By enabling this feature, GlusterFS can directly export storage in a SAN. We are planning to add a feature to export LUNs as regular files as well in the future.

In GlusterFS terminology, a volume capable of exporting block devices is created by specifying the Volume Group (i.e. a VG in Logical Volume Management). The Block Device translator (BD xlator) exports this volume group as a directory and the LVs under it as regular files. At the gluster mount point, creating a file results in creating a logical volume, removing a file results in removing a logical volume, and so on.

When a GlusterFS volume enabled with the BD xlator is used, directory creation under that gluster mount path is not supported, because a directory maps to a volume group in the BD xlator. This could be an issue in a VDSM environment: when a new VDSM volume is created for a GlusterFS domain, VDSM mounts the storage domain, creates directories under it, and creates files underneath for the VM image and other uses (like metadata). Is it possible to modify this behavior in VDSM to use a flat structure instead of creating directories with VM images and other files underneath them? That is, for a GlusterFS domain with the BD xlator, VDSM would create no directories and would create all required files directly under the mount point itself.

[from an earlier reply in the thread]
From your description I think that GlusterFS for block devices is actually more similar to what happens with the regular block domains. You would probably need to mount the share somewhere in the system and then use symlinks to point to the volumes. Create a regular block domain and look inside /rhev/data-center/mnt/blockSD; you'll probably get the idea of what I mean. That said, we'd need to come up with a way of extending the LVs on the gluster server when required (for thin provisioning).

[Ayal Baron]
Why? If it's exposed as a file, that probably means it supports sparseness, i.e. if this becomes a new type of block domain it should only support 'preallocated' images.

[M. Mohan Kumar]
To start using the LVs we always truncate them to the required size, which resizes the LV. I didn't get what you are saying about thin provisioning, but I have rough proof-of-concept code using dm-thin targets showing that the BD xlator can be extended to use dm-thin targets for thin provisioning.
[Itamar Heim]
So even though this is block storage, it will be extended as needed? How does that work exactly? Say I have a VM with a 100GB disk; thin provisioning means we only allocate 1GB to it, then as the guest uses that storage, we allocate more as needed (lvextend, pause guest, lvrefresh, resume guest).
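[Editor's note: for readers unfamiliar with the cycle Itamar lists, here is a minimal sketch of one extend step. The `vm` object with pause()/resume() is a hypothetical stand-in for VDSM's VM wrapper, not its real API; the LVM commands are standard LVM2 CLI.]

    import subprocess

    def extend_thin_lv(vg_name, lv_name, vm, add_gb=1):
        """Grow a guest's LV and let the guest see the new size, in the
        order listed above: lvextend, pause guest, refresh, resume guest."""
        dev = "%s/%s" % (vg_name, lv_name)
        # Allocate more space on the shared storage.
        subprocess.check_call(["lvextend", "-L", "+%dG" % add_gb, dev])
        # Pause the guest so it issues no I/O while the mapping changes.
        vm.pause()
        # Refresh the host's view of the LV (the "lvrefresh" step).
        subprocess.check_call(["lvchange", "--refresh", dev])
        vm.resume()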
Re: [vdsm] [RFC]about the implement of text-based console
On 09/04/2012 10:36 PM, Xu He Jie wrote:

On 09/04/2012 06:52 PM, Dan Kenigsberg wrote:

On Tue, Sep 04, 2012 at 03:05:37PM +0800, Xu He Jie wrote:

On 09/03/2012 10:33 PM, Dan Kenigsberg wrote:

On Thu, Aug 30, 2012 at 04:26:31PM -0500, Adam Litke wrote:

On Thu, Aug 30, 2012 at 11:32:02AM +0800, Xu He Jie wrote:

[Xu He Jie]
Hi,

I submitted a patch for a text-based console, http://gerrit.ovirt.org/#/c/7165/ — the issue I want to discuss is as follows:

1. Fixed port vs. dynamic port

Use a fixed port for all VMs' consoles and connect to a console with 'ssh vmUUID@ip -p port', distinguishing VMs by vmUUID.

In the current implementation, VDSM allocates a port for the console dynamically and spawns a sub-process when the VM is created. In the sub-process, the main thread is responsible for accepting new connections and dispatching the console output to each connection; when a new connection comes in, it creates a new thread for it. The dynamic approach allocates one port per VM out of a port range, which isn't good for firewall rules. So I got a suggestion to use a fixed port, and to connect to the console with 'ssh vmuuid@hostip -p fixedport'. This is simple for the user. We need one process to accept new connections on the fixed port and spawn a sub-process per VM as connections come in. But because the console can only be opened by one process, the main process must dispatch the console output of all VMs to all connections. So the code will be a little more complex than with dynamic ports.

So this is dynamic port vs. fixed port, and simple code vs. complex code.

[Adam Litke]
From a usability point of view, I think the fixed port suggestion is nicer. This means that a system administrator needs only to open one port to enable remote console access. If your initial implementation limits console access to one connection per VM, would that simplify the code?

[Dan Kenigsberg]
Yes, using a fixed port for all consoles of all VMs seems like a cooler idea. Besides the firewall issue, there's user experience: instead of calling getVmStats to tell the VM's port and then using ssh, only one ssh call is needed. (Taking this one step further, it would make sense to add another layer on top, directing console clients to the specific host currently running the VM.)

I did not take a close look at your implementation, and did not research this myself, but have you considered using sshd for this? I suppose you could configure sshd to collect the list of known users from `getAllVmStats`, and force it to run a command that redirects the VM's console to the ssh client. It has the potential of being a more robust implementation.

[Xu He Jie]
I have considered using sshd and an ssh tunnel. They can't implement a fixed port and a shared console.

[Dan Kenigsberg]
Would you elaborate on that? Usually sshd listens on the fixed port 22 and allows multiple users to have independent shells. What do you mean by "share console"?

[Xu He Jie]
A sharable console is like qemu's VNC: you can open multiple connections, but the picture is the same in all of them. virsh limits the console to one user, so I think making it sharable is more powerful.

Hmm... for sshd, I think I was missing something. It could be implemented using sshd in the following way: add a new system user for each VM on setVmTicket, and change that user's login program to another program that can redirect the console. To share the console among multiple connections, we need a process that redirects the console to a local unix socket; then we can copy the console's output to multiple connections. This is just in my mind; I am going to give it a try. Thanks for your suggestion!

[Xu He Jie, in the quoted message]
I gave the system sshd a try. It works.
But I think adding a system user for each VM isn't good enough. So I looked into PAM, trying to find a way to skip creating a real user in the system, but it doesn't work. Even if we could create a virtual user with PAM, we still couldn't tell sshd which user to use and which login program to run; that means sshd doesn't support this, and I didn't find any other solution, unless I missed something. I think creating users in the system isn't good: there are security implications too, and it will mess up the system configuration; we would need to be careful to clean up all the VM users.

So I'm thinking again of implementing the console server ourselves. I want to ask: is that really unsafe? We just use the ssh protocol as the transfer protocol. It isn't a real sshd. It doesn't access any system resource or shell; it can only redirect the VM's console after setVmTicket. With the current implementation we can do anything we want.

[Dan Kenigsberg]
Yes, it is completely under our control, but there are downsides, too: we have to maintain another process, and another entry point, instead of configuring a universally used, well-maintained and debugged application.

Dan.
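[Editor's note: a minimal sketch of the console-sharing scheme Xu describes — the VM's console redirected to a local unix socket, with its output copied to every connection. The socket path and the surrounding server loop are assumptions for illustration, not code from the patch.]

    import socket

    def pump_console(console_path, clients):
        """Fan one VM console stream out to all connected viewers, so
        every connection sees the same output (like qemu's shared VNC).
        console_path is a hypothetical unix socket carrying the console."""
        con = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        con.connect(console_path)
        while True:
            data = con.recv(4096)
            if not data:               # console closed
                break
            for c in list(clients):    # clients holds accepted sockets
                try:
                    c.sendall(data)
                except socket.error:
                    clients.remove(c)  # drop dead connections

The main process would accept connections on the single fixed port, authenticate them against the setVmTicket ticket, and append each accepted socket to the per-VM clients list.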
Re: [vdsm] [RFC] GlusterFS domain specific changes
On Fri, 07 Sep 2012 09:35:10 +0300, Itamar Heim <ih...@redhat.com> wrote:

On 09/07/2012 08:21 AM, M. Mohan Kumar wrote:

On Thu, 6 Sep 2012 18:59:19 -0400 (EDT), Ayal Baron <aba...@redhat.com> wrote:

[M. Mohan Kumar]
To start using the LVs we always truncate them to the required size, which resizes the LV. I didn't get what you are saying about thin provisioning, but I have rough proof-of-concept code using dm-thin targets showing that the BD xlator can be extended to use dm-thin targets for thin provisioning.

[Itamar Heim]
So even though this is block storage, it will be extended as needed? How does that work exactly? Say I have a VM with a 100GB disk; thin provisioning means we only allocate 1GB to it, then as the guest uses that storage, we allocate more as needed (lvextend, pause guest, lvrefresh, resume guest).

[M. Mohan Kumar]
When we use device=lv, we use only thick-provisioned logical volumes. If such a logical volume runs out of space in the guest, one can resize it from the client by using truncate (which results in lvresize on the server side) and then run filesystem tools in the guest to get the added space. But with the device=thin type, all LVs are thinly provisioned, and allocating space to them is taken care of by the device-mapper thin target automatically. The thin pool should have enough space to accommodate the sizing requirements.
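[Editor's note: a concrete illustration of the device=lv resize path above — growing the file over a BD-xlator mount drives lvresize on the server. The mount point and LV name are assumptions.]

    import os

    LV_FILE = "/mnt/bd-volume/lv1"  # hypothetical BD-xlator mount + LV
    NEW_SIZE = 20 * 1024 ** 3       # grow the LV to 20 GiB

    # Truncating the client-side file resizes the LV on the server; the
    # guest then runs filesystem tools (e.g. resize2fs) to use the space.
    fd = os.open(LV_FILE, os.O_RDWR)
    try:
        os.ftruncate(fd, NEW_SIZE)
    finally:
        os.close(fd)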
Re: [vdsm] [RFC] GlusterFS domain specific changes
On Fri, 07 Sep 2012 14:23:08 +0800, Shu Ming <shum...@linux.vnet.ibm.com> wrote:

On 2012-9-7 13:21, M. Mohan Kumar wrote:

On Thu, 6 Sep 2012 18:59:19 -0400 (EDT), Ayal Baron <aba...@redhat.com> wrote:

[M. Mohan Kumar, quoted from earlier in the thread]
The idea is to provide a FS interface for managing block devices. One can mount the block device Gluster volume, then create an LV and size it just by:

$ touch lv1
$ truncate -s5G lv1

Other file commands can be used to clone and snapshot LVs:

$ ln lv1 lv2 # clones
$ ln -s lv1 lv1.sn # creates snapshot

[Shu Ming]
Do we have a special reason to use ln? Why not use cp as the command to do the snapshot instead of ln?

[M. Mohan Kumar]
cp involves opening the source file in read-only mode, opening/creating the destination file in write mode, and issuing a series of reads on the source and writes to the destination until the end of the source file. But we can't apply this to logical volume copy (or clone), because when we create a logical volume we have to specify its size, and that's not possible with the above approach: open/create does not take a size parameter, so we can't create the destination LV with the required size.

If instead I use the link interface to copy LVs, VFS/FUSE/GlusterFS provides the link() interface, which takes a source file name and a destination file name. In the BD xlator's link() code, I get the size of the source LV, create the destination LV with that size, and copy the contents. This problem could be solved if we had a syscall copyfile(source, dest, size). There have been discussions in the past on a copyfile() interface, which could be made use of in this copy scenario: http://www.spinics.net/lists/linux-nfs/msg26203.html

[M. Mohan Kumar, quoted from earlier in the thread]
By enabling this feature, GlusterFS can directly export storage in a SAN. We are planning to add a feature to export LUNs as regular files as well in the future.

[Shu Ming]
IMO, the major feature of GlusterFS is to export distributed local disks to clients. If we have a SAN in the backend, the storage block devices should naturally be exported to clients already. Why do we need GlusterFS to export the block devices in the SAN?

[M. Mohan Kumar]
By enabling this feature we allow GlusterFS to work with local storage, NAS storage and SAN storage, i.e. it allows machines that are not directly connected to the SAN to access block devices from the SAN. Also, providing block devices as VM disk images has some advantages:

* it does not incur host-side filesystem overhead
* if storage arrays provide storage offload features such as FlashCopy, they can be exploited (these offloads are usually at the LUN level)
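[Editor's note: to make the link()-versus-cp point concrete, a small sketch of the clone and snapshot operations as file calls; /mnt/bd-volume is a hypothetical mount of a BD-xlator-enabled volume.]

    import os

    MOUNT = "/mnt/bd-volume"  # hypothetical BD-xlator mount
    src = os.path.join(MOUNT, "lv1")

    # link() carries both names in a single call, so the BD xlator can
    # read lv1's size, create lv2 with that size, and copy the contents.
    os.link(src, os.path.join(MOUNT, "lv2"))          # clone
    os.symlink("lv1", os.path.join(MOUNT, "lv1.sn"))  # snapshot

    # A cp-style copy cannot work here: the open/create call for the
    # destination carries no size, so the server cannot pre-allocate a
    # destination LV big enough before the reads and writes begin.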
Re: [vdsm] Change in vdsm[master]: bootstrap: perform reboot asynchronously
* Alon Bar-Lev <alo...@redhat.com> [2012-09-05 16:11]:

Alon Bar-Lev has uploaded a new change for review.

Change subject: bootstrap: perform reboot asynchronously
......................................................................

bootstrap: perform reboot asynchronously

The use of /sbin/reboot may cause reboot to be performed in the middle of script execution. Reboot should be delayed in the background so that the script has a fair chance to terminate properly.

[Ryan Harper]
So, we fork and sleep 10 seconds? Is that really what we want to do? Why is 10 seconds enough? Shouldn't deployUtil be tracking the script execution and waiting for the scripts to complete before rebooting?

Change-Id: I0abb02ae4d5033a8b9f2d468da86fcdc53e2e1c2
Signed-off-by: Alon Bar-Lev <alo...@redhat.com>
---
M vdsm_reg/deployUtil.py.in
1 file changed, 39 insertions(+), 5 deletions(-)

git pull ssh://gerrit.ovirt.org:29418/vdsm refs/changes/83/7783/1

diff --git a/vdsm_reg/deployUtil.py.in b/vdsm_reg/deployUtil.py.in
index ebc7d36..b72cb44 100644
--- a/vdsm_reg/deployUtil.py.in
+++ b/vdsm_reg/deployUtil.py.in
@@ -166,13 +166,47 @@
 def reboot():
     """
-    This function reboots the machine.
+    This function reboots the machine async.
     """
-    fReturn = True
+    fReturn = False
 
-    out, err, ret = _logExec([EX_REBOOT])
-    if ret:
-        fReturn = False
+    # Default maximum for the number of available file descriptors.
+    MAXFD = 1024
+
+    import resource  # Resource usage information.
+    maxfd = resource.getrlimit(resource.RLIMIT_NOFILE)[1]
+    if (maxfd == resource.RLIM_INFINITY):
+        maxfd = MAXFD
+
+    try:
+        pid = os.fork()
+        if pid == 0:
+            try:
+                os.setsid()
+                for fd in range(0, maxfd):
+                    try:
+                        os.close(fd)
+                    except OSError:  # ERROR, fd wasn't open to begin with (ignored)
+                        pass
+
+                os.open(os.devnull, os.O_RDWR)  # standard input (0)
+                os.dup2(0, 1)  # standard output (1)
+                os.dup2(0, 2)  # standard error (2)
+
+                if os.fork() != 0:
+                    os._exit(0)
+
+                time.sleep(10)
+                os.execl(EX_REBOOT, EX_REBOOT)
+            finally:
+                os._exit(1)
+
+        pid, status = os.waitpid(pid, 0)
+
+        if os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0:
+            fReturn = True
+    except OSError:
+        pass
 
     return fReturn

To view, visit http://gerrit.ovirt.org/7783

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com