Sorry - it's too late - all hosts have been re-imaged and are set up as local storage.
On Mon, Sep 21, 2015 at 10:38 PM, Ravishankar N <[email protected]> wrote:
> Hi Chris,
>
> Replies inline..
>
> On 09/22/2015 09:31 AM, Sahina Bose wrote:
>
> -------- Forwarded Message --------
> Subject: Re: [ovirt-users] urgent issue
> Date: Wed, 9 Sep 2015 08:31:07 -0700
> From: Chris Liebman <[email protected]>
> To: users <[email protected]>
>
> Ok - I think I'm going to switch to local storage - I've had way too many
> unexplainable issues with glusterfs :-(. Is there any reason I can't add
> local storage to the existing shared-storage cluster? I see that the menu
> item is greyed out....
>
> What version of gluster and ovirt are you using?
>
> On Tue, Sep 8, 2015 at 4:19 PM, Chris Liebman <[email protected]> wrote:
>> It's possible that this is specific to just one gluster volume... I've
>> moved a few VM disks off of that volume and am able to start them fine.
>> My recollection is that any VM started on the "bad" volume causes it to
>> be disconnected and forces the ovirt node to be marked down until
>> Maint->Activate.
>>
>> On Tue, Sep 8, 2015 at 3:52 PM, Chris Liebman <[email protected]> wrote:
>>> In attempting to put an ovirt cluster in production I'm running into
>>> some odd errors with gluster, it looks like. It's 12 hosts, each with
>>> one brick in distributed-replicate. (Actually 2 bricks, but they are
>>> separate volumes.)
>>>
> These 12 nodes in dist-rep config, are they in replica 2 or replica 3? The
> latter is what is recommended for VM use-cases. Could you give the output
> of `gluster volume info`?
>
>>> [root@ovirt-node268 glusterfs]# rpm -qa | grep vdsm
>>> vdsm-jsonrpc-4.16.20-0.el6.noarch
>>> vdsm-gluster-4.16.20-0.el6.noarch
>>> vdsm-xmlrpc-4.16.20-0.el6.noarch
>>> vdsm-yajsonrpc-4.16.20-0.el6.noarch
>>> vdsm-4.16.20-0.el6.x86_64
>>> vdsm-python-zombiereaper-4.16.20-0.el6.noarch
>>> vdsm-python-4.16.20-0.el6.noarch
>>> vdsm-cli-4.16.20-0.el6.noarch
>>>
>>> Everything was fine last week; however, today various clients in the
>>> gluster cluster seem to get "client quorum not met" periodically - when
>>> they get this they take one of the bricks offline - this causes VMs to
>>> be attempted to move - sometimes 20 at a time. That takes a long time
>>> :-(. I've tried disabling automatic migration and the VMs get paused
>>> when this happens - resuming gets nothing at that point as the volume
>>> mount on the server hosting the VM is not connected:
>>>
>>> from rhev-data-center-mnt-glusterSD-ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02.log:
>>>
>>> [2015-09-08 21:18:42.920771] W [MSGID: 108001]
>>> [afr-common.c:4043:afr_notify] 2-LADC-TBX-V02-replicate-2: Client-quorum
>>> is not met
>>>
> When client-quorum is not met (due to network disconnects, or gluster
> brick processes going down, etc.), gluster makes the volume read-only.
> This is expected behavior and prevents split-brains. It's probably a bit
> late, but do you have the gluster fuse mount logs to confirm this indeed
> was the issue?
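For anyone hitting the same "Client-quorum is not met" messages: the replica layout and quorum options Ravi refers to can be checked and tuned from the standard gluster CLI. A minimal sketch, using the LADC-TBX-V02 volume name taken from the logs above (run on any node in the trusted pool); option behavior varies by gluster release, so treat this as illustrative rather than a recommendation for this particular cluster:

    # Show replica count, brick list and any reconfigured options
    gluster volume info LADC-TBX-V02

    # Client-side quorum: 'auto' requires a majority of the bricks in a replica
    # set (or exactly half including the first brick) to be reachable for writes.
    # This is why replica 3 is the recommended layout for VM storage - replica 2
    # still goes read-only whenever the first brick of a pair is unreachable.
    gluster volume set LADC-TBX-V02 cluster.quorum-type auto
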
>>> [2015-09-08 21:18:42.931751] I [fuse-bridge.c:4900:fuse_thread_proc]
>>> 0-fuse: unmounting
>>> /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
>>>
>>> [2015-09-08 21:18:42.931836] W [glusterfsd.c:1219:cleanup_and_exit]
>>> (-->/lib64/libpthread.so.0(+0x7a51) [0x7f1bebc84a51]
>>> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xcd) [0x405e4d]
>>> -->/usr/sbin/glusterfs(cleanup_and_exit+0x65) [0x4059b5] )
>>> 0-: received signum (15), shutting down
>>>
>>> [2015-09-08 21:18:42.931858] I [fuse-bridge.c:5595:fini] 0-fuse:
>>> Unmounting
>>> '/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02'.
>>>
> The VM pause you saw could be because of the unmount. I understand that a
> fix (https://gerrit.ovirt.org/#/c/40240/) went in for ovirt 3.6
> (vdsm-4.17) to prevent vdsm from unmounting the gluster volume when vdsm
> exits/restarts.
> Is it possible to run a test setup on 3.6 and see if this is still
> happening?
>
>>> And the mount is broken at that point:
>>>
>>> [root@ovirt-node267 ~]# df
>>> df: `/rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02':
>>> Transport endpoint is not connected
>>>
> Yes, because it received a SIGTERM above.
>
> Thanks,
> Ravi
>
>>> Filesystem                                              1K-blocks       Used  Available Use% Mounted on
>>> /dev/sda3                                                51475068    1968452   46885176   5% /
>>> tmpfs                                                   132210244          0  132210244   0% /dev/shm
>>> /dev/sda2                                                  487652      32409     429643   8% /boot
>>> /dev/sda1                                                  204580        260     204320   1% /boot/efi
>>> /dev/sda5                                              1849960960  156714056 1599267616   9% /data1
>>> /dev/sdb1                                              1902274676   18714468 1786923588   2% /data2
>>> ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V01  9249804800  727008640 8052899712   9% /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V01
>>> ovirt-node251.la.taboolasyndication.com:/LADC-TBX-V03  1849960960      73728 1755907968   1% /rhev/data-center/mnt/glusterSD/ovirt-node251.la.taboolasyndication.com:_LADC-TBX-V03
>>>
>>> The fix for that is to put the server in maintenance mode then activate
>>> it again. But all VMs need to be migrated or stopped for that to work.
>>>
>>> I'm not seeing any obvious network or disk errors......
>>>
>>> Are there configuration options I'm missing?
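On the "Transport endpoint is not connected" state: once the fuse client process has exited, the mount point stays dead until the volume is remounted, which is effectively what the maintenance/activate cycle does through vdsm. Purely as a diagnostic sketch (vdsm normally owns these mounts, and the mount source below is inferred from the mount point shown in the df output), the mount can also be recovered by hand on a host that has no running VMs using it:

    # Drop the stale fuse mount, then remount the volume
    umount -l /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
    mount -t glusterfs ovirt-node268.la.taboolasyndication.com:/LADC-TBX-V02 \
        /rhev/data-center/mnt/glusterSD/ovirt-node268.la.taboolasyndication.com:_LADC-TBX-V02
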

