[ovirt-users] Gluster rebuild: request suggestions (poor IO performance)

2019-05-21 Thread Jim Kusznir
Hi:

I've been having one heck of a time with disk IO performance for nearly
the entire time I've been running ovirt.  I've tried a variety of things,
I've posted to this list for help several times, and it sounds like in most
cases the problems are due to design decisions and such.

My cluster has been devolving into nearly unusable performance, and I
believe it's mostly disk IO related.  I'm currently using FreeNAS as my
primary VM storage (via NFS), but now it too is performing slowly (it
started out reasonable, but slowly degraded for unknown reasons).

I'm ready to switch back to gluster if I can get specific recommendations
as to what I need to do to make it work.  I feel like I've been trying
random things, and sinking money into this to try and make it work, but
nothing has really fixed the problem.

I have 3 Dell R610 servers with 750GB SSDs as their primary drive.  I had
also used some Seagate SSHDs behind the internal Dell DRAC raid controller
(configured to pass each through as a single-disk volume, though still not
really JBOD), but the controller started silently failing them, causing
major issues for gluster.  I think the DRAC just doesn't like those HDDs.

I can put some real spinning disks in; perhaps a RAID-1 pair of 2TB?  These
servers only take 2.5" hdd's, so that greatly limits my options.

I'm sure others out there are using Dell R610 servers...what do you use
for storage?  How does it perform?  What do I need to do to get this
cluster actually usable again?  Are PERC-6i storage controllers usable?
I'm not even sure where to go troubleshooting now...everything is so slow.

BTW: I had a small data volume on the SSDs, and the gluster performance on
those was pretty poor.  Performance of the hosted engine, which is still on
the SSDs, is also pretty poor.
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/IGR3RDAKQYXSPGAQCHWS5SGKOYA4QKJY/


[ovirt-users] Poor I/O Performance (again...)

2019-04-14 Thread Jim Kusznir
Hi all:

I've had I/O performance problems pretty much since the beginning of using
oVirt.  I've applied several upgrades as time went on, but strangely, none
of them have alleviated the problem.  VM disk I/O is still very slow to the
point that running VMs is often painful; it notably affects nearly all my
VMs, and makes me leery of starting any more.  I'm currently running 12 VMs
and the hosted engine on the stack.

My configuration started out with 1Gbps networking and hyperconverged
gluster running on a single SSD on each node.  It worked, but I/O was
painfully slow.  I also started running out of space, so I added an SSHD on
each node, created another gluster volume, and moved VMs over to it.  I
also ran that on a dedicated 1Gbps network.  I had recurring disk failures
(seems that disks only lasted about 3-6 months; I warrantied all three at
least once, and some twice before giving up).  I suspect the Dell PERC 6/i
was partly to blame; the raid card refused to see/acknowledge the disk, but
plugging it into a normal PC showed no signs of problems.  In any case,
performance on that storage was notably bad, even though the gig-e
interface was rarely taxed.

I put in 10Gbps ethernet and moved all the storage onto it nonetheless,
as several people here said that 1Gbps just wasn't fast enough.  Some
aspects improved a bit, but disk I/O is still slow.  And I was still having
problems with the SSHD data gluster volume eating disks, so I bought a
dedicated NAS (a supermicro 12-disk FreeNAS NFS storage system on 10Gbps
ethernet) and set it up.  I found that it was actually FASTER than the
SSD-based gluster volume, but still slow.  Lately it's been getting slower,
too...I don't know why.  The FreeNAS server reports network loads around
4MB/s on its 10GbE interface, so it's not network constrained.  At 4MB/s,
I'd sure hope the 12-spindle SAS backend wasn't constrained either.  (And
disk I/O operations on the NAS itself complete much faster.)

So, running a test on my NAS against an ISO file I haven't accessed in
months:

 # dd
if=en_windows_server_2008_r2_standard_enterprise_datacenter_and_web_x64_dvd_x15-59754.iso
of=/dev/null bs=1024k count=500

500+0 records in
500+0 records out
524288000 bytes transferred in 2.459501 secs (213168465 bytes/sec)

Running it on one of my hosts:
root@unifi:/home/kusznir# time dd if=/dev/sda of=/dev/null bs=1024k
count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 7.21337 s, 72.7 MB/s

(I don't know if this is a true apples-to-apples comparison, as I don't
have a large file inside this VM's image.)  That's roughly 213 MB/s on the
NAS versus 72.7 MB/s in the VM, and even the latter is faster than I often
see.
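For what it's worth, dd against a file it can cache measures buffered reads; a somewhat fairer comparison bypasses the page cache with O_DIRECT, and fio also reports latency, which matters more than raw throughput for VM workloads.  This is only a sketch (the file names are placeholders, and fio may need to be installed):

```shell
# Sequential read with the page cache bypassed (O_DIRECT); the file
# name is a placeholder for any large file on the NAS mount:
dd if=big_test_file.iso of=/dev/null bs=1M count=500 iflag=direct

# The same test inside a VM against its virtual disk:
dd if=/dev/sda of=/dev/null bs=1M count=500 iflag=direct

# fio reports latency as well as throughput (mixed 4k random I/O
# against a 1GB scratch file for 30 seconds):
fio --name=vmtest --filename=fio_test_file --size=1G --rw=randrw \
    --bs=4k --ioengine=libaio --direct=1 --runtime=30 --time_based
```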

I have a VoIP Phone server running as a VM.  Voicemail and other recordings
usually fail due to IO issues opening and writing the files.  Often, the
first 4 or so seconds of the recording is missed; sometimes the entire
thing just fails.  I didn't use to have this problem, but it's definitely
been getting worse.  I finally bit the bullet and ordered a physical server
dedicated for my VoIP System...But I still want to figure out why I'm
having all these IO problems.  I read on the list of people running 30+
VMs...I feel that my IO can't take any more VMs with any semblance of
reliability.  We have a Quickbooks server on here too (windows), and the
performance is abysmal; my CPA is charging me extra because of all the lost
staff time waiting on the system to respond and generate reports.

I'm at my wits' end...I started with gluster on SSD with a 1Gbps network,
migrated to a 10Gbps network, and now to a dedicated high-performance NAS
box over NFS, and I still have performance issues.  I don't know how to
troubleshoot the issue any further, but I've never had these kinds of
issues when I was playing with other VM technologies.  I'd like to get to
the point where I can resell virtual servers to customers, but I can't do
so with my current performance levels.

I'd greatly appreciate help troubleshooting this further.

--Jim
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/ZR64VABNT2SGKLNP3XNTHCGFZXSOJAQF/


[ovirt-users] Ovirt Host Replacement/Rebuild

2019-04-05 Thread Jim Kusznir
Hi all:

I had an unplanned power outage (the generator failed to start, and the
power failure lasted 3 min longer than the UPS batteries).  One node didn't
survive it.

By that, I mean it kernel panics on boot, and I haven't been able to
capture the panic or its first part (just the end), so I don't truly know
what the root cause is.  I have validated that the hardware is fine, so
it's got to be OS corruption.

Based on this, I was thinking that perhaps the easiest way to recover would
simply be to delete the host from the cluster, reformat and reinstall this
host, and then add it back to the cluster as a new host.  Is this in fact a
good idea?  Are there any references to how to do this (the detailed steps
so I don't mess it up)?

My cluster is (was) a 3-node hyperconverged cluster, with gluster used for
the management (hosted engine) storage.  I also have a gluster share for
VMs, but I use an NFS share from a NAS for that (which I will ask about in
another post).

Thanks for the help!
--Jim
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/Y66A6Q3NOGD3BCQ4UVAZK5ATS4ZFPVYV/


[ovirt-users] Re: Upgraded host, engine now won't boot

2018-09-03 Thread Jim Kusznir
Ok, finally got it...I had to get a terminal ready with the virsh command,
guess what the instance number was, and then run suspend right after
starting with --vm-start-paused.  Got it to really be paused, got into the
console, booted the old kernel, and have now been repairing a bad yum
transaction...I *think* I've finished that.
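Concretely, the sequence that worked for me was roughly the following (the domain name HostedEngine and the qemu:///system URI are what my hosts use; both are assumptions for anyone else's setup):

```shell
# Start the hosted engine paused; on my setup something un-pauses it
# almost immediately, so race it from a second terminal.
hosted-engine --vm-start-paused

# In a second terminal, as quickly as possible ("HostedEngine" is the
# usual libvirt domain name for the engine VM):
virsh -c qemu:///system suspend HostedEngine

# With VNC connected (see hosted-engine --add-console-password), resume
# and watch the boot, picking the old kernel at the grub menu:
virsh -c qemu:///system resume HostedEngine
```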

So, if I understand correctly, after the yum update, I should run
engine-setup?  Do I run that inside the engine VM, or on the host it's
running on?

BTW: I did look up upgrade procedures on the documentation for the
release.  It links to two or three levels of other documents, then ends in
an error 404.

--Jim

On Mon, Sep 3, 2018 at 6:39 PM, Jim Kusznir  wrote:

> Global maintenance mode is already on.  hosted-engine --vm-start-paused
> results in a non-paused VM being started.  Of course, this is executed
> after hosted-engine --vm-poweroff and suitable time left to let things shut
> down.
>
> I just ran another test, and did in fact see the engine was briefly
> paused, but then was quickly put in the running state.  I don't know by
> what, though.  Global maintenance mode is definitely enabled; every run of
> the hosted-engine command reminds me!
>
>
>
>
>
> On Mon, Sep 3, 2018 at 11:12 AM, Darrell Budic 
> wrote:
>
>> Don’t know if there’s anything special, it’s been a while since I’ve
>> needed to start it in paused mode. Try putting it in HA maintenance mode
>> from the CLI and then start it in paused mode maybe?
>>
>> --
>> *From:* Jim Kusznir 
>> *Subject:* Re: [ovirt-users] Upgraded host, engine now won't boot
>> *Date:* September 3, 2018 at 1:08:27 PM CDT
>>
>> *To:* Darrell Budic
>> *Cc:* users
>>
>> Unfortunately, I seem unable to get connected to the console early enough
>> to actually see a kernel list.
>>
>> I've tried the hosted-engine --vm-start-paused command, but it just
>> starts it (running mode, not paused).  By the time I can get vnc connected,
>> I have just that last line.  ctrl-alt-del doesn't do anything with it,
>> either.  sending a reset through virsh seems to just kill the VM (it
>> doesn't respawn).
>>
>> HA seems to have some trouble with this too...Originally I allowed HA to
>> start it, and it would take it a good long while before it gave up on the
>> engine and reset it.  It instantly booted to the same crashed state, and
>> again waited a "good long while" (sorry, never timed it, but I know it was
>> >5 min).
>>
>> My current thought is that I need to get the engine started in paused
>> mode, connect vnc, then unpause it with virsh to catch what is happening.
>> Is there any magic to getting it started in paused mode?
>>
>> On Mon, Sep 3, 2018 at 11:03 AM, Darrell Budic 
>> wrote:
>>
>>> Send it a ctrl-alt-delete and see what happens. Possibly try an older
>>> kernel at the grub boot menu. Could also try stopping it with hosted-engine
>>> —vm-stop and let HA reboot it, see if it boots or get onto the console
>>> quickly and try and watch more of the boot.
>>>
>>> Ssh and yum upgrade is fine for the OS, although it’s a good idea to
>>> enable Global HA Maintenance first so the HA watchdogs don’t reboot it in
>>> the middle of that. After that, run “engine-setup” again, at least if there
>>> are new ovirt engine updates to be done. Then disable Global HA
>>> Maintenance, and run "shutdown -h now” to stop the Engine VM (rebooting
>>> seems to cause it to exit anyway, HA seems to run it as a single execution
>>> VM. Or at least in the past, it seems to quit anyway on me and shutdown
>>> triggered HA faster). Wait a few minutes, and HA will respawn it on a new
>>> instance and you can log into your engine again.
>>>
>>> --
>>> *From:* Jim Kusznir 
>>> *Subject:* Re: [ovirt-users] Upgraded host, engine now won't boot
>>> *Date:* September 3, 2018 at 12:45:22 PM CDT
>>> *To:* Darrell Budic
>>> *Cc:* users
>>>
>>>
>>> Thanks to Jayme who pointed me to the --add-console-password
>>> hosted-engine command to set a password for vnc.  Using that, I see only
>>> the single line:
>>>
>>> Probing EDD (edd=off to disable)... ok
>>>
>>> --Jim
>>>
>>> On Mon, Sep 3, 2018 at 10:26 AM, Jim Kusznir 
>>> wrote:
>>>
>>>> Is there a way to get a graphical console on boot of the engine vm so I
>>>> can see what's causing the failure to boot?
>>>>
>>>> On Mon, Sep 3, 2018 at 10:23 AM, Jim Kusz

[ovirt-users] Re: Upgraded host, engine now won't boot

2018-09-03 Thread Jim Kusznir
Global maintenance mode is already on.  hosted-engine --vm-start-paused
results in a non-paused VM being started.  Of course, this is executed
after hosted-engine --vm-poweroff and suitable time left to let things shut
down.

I just ran another test, and did in fact see the engine was briefly paused,
but then was quickly put in the running state.  I don't know by what,
though.  Global maintenance mode is definitely enabled; every run of the
hosted-engine command reminds me!





On Mon, Sep 3, 2018 at 11:12 AM, Darrell Budic 
wrote:

> Don’t know if there’s anything special, it’s been a while since I’ve
> needed to start it in paused mode. Try putting it in HA maintenance mode
> from the CLI and then start it in paused mode maybe?
>
> ------
> *From:* Jim Kusznir 
> *Subject:* Re: [ovirt-users] Upgraded host, engine now won't boot
> *Date:* September 3, 2018 at 1:08:27 PM CDT
>
> *To:* Darrell Budic
> *Cc:* users
>
> Unfortunately, I seem unable to get connected to the console early enough
> to actually see a kernel list.
>
> I've tried the hosted-engine --vm-start-paused command, but it just starts
> it (running mode, not paused).  By the time I can get vnc connected, I have
> just that last line.  ctrl-alt-del doesn't do anything with it, either.
> sending a reset through virsh seems to just kill the VM (it doesn't
> respawn).
>
> HA seems to have some trouble with this too...Originally I allowed HA to
> start it, and it would take it a good long while before it gave up on the
> engine and reset it.  It instantly booted to the same crashed state, and
> again waited a "good long while" (sorry, never timed it, but I know it was
> >5 min).
>
> My current thought is that I need to get the engine started in paused
> mode, connect vnc, then unpause it with virsh to catch what is happening.
> Is there any magic to getting it started in paused mode?
>
> On Mon, Sep 3, 2018 at 11:03 AM, Darrell Budic 
> wrote:
>
>> Send it a ctrl-alt-delete and see what happens. Possibly try an older
>> kernel at the grub boot menu. Could also try stopping it with hosted-engine
>> —vm-stop and let HA reboot it, see if it boots or get onto the console
>> quickly and try and watch more of the boot.
>>
>> Ssh and yum upgrade is fine for the OS, although it’s a good idea to
>> enable Global HA Maintenance first so the HA watchdogs don’t reboot it in
>> the middle of that. After that, run “engine-setup” again, at least if there
>> are new ovirt engine updates to be done. Then disable Global HA
>> Maintenance, and run "shutdown -h now” to stop the Engine VM (rebooting
>> seems to cause it to exit anyway, HA seems to run it as a single execution
>> VM. Or at least in the past, it seems to quit anyway on me and shutdown
>> triggered HA faster). Wait a few minutes, and HA will respawn it on a new
>> instance and you can log into your engine again.
>>
>> --
>> *From:* Jim Kusznir 
>> *Subject:* Re: [ovirt-users] Upgraded host, engine now won't boot
>> *Date:* September 3, 2018 at 12:45:22 PM CDT
>> *To:* Darrell Budic
>> *Cc:* users
>>
>>
>> Thanks to Jayme who pointed me to the --add-console-password
>> hosted-engine command to set a password for vnc.  Using that, I see only
>> the single line:
>>
>> Probing EDD (edd=off to disable)... ok
>>
>> --Jim
>>
>> On Mon, Sep 3, 2018 at 10:26 AM, Jim Kusznir  wrote:
>>
>>> Is there a way to get a graphical console on boot of the engine vm so I
>>> can see what's causing the failure to boot?
>>>
>>> On Mon, Sep 3, 2018 at 10:23 AM, Jim Kusznir 
>>> wrote:
>>>
>>>> Thanks; I guess I didn't mention that I started there.
>>>>
>>>> The virsh list shows it in state running, and gluster is showing fully
>>>> online and healed.  However, I cannot bring up a console of the engine VM
>>>> to see why it's not booting, even though it shows in running state.
>>>>
>>>> In any case, the hosts and engine were running happily.  I applied the
>>>> latest updates on the host, and the engine went unstable.  I thought, Ok,
>>>> maybe there's an update to ovirt that also needs to be applied to the
>>>> engine, so I ssh'ed in and ran yum update (never did find clear
>>>> instructions on how one is supposed to maintain the engine, but I did see
>>>> that listed online).  A while later, it reset and never booted again.
>>>>
>>>> -Jim
>>>>
>>>> On Sun, Sep 2, 2018 at 4:28 PM, Darrell Budic 
>>>> wrote

[ovirt-users] Re: Upgraded host, engine now won't boot

2018-09-03 Thread Jim Kusznir
Unfortunately, I seem unable to get connected to the console early enough
to actually see a kernel list.

I've tried the hosted-engine --vm-start-paused command, but it just starts
it (running mode, not paused).  By the time I can get vnc connected, I have
just that last line.  ctrl-alt-del doesn't do anything with it, either.
sending a reset through virsh seems to just kill the VM (it doesn't
respawn).

HA seems to have some trouble with this too...Originally I allowed HA to
start it, and it would take it a good long while before it gave up on the
engine and reset it.  It instantly booted to the same crashed state, and
again waited a "good long while" (sorry, never timed it, but I know it was
>5 min).

My current thought is that I need to get the engine started in paused mode,
connect vnc, then unpause it with virsh to catch what is happening.  Is
there any magic to getting it started in paused mode?

On Mon, Sep 3, 2018 at 11:03 AM, Darrell Budic 
wrote:

> Send it a ctrl-alt-delete and see what happens. Possibly try an older
> kernel at the grub boot menu. Could also try stopping it with hosted-engine
> —vm-stop and let HA reboot it, see if it boots or get onto the console
> quickly and try and watch more of the boot.
>
> Ssh and yum upgrade is fine for the OS, although it’s a good idea to
> enable Global HA Maintenance first so the HA watchdogs don’t reboot it in
> the middle of that. After that, run “engine-setup” again, at least if there
> are new ovirt engine updates to be done. Then disable Global HA
> Maintenance, and run "shutdown -h now” to stop the Engine VM (rebooting
> seems to cause it to exit anyway, HA seems to run it as a single execution
> VM. Or at least in the past, it seems to quit anyway on me and shutdown
> triggered HA faster). Wait a few minutes, and HA will respawn it on a new
> instance and you can log into your engine again.
>
> --
> *From:* Jim Kusznir 
> *Subject:* Re: [ovirt-users] Upgraded host, engine now won't boot
> *Date:* September 3, 2018 at 12:45:22 PM CDT
> *To:* Darrell Budic
> *Cc:* users
>
>
> Thanks to Jayme who pointed me to the --add-console-password hosted-engine
> command to set a password for vnc.  Using that, I see only the single line:
>
> Probing EDD (edd=off to disable)... ok
>
> --Jim
>
> On Mon, Sep 3, 2018 at 10:26 AM, Jim Kusznir  wrote:
>
>> Is there a way to get a graphical console on boot of the engine vm so I
>> can see what's causing the failure to boot?
>>
>> On Mon, Sep 3, 2018 at 10:23 AM, Jim Kusznir  wrote:
>>
>>> Thanks; I guess I didn't mention that I started there.
>>>
>>> The virsh list shows it in state running, and gluster is showing fully
>>> online and healed.  However, I cannot bring up a console of the engine VM
>>> to see why it's not booting, even though it shows in running state.
>>>
>>> In any case, the hosts and engine were running happily.  I applied the
>>> latest updates on the host, and the engine went unstable.  I thought, Ok,
>>> maybe there's an update to ovirt that also needs to be applied to the
>>> engine, so I ssh'ed in and ran yum update (never did find clear
>>> instructions on how one is supposed to maintain the engine, but I did see
>>> that listed online).  A while later, it reset and never booted again.
>>>
>>> -Jim
>>>
>>> On Sun, Sep 2, 2018 at 4:28 PM, Darrell Budic 
>>> wrote:
>>>
>>>> It’s definitely not starting, you’ll have to see if you can figure out
>>>> why. A couple things to try:
>>>>
>>>> - Check "virsh list" and see if it’s running, or paused for storage.
>>>> (google "virsh saslpasswd2" if you need to add a user to do this
>>>> with, it’s per host)
>>>> - It’s hyperconverged, so check your gluster volume for healing
>>>> and/or split-brains and wait/resolve those.
>>>> - Check “gluster peer status” on each host and make sure your
>>>> gluster hosts are all talking. I’ve seen an upgrade screw up the firewall;
>>>> an easy fix is to add a rule to allow the hosts to talk to each other on your
>>>> gluster network, no questions asked (-j ACCEPT, no port, etc).
>>>>
>>>> Good luck!
>>>>
>>>> --
>>>> *From:* Jim Kusznir 
>>>> *Subject:* [ovirt-users] Upgraded host, engine now won't boot
>>>> *Date:* September 1, 2018 at 8:38:12 PM CDT
>>>> *To:* users
>>>>
>>>> He

[ovirt-users] Re: Upgraded host, engine now won't boot

2018-09-03 Thread Jim Kusznir
Thanks to Jayme who pointed me to the --add-console-password hosted-engine
command to set a password for vnc.  Using that, I see only the single line:

Probing EDD (edd=off to disable)... ok
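For anyone reading this in the archives, connecting went roughly like this (the host name is a placeholder, and the port is an assumption based on the engine VM typically being the first VNC display, :0, on its host):

```shell
# On the host currently running the engine VM, set a temporary VNC
# password (it prompts interactively):
hosted-engine --add-console-password

# Then connect with any VNC client; display :0 corresponds to TCP
# port 5900 (replace ovirt1.example.com with the actual host):
remote-viewer vnc://ovirt1.example.com:5900
```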

--Jim

On Mon, Sep 3, 2018 at 10:26 AM, Jim Kusznir  wrote:

> Is there a way to get a graphical console on boot of the engine vm so I
> can see what's causing the failure to boot?
>
> On Mon, Sep 3, 2018 at 10:23 AM, Jim Kusznir  wrote:
>
>> Thanks; I guess I didn't mention that I started there.
>>
>> The virsh list shows it in state running, and gluster is showing fully
>> online and healed.  However, I cannot bring up a console of the engine VM
>> to see why it's not booting, even though it shows in running state.
>>
>> In any case, the hosts and engine were running happily.  I applied the
>> latest updates on the host, and the engine went unstable.  I thought, Ok,
>> maybe there's an update to ovirt that also needs to be applied to the
>> engine, so I ssh'ed in and ran yum update (never did find clear
>> instructions on how one is supposed to maintain the engine, but I did see
>> that listed online).  A while later, it reset and never booted again.
>>
>> -Jim
>>
>> On Sun, Sep 2, 2018 at 4:28 PM, Darrell Budic 
>> wrote:
>>
>>> It’s definitely not starting, you’ll have to see if you can figure out
>>> why. A couple things to try:
>>>
>>> - Check "virsh list" and see if it’s running, or paused for storage.
>>> (google "virsh saslpasswd2" if you need to add a user to do this
>>> with, it’s per host)
>>> - It’s hyperconverged, so check your gluster volume for healing and/or
>>> split-brains and wait/resolve those.
>>> - Check “gluster peer status” on each host and make sure your
>>> gluster hosts are all talking. I’ve seen an upgrade screw up the firewall;
>>> an easy fix is to add a rule to allow the hosts to talk to each other on your
>>> gluster network, no questions asked (-j ACCEPT, no port, etc).
>>>
>>> Good luck!
>>>
>>> --
>>> *From:* Jim Kusznir 
>>> *Subject:* [ovirt-users] Upgraded host, engine now won't boot
>>> *Date:* September 1, 2018 at 8:38:12 PM CDT
>>> *To:* users
>>>
>>> Hello:
>>>
>>> I saw that there were updates to my ovirt-4.2 3 node hyperconverged
>>> system, so I proceeded to apply them the usual way through the UI.
>>>
>>> At one point, the hosted engine was migrated to one of the upgraded
>>> hosts, and then went "unstable" on me.  Now, the hosted engine appears to
>>> be crashed:  It gets powered up, but it never boots up to the point where
>>> it responds to pings or allows logins.  After a while, the hosted engine
>>> shows status (via console "hosted-engine --vm-status" command) "Powering
>>> Down".  It stays there for a long time.
>>>
>>> I tried forcing a poweroff then powering it on, but again, it never gets
>>> up to where it will respond to pings.  --vm-status shows bad health, but up.
>>>
>>> I tried running the hosted-engine --console command, but got:
>>>
>>> [root@ovirt1 ~]# hosted-engine --console
>>> The engine VM is running on this host
>>> Connected to domain HostedEngine
>>> Escape character is ^]
>>> error: internal error: cannot find character device 
>>>
>>> [root@ovirt1 ~]#
>>>
>>>
>>> I tried to run the hosted-engine --upgrade-appliance command, but it
>>> hangs at obtaining certificate (understandably, as the hosted-engine is not
>>> up).
>>>
>>> How do I recover from this?  And what caused this?
>>>
>>> --Jim
>>> ___
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/XBNOOF4OA5C5AFGCT3KGUPUTRSOLIPXX/
>>>
>>>
>>>
>>
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/C62LXGEOGRDWCEZ6XWN3YUSGS32IPROS/


[ovirt-users] Re: Upgraded host, engine now won't boot

2018-09-03 Thread Jim Kusznir
Is there a way to get a graphical console on boot of the engine vm so I can
see what's causing the failure to boot?

On Mon, Sep 3, 2018 at 10:23 AM, Jim Kusznir  wrote:

> Thanks; I guess I didn't mention that I started there.
>
> The virsh list shows it in state running, and gluster is showing fully
> online and healed.  However, I cannot bring up a console of the engine VM
> to see why it's not booting, even though it shows in running state.
>
> In any case, the hosts and engine were running happily.  I applied the
> latest updates on the host, and the engine went unstable.  I thought, Ok,
> maybe there's an update to ovirt that also needs to be applied to the
> engine, so I ssh'ed in and ran yum update (never did find clear
> instructions on how one is supposed to maintain the engine, but I did see
> that listed online).  A while later, it reset and never booted again.
>
> -Jim
>
> On Sun, Sep 2, 2018 at 4:28 PM, Darrell Budic 
> wrote:
>
>> It’s definitely not starting, you’ll have to see if you can figure out
>> why. A couple things to try:
>>
>> - Check "virsh list" and see if it’s running, or paused for storage.
>> (google "virsh saslpasswd2" if you need to add a user to do this
>> with, it’s per host)
>> - It’s hyperconverged, so check your gluster volume for healing and/or
>> split-brains and wait/resolve those.
>> - Check “gluster peer status” on each host and make sure your gluster
>> hosts are all talking. I’ve seen an upgrade screw up the firewall; an easy
>> fix is to add a rule to allow the hosts to talk to each other on your gluster
>> network, no questions asked (-j ACCEPT, no port, etc).
>>
>> Good luck!
>>
>> --
>> *From:* Jim Kusznir 
>> *Subject:* [ovirt-users] Upgraded host, engine now won't boot
>> *Date:* September 1, 2018 at 8:38:12 PM CDT
>> *To:* users
>>
>> Hello:
>>
>> I saw that there were updates to my ovirt-4.2 3 node hyperconverged
>> system, so I proceeded to apply them the usual way through the UI.
>>
>> At one point, the hosted engine was migrated to one of the upgraded
>> hosts, and then went "unstable" on me.  Now, the hosted engine appears to
>> be crashed:  It gets powered up, but it never boots up to the point where
>> it responds to pings or allows logins.  After a while, the hosted engine
>> shows status (via console "hosted-engine --vm-status" command) "Powering
>> Down".  It stays there for a long time.
>>
>> I tried forcing a poweroff then powering it on, but again, it never gets
>> up to where it will respond to pings.  --vm-status shows bad health, but up.
>>
>> I tried running the hosted-engine --console command, but got:
>>
>> [root@ovirt1 ~]# hosted-engine --console
>> The engine VM is running on this host
>> Connected to domain HostedEngine
>> Escape character is ^]
>> error: internal error: cannot find character device 
>>
>> [root@ovirt1 ~]#
>>
>>
>> I tried to run the hosted-engine --upgrade-appliance command, but it
>> hangs at obtaining certificate (understandably, as the hosted-engine is not
>> up).
>>
>> How do I recover from this?  And what caused this?
>>
>> --Jim
>>
>>
>>
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/WKPMXVUYM5AAD7KYAYLB4DJ4NYGKXZFE/


[ovirt-users] Re: Upgraded host, engine now won't boot

2018-09-03 Thread Jim Kusznir
Thanks; I guess I didn't mention that I started there.

The virsh list shows it in state running, and gluster is showing fully
online and healed.  However, I cannot bring up a console of the engine VM
to see why it's not booting, even though it shows in running state.

In any case, the hosts and engine were running happily.  I applied the
latest updates on the host, and the engine went unstable.  I thought, Ok,
maybe there's an update to ovirt that also needs to be applied to the
engine, so I ssh'ed in and ran yum update (never did find clear
instructions on how one is supposed to maintain the engine, but I did see
that listed online).  A while later, it reset and never booted again.

-Jim

On Sun, Sep 2, 2018 at 4:28 PM, Darrell Budic 
wrote:

> It’s definitely not starting, you’ll have to see if you can figure out
> why. A couple things to try:
>
> - Check "virsh list" and see if it’s running, or paused for storage.
> (google "virsh saslpasswd2" if you need to add a user to do this
> with, it’s per host)
> - It’s hyperconverged, so check your gluster volume for healing and/or
> split-brains and wait/resolve those.
> - Check “gluster peer status” on each host and make sure your gluster
> hosts are all talking. I’ve seen an upgrade screw up the firewall; an easy
> fix is to add a rule to allow the hosts to talk to each other on your gluster
> network, no questions asked (-j ACCEPT, no port, etc).
>
> Good luck!
>
> --
> *From:* Jim Kusznir 
> *Subject:* [ovirt-users] Upgraded host, engine now won't boot
> *Date:* September 1, 2018 at 8:38:12 PM CDT
> *To:* users
>
> Hello:
>
> I saw that there were updates to my ovirt-4.2 3 node hyperconverged
> system, so I proceeded to apply them the usual way through the UI.
>
> At one point, the hosted engine was migrated to one of the upgraded hosts,
> and then went "unstable" on me.  Now, the hosted engine appears to be
> crashed:  It gets powered up, but it never boots up to the point where it
> responds to pings or allows logins.  After a while, the hosted engine shows
> status (via console "hosted-engine --vm-status" command) "Powering Down".
> It stays there for a long time.
>
> I tried forcing a poweroff then powering it on, but again, it never gets
> up to where it will respond to pings.  --vm-status shows bad health, but up.
>
> I tried running the hosted-engine --console command, but got:
>
> [root@ovirt1 ~]# hosted-engine --console
> The engine VM is running on this host
> Connected to domain HostedEngine
> Escape character is ^]
> error: internal error: cannot find character device 
>
> [root@ovirt1 ~]#
>
>
> I tried to run the hosted-engine --upgrade-appliance command, but it hangs
> at obtaining certificate (understandably, as the hosted-engine is not up).
>
> How do I recover from this?  And what caused this?
>
> --Jim
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/XBNOOF4OA5C5AFGCT3KGUPUTRSOLIPXX/
>
>
>
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/25AE7XZBRFCWG3HKZBGAC2KBXDZLBOC2/


[ovirt-users] Upgraded host, engine now won't boot

2018-09-01 Thread Jim Kusznir
Hello:

I saw that there were updates to my ovirt-4.2 3 node hyperconverged system,
so I proceeded to apply them the usual way through the UI.

At one point, the hosted engine was migrated to one of the upgraded hosts,
and then went "unstable" on me.  Now, the hosted engine appears to be
crashed:  It gets powered up, but it never boots up to the point where it
responds to pings or allows logins.  After a while, the hosted engine shows
status (via console "hosted-engine --vm-status" command) "Powering Down".
It stays there for a long time.

I tried forcing a poweroff then powering it on, but again, it never gets up
to where it will respond to pings.  --vm-status shows bad health, but up.

I tried running the hosted-engine --console command, but got:

[root@ovirt1 ~]# hosted-engine --console
The engine VM is running on this host
Connected to domain HostedEngine
Escape character is ^]
error: internal error: cannot find character device 

[root@ovirt1 ~]#


I tried to run the hosted-engine --upgrade-appliance command, but it hangs
at obtaining certificate (understandably, as the hosted-engine is not up).

How do I recover from this?  And what caused this?

--Jim
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/XBNOOF4OA5C5AFGCT3KGUPUTRSOLIPXX/


[ovirt-users] Data Recovery from snapshot

2018-07-30 Thread Jim Kusznir
Hi:

With yet another gluster disk failure / gluster collapse, it appears I lost
the "main" backing image for one of my vm servers.  I have snapshots still
intact (or at least, they appear to be), but the main image is gone.

The main server process stores a backup at regular intervals in its disk,
and that would have been changed data, so it would be in the snapshot
rather than the base image.  Is there any way to recover this one .tar.gz
file from the snapshot with the missing main image?  This is a backup of
dynamic data, and without it, I will have lost several customers' data, some
of which cannot be recreated/regenerated.

It also appears that my backup (gluster geo-replication) did not work (it
had crashed a while ago, and has a very old backup of this image).
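For anyone else relying on geo-replication as a backup: the session health can be checked explicitly, so a crashed session doesn't go unnoticed for months (a sketch; the volume and slave names are placeholders):

```shell
# Sketch: check that the geo-replication session is healthy; "data-hdd"
# and the slave host/volume are placeholders for the actual session.
check_georep() {
  # Workers should report Active/Passive, not Faulty or Stopped
  gluster volume geo-replication data-hdd backup-host::backup-vol status
}
```

Running this from cron and alerting on "Faulty" would have caught the dead session here.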

--Jim
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/H6ROZ2HCOCLE7ETBFU5U2QVPW3LCP35H/


[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-14 Thread Jim Kusznir
Thank you for your help.

After more troubleshooting and host reboots, I accidentally discovered that
the backing disk on ovirt2 (host) had suffered a failure.  On reboot, the
raid card refused to see it at all.  It said it had cache waiting to be
written to disk, and in the end, as it couldn't (wouldn't) see that disk, I
had no choice but to discard that cache and boot up without the physical
disk.  Since doing so (and running a gluster volume remove for the affected
host), things are running normally.

I don't understand why one bad disk wasn't simply failed, or why, if one
underlying process was having such a problem, the other hosts didn't take
it offline and continue (much like RAID would have done).  Instead,
everything was broken (including gluster volumes on unaffected disks that
are fully functional across all hosts).

I'm seeing the need to go multi-spindle for each storage, and I don't want
to do that with the ovirt hosts due to hardware concerns/issues (I have to
use the PERC6i, which I am also learning to distrust), and I would have to
use 2.5in disks (I want to use 3.5").  As such, I will be going to a
dedicated storage server with 12 spindles in a RAID6 configuration.  I'm
debating if it's worth setting it up as a gluster replica 1 system (so I can
easily migrate later), or just building it as NFS with FreeNAS.  I'm leaning
toward the latter, as it seems pointless to run gluster on a single node.
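For reference, the brick removal mentioned above looks roughly like this on a replica-3 volume (a sketch; the volume name, hostname, and brick path are placeholders for the actual setup):

```shell
# Sketch: drop a dead host's brick, reducing the volume from replica 3
# to replica 2. Volume, hostname, and brick path are placeholders.
remove_failed_brick() {
  gluster volume remove-brick data-hdd replica 2 \
      ovirt2.example.com:/gluster/brick3/data-hdd force
  gluster peer detach ovirt2.example.com  # optionally drop the host from the pool
}
```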

--Jim

On Sun, Jul 8, 2018 at 3:54 AM, Yaniv Kaul  wrote:

>
>
> On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir  wrote:
>
>> So, I'm still at a loss... It sounds like it's either insufficient
>> ram/swap, or insufficient network.  It seems to be neither now.  At this
>> point, it appears that gluster is just "broken" and killing my systems for
>> no discernible reason.  Here are details, all from the same system (currently
>> running 3 VMs):
>>
>> [root@ovirt3 ~]# w
>>  22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
>> USER TTY  FROM LOGIN@   IDLE   JCPU   PCPU WHAT
>> root pts/0192.168.8.90 22:262.00s  0.12s  0.11s w
>>
>> bwm-ng reports the highest data usage was about 6MB/s during this test
>> (and that was combined; I have two different gig networks.  One gluster
>> network (primary VM storage) runs on one, the other network handles
>> everything else).
>>
>> [root@ovirt3 ~]# free -m
>>   totalusedfree  shared  buff/cache
>>  available
>> Mem:  31996   13236 232  18   18526
>>  18195
>> Swap: 163831475   14908
>>
>> top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69,
>> 47.66
>>
>
> That is indeed a high load average. How many CPUs do you have, btw?
>
>
>> Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
>> %Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,
>> 0.0 st
>> KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036
>> buff/cache
>> KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail
>> Mem
>>
>
> Can you check what's swapping here? (a tweak to top output will show that)
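One way to see per-process swap use, for what Yaniv asks above (in top itself, pressing `f` and enabling the SWAP field also works), is to read VmSwap from /proc:

```shell
# Print each process's VmSwap (in kB) and name, largest first.
# Kernel threads have no VmSwap line and are skipped automatically.
awk '/^Name:/ {n=$2} /^VmSwap:/ {print $2, n}' /proc/[0-9]*/status 2>/dev/null |
  sort -rn | head
```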
>
>
>>
>>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>> COMMAND
>>
>> 30036 qemu  20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55
>> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object
>> secret,id=masterKey0,format=raw,file=/v+
>> 28501 qemu  20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99
>> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object
>> secret,id=masterKey0,format=raw,file=/va+
>>  2694 root  20   0 2169224  12164   3108 S   5.0  0.0   3290:42
>> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
>> data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
>>
>
> This one's certainly taking quite a bit of your CPU usage overall.
>
>
>> 14293 root  15  -5  944700  13356   4436 S   4.0  0.0  16:32.15
>> /usr/sbin/glusterfs --volfile-server=192.168.8.11
>> --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
>>
>
> I'm not sure what the sorting order is, but doesn't look like Gluster is
> taking a lot of memory?
>
>
>> 25100 vdsm   0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20
>> /usr/bin/python2 /usr/share/vdsm/vdsmd
>>
>> 28971 qemu  20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49
>> /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on
>> -S -object secret,id=masterKey0,format=+
>> 12095 root  20   0  162276   2836   1868 R   1.3 

[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-09 Thread Jim Kusznir
Thank you for your help.

After more troubleshooting and host reboots, I accidentally discovered that
the backing disk on ovirt2 (host) had suffered a failure.  On reboot, the
raid card refused to see it at all.  It said it had cache waiting to be
written to disk, and in the end, as it couldn't (wouldn't) see that disk, I
had no choice but to discard that cache and boot up without the physical
disk.  Since doing so (and running a gluster volume remove for the affected
host), things are running normally, although it appears it corrupted two
disks (I've now lost 5 VMs to gluster-induced disk failures during poorly
handled failures).

I don't understand why one bad disk wasn't simply failed, or why, if one
underlying process was having such a problem, the other hosts didn't take
it offline and continue (much like RAID would have done).  Instead,
everything was broken (including gluster volumes on unaffected disks that
are fully functional across all hosts), along with very poor performance of
the affected machine AND no diagnostic reports that would allude to a failing
hard drive.  Is this expected behavior?

--Jim

On Sun, Jul 8, 2018 at 3:54 AM, Yaniv Kaul  wrote:

>
>
> On Sat, Jul 7, 2018 at 8:45 AM, Jim Kusznir  wrote:
>
>> So, I'm still at a loss... It sounds like it's either insufficient
>> ram/swap, or insufficient network.  It seems to be neither now.  At this
>> point, it appears that gluster is just "broken" and killing my systems for
>> no discernible reason.  Here are details, all from the same system (currently
>> running 3 VMs):
>>
>> [root@ovirt3 ~]# w
>>  22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
>> USER TTY  FROM LOGIN@   IDLE   JCPU   PCPU WHAT
>> root pts/0192.168.8.90 22:262.00s  0.12s  0.11s w
>>
>> bwm-ng reports the highest data usage was about 6MB/s during this test
>> (and that was combined; I have two different gig networks.  One gluster
>> network (primary VM storage) runs on one, the other network handles
>> everything else).
>>
>> [root@ovirt3 ~]# free -m
>>   totalusedfree  shared  buff/cache
>>  available
>> Mem:  31996   13236 232  18   18526
>>  18195
>> Swap: 163831475   14908
>>
>> top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69,
>> 47.66
>>
>
> That is indeed a high load average. How many CPUs do you have, btw?
>
>
>> Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
>> %Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,
>> 0.0 st
>> KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036
>> buff/cache
>> KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail
>> Mem
>>
>
> Can you check what's swapping here? (a tweak to top output will show that)
>
>
>>
>>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>> COMMAND
>>
>> 30036 qemu  20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55
>> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object
>> secret,id=masterKey0,format=raw,file=/v+
>> 28501 qemu  20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99
>> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object
>> secret,id=masterKey0,format=raw,file=/va+
>>  2694 root  20   0 2169224  12164   3108 S   5.0  0.0   3290:42
>> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
>> data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
>>
>
> This one's certainly taking quite a bit of your CPU usage overall.
>
>
>> 14293 root  15  -5  944700  13356   4436 S   4.0  0.0  16:32.15
>> /usr/sbin/glusterfs --volfile-server=192.168.8.11
>> --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
>>
>
> I'm not sure what the sorting order is, but doesn't look like Gluster is
> taking a lot of memory?
>
>
>> 25100 vdsm   0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20
>> /usr/bin/python2 /usr/share/vdsm/vdsmd
>>
>> 28971 qemu  20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49
>> /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on
>> -S -object secret,id=masterKey0,format=+
>> 12095 root  20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
>>
>>
>>  2708 root  20   0 1906040  12404   3080 S   1.0  0.0   1083:33
>> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
>> engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
>> 28623 qemu  20   0 4749536   1.7g  12896 S   0.7  5.

[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-07 Thread Jim Kusznir
This host has NO VMs running on it, only 3 running cluster-wide (including
the engine, which is on its own storage):

top - 10:44:41 up 1 day, 17:10,  1 user,  load average: 15.86, 14.33, 13.39
Tasks: 381 total,   1 running, 379 sleeping,   1 stopped,   0 zombie
%Cpu(s):  2.7 us,  2.1 sy,  0.0 ni, 89.0 id,  6.1 wa,  0.0 hi,  0.2 si,
0.0 st
KiB Mem : 32764284 total,   338232 free,   842324 used, 31583728 buff/cache
KiB Swap: 12582908 total, 12258660 free,   324248 used. 31076748 avail Mem

  PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
COMMAND

13279 root  20   0 2380708  37628   4396 S  51.7  0.1   3768:03
glusterfsd

13273 root  20   0 2233212  20460   4380 S  17.2  0.1 105:50.44
glusterfsd

13287 root  20   0 2233212  20608   4340 S   4.3  0.1  34:27.20
glusterfsd

16205 vdsm   0 -20 5048672  88940  13364 S   1.3  0.3   0:32.69 vdsmd


16300 vdsm  20   0  608488  25096   5404 S   1.3  0.1   0:05.78 python


 1109 vdsm  20   0 3127696  44228   8552 S   0.7  0.1  18:49.76
ovirt-ha-broker

2 root  20   0   0  0  0 S   0.7  0.0   0:00.13
kworker/u64:3

   10 root  20   0   0  0  0 S   0.3  0.0   4:22.36
rcu_sched

  572 root   0 -20   0  0  0 S   0.3  0.0   0:12.02
kworker/1:1H

  797 root  20   0   0  0  0 S   0.3  0.0   1:59.59
kdmwork-253:2

  877 root   0 -20   0  0  0 S   0.3  0.0   0:11.34
kworker/3:1H

 1028 root  20   0   0  0  0 S   0.3  0.0   0:35.35
xfsaild/dm-10

 1869 root  20   0 1496472  10540   6564 S   0.3  0.0   2:15.46 python


 3747 root  20   0   0  0  0 D   0.3  0.0   0:01.21
kworker/u64:1

10979 root  15  -5  723504  15644   3920 S   0.3  0.0  22:46.27
glusterfs

15085 root  20   0  680884  10792   4328 S   0.3  0.0   0:01.13
glusterd

16102 root  15  -5 1204216  44948  11160 S   0.3  0.1   0:18.61
supervdsmd

At the moment, the engine is barely usable, my other VMs appear to be
unresponsive.  Two on one host, one on another, and none on the third.



On Sat, Jul 7, 2018 at 10:38 AM, Jim Kusznir  wrote:

> I run 4-7 VMs, and most of them are 2GB ram.  I have 2 VMs with 4GB.
>
> Ram hasn't been an issue until recent ovirt/gluster upgrades.  Storage has
> always been slow, especially with these drives.  However, even watching
> network utilization on my switch, the gig-e links never max out.
>
> The loadavg issues and unresponsive behavior started with yesterday's
> ovirt updates.  I now have one VM with low I/O that lives on a separate
> storage volume (data, fully SSD backed instead of data-hdd, which was
> having the issues).  I moved it to an ovirt host with no other VMs on it,
> and that had freshly been rebooted.  Before it had this one VM on it,
> loadavg was <0.5.  Now it's up in the 20's, with only one low-disk-I/O, 4GB
> ram VM on the host.
>
> This to me says there's now a new problem separate from Gluster.  I don't
> have any non-gluster storage available to test with.  I did notice that the
> last update included a new kernel, and it appears it's the qemu-kvm
> processes that are consuming way more CPU than they used to.
>
> Are there any known issues?  I'm going to reboot into my previous kernel
> to see if it's kernel-caused.
>
> --Jim
>
>
>
> On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson 
> wrote:
>
>> That is a single sata drive that is slow on random I/O and that has to be
>> synced with 2 other servers. Gluster works synchronously, so one write has
>> to be written and acknowledged on all three nodes.
>>
>> So you have a bottleneck in I/O on the drives and one on the network, and
>> depending on how many virtual servers you have and how much ram they take,
>> you might have a memory bottleneck too.
>>
>> Load spikes when you have a wait somewhere and are overusing capacity.
>> It's not only CPU that load is counted on; it is waiting for resources,
>> so it can be memory or network or drives.
>>
>> How many virtual servers do you run and how much ram do they consume?
>>
>> On July 7, 2018 09:51:42 Jim Kusznir  wrote:
>>
>>> In case it matters, the data-hdd gluster volume uses these hard drives:
>>>
>>> https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_deta
>>> ilpage_o05_s00?ie=UTF8=1
>>>
>>> This is in a Dell R610 with PERC6/i (one drive per server, configured as
>>> a single drive volume to pass it through as its own /dev/sd* device).
>>> Inside the OS, it's partitioned with lvm_thin, then an LVM volume formatted
>>> with XFS and mounted as /gluster/brick3, with the data-hdd volume created
>>> inside that.
>>>
>>> --Jim
>>>
>>> On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir 
>>> wrote:
>>

[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-07 Thread Jim Kusznir
I run 4-7 VMs, and most of them are 2GB ram.  I have 2 VMs with 4GB.

Ram hasn't been an issue until recent ovirt/gluster upgrades.  Storage has
always been slow, especially with these drives.  However, even watching
network utilization on my switch, the gig-e links never max out.

The loadavg issues and unresponsive behavior started with yesterday's ovirt
updates.  I now have one VM with low I/O that lives on a separate storage
volume (data, fully SSD backed instead of data-hdd, which was having the
issues).  I moved it to an ovirt host with no other VMs on it, and that had
freshly been rebooted.  Before it had this one VM on it, loadavg was <0.5.
Now it's up in the 20's, with only one low-disk-I/O, 4GB-ram VM on the host.

This to me says there's now a new problem separate from Gluster.  I don't
have any non-gluster storage available to test with.  I did notice that the
last update included a new kernel, and it appears it's the qemu-kvm
processes that are consuming way more CPU than they used to.

Are there any known issues?  I'm going to reboot into my previous kernel to
see if it's kernel-caused.
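Booting the previous kernel once, without changing the permanent default, can be done like this on CentOS 7 (a sketch; the entry index varies per host, so list the entries first):

```shell
# Sketch: one-shot boot into an older kernel on CentOS 7 to test whether
# the regression is kernel-related. Entry indexes vary; list them first.
boot_previous_kernel() {
  grubby --default-kernel                                     # kernel booted by default now
  awk -F\' '/^menuentry /{print i++ ": " $2}' /etc/grub2.cfg  # numbered boot entries
  grub2-reboot 1   # boot entry 1 on the NEXT reboot only, then revert to the default
  reboot
}
```

Using `grub2-reboot` instead of `grub2-set-default` means a crash or forgotten cleanup still leaves the machine booting its normal kernel afterwards.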

--Jim



On Fri, Jul 6, 2018 at 11:07 PM, Johan Bernhardsson  wrote:

> That is a single sata drive that is slow on random I/O and that has to be
> synced with 2 other servers. Gluster works syncronous so one write has to
> be written and acknowledged on all the three nodes.
>
> So you have a bottle neck in io on drives and one on network and depending
> on how many virtual servers you have and how much ram they take you might
> have memory.
>
> Load spikes when you have a wait somewhere and are overusing capacity. But
> it's now only CPU that load is counted on. It is waiting for resources so
> it can be memory or Network or drives.
>
> How many virtual server do you run and how much ram do they consume?
>
> On July 7, 2018 09:51:42 Jim Kusznir  wrote:
>
>> In case it matters, the data-hdd gluster volume uses these hard drives:
>>
>> https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_
>> detailpage_o05_s00?ie=UTF8=1
>>
>> This is in a Dell R610 with PERC6/i (one drive per server, configured as
>> a single drive volume to pass it through as its own /dev/sd* device).
>> Inside the OS, it's partitioned with lvm_thin, then an LVM volume formatted
>> with XFS and mounted as /gluster/brick3, with the data-hdd volume created
>> inside that.
>>
>> --Jim
>>
>> On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir  wrote:
>>
>>> So, I'm still at a loss... It sounds like it's either insufficient
>>> ram/swap, or insufficient network.  It seems to be neither now.  At this
>>> point, it appears that gluster is just "broken" and killing my systems for
>>> no discernible reason.  Here are details, all from the same system (currently
>>> running 3 VMs):
>>>
>>> [root@ovirt3 ~]# w
>>>  22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
>>> USER TTY  FROM LOGIN@   IDLE   JCPU   PCPU WHAT
>>> root pts/0192.168.8.90 22:262.00s  0.12s  0.11s w
>>>
>>> bwm-ng reports the highest data usage was about 6MB/s during this test
>>> (and that was combined; I have two different gig networks.  One gluster
>>> network (primary VM storage) runs on one, the other network handles
>>> everything else).
>>>
>>> [root@ovirt3 ~]# free -m
>>>   totalusedfree  shared  buff/cache
>>>  available
>>> Mem:  31996   13236 232  18   18526
>>>  18195
>>> Swap: 163831475   14908
>>>
>>> top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69,
>>> 47.66
>>> Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
>>> %Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,
>>> 0.0 st
>>> KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036
>>> buff/cache
>>> KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail
>>> Mem
>>>
>>>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>>> COMMAND
>>>
>>> 30036 qemu  20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55
>>> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S
>>> -object secret,id=masterKey0,format=raw,file=/v+
>>> 28501 qemu  20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99
>>> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object
>>> secret,id=masterKey0,format=raw,file=/va+
>>>  2694 root  20   0 2169224  1216

[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-07 Thread Jim Kusznir
I think I should throw one more thing out there:  The current batch of
problems started essentially today, and I did apply the updates waiting in
the ovirt repos (through the ovirt mgmt interface: install updates).
Perhaps there is now something from that which is breaking things.

On Fri, Jul 6, 2018 at 10:51 PM, Jim Kusznir  wrote:

> In case it matters, the data-hdd gluster volume uses these hard drives:
>
> https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_
> detailpage_o05_s00?ie=UTF8=1
>
> This is in a Dell R610 with PERC6/i (one drive per server, configured as a
> single drive volume to pass it through as its own /dev/sd* device).  Inside
> the OS, it's partitioned with lvm_thin, then an LVM volume formatted with
> XFS and mounted as /gluster/brick3, with the data-hdd volume created inside
> that.
>
> --Jim
>
> On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir  wrote:
>
>> So, I'm still at a loss... It sounds like it's either insufficient
>> ram/swap, or insufficient network.  It seems to be neither now.  At this
>> point, it appears that gluster is just "broken" and killing my systems for
>> no discernible reason.  Here are details, all from the same system (currently
>> running 3 VMs):
>>
>> [root@ovirt3 ~]# w
>>  22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
>> USER TTY  FROM LOGIN@   IDLE   JCPU   PCPU WHAT
>> root pts/0192.168.8.90 22:262.00s  0.12s  0.11s w
>>
>> bwm-ng reports the highest data usage was about 6MB/s during this test
>> (and that was combined; I have two different gig networks.  One gluster
>> network (primary VM storage) runs on one, the other network handles
>> everything else).
>>
>> [root@ovirt3 ~]# free -m
>>   totalusedfree  shared  buff/cache
>>  available
>> Mem:  31996   13236 232  18   18526
>>  18195
>> Swap: 163831475   14908
>>
>> top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69,
>> 47.66
>> Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
>> %Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,
>> 0.0 st
>> KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036
>> buff/cache
>> KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail
>> Mem
>>
>>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
>> COMMAND
>>
>> 30036 qemu  20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55
>> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object
>> secret,id=masterKey0,format=raw,file=/v+
>> 28501 qemu  20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99
>> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object
>> secret,id=masterKey0,format=raw,file=/va+
>>  2694 root  20   0 2169224  12164   3108 S   5.0  0.0   3290:42
>> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
>> data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
>> 14293 root  15  -5  944700  13356   4436 S   4.0  0.0  16:32.15
>> /usr/sbin/glusterfs --volfile-server=192.168.8.11
>> --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
>> 25100 vdsm   0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20
>> /usr/bin/python2 /usr/share/vdsm/vdsmd
>>
>> 28971 qemu  20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49
>> /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on
>> -S -object secret,id=masterKey0,format=+
>> 12095 root  20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
>>
>>
>>  2708 root  20   0 1906040  12404   3080 S   1.0  0.0   1083:33
>> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
>> engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
>> 28623 qemu  20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64
>> /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on
>> -S -object secret,id=masterKey0,format=ra+
>>10 root  20   0   0  0  0 S   0.3  0.0 215:54.72
>> [rcu_sched]
>>
>>  1030 sanlock   rt   0  773804  27908   2744 S   0.3  0.1  35:55.61
>> /usr/sbin/sanlock daemon
>>
>>  1890 zabbix20   0   83904   1696   1612 S   0.3  0.0  24:30.63
>> /usr/sbin/zabbix_agentd: collector [idle 1 sec]
>>
>>  2722 root  20   0 1298004   6148   2580 S   0.3  0.0  38:10.82
>> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
>> iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
>>  6340 ro

[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-07 Thread Jim Kusznir
In case it matters, the data-hdd gluster volume uses these hard drives:

https://www.amazon.com/gp/product/B01M1NHCZT/ref=oh_aui_detailpage_o05_s00?ie=UTF8=1

This is in a Dell R610 with PERC6/i (one drive per server, configured as a
single drive volume to pass it through as its own /dev/sd* device).  Inside
the OS, it's partitioned with lvm_thin, then an LVM volume formatted with
XFS and mounted as /gluster/brick3, with the data-hdd volume created inside
that.
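The layout described above, reconstructed as commands (a sketch; the device, sizes, and names are placeholders for the actual setup):

```shell
# Sketch: thin pool -> thin LV -> XFS -> /gluster/brick3, as described.
# Device, sizes, and names are placeholders for the actual setup.
make_thin_brick() {
  pvcreate /dev/sdb
  vgcreate gluster_vg /dev/sdb
  lvcreate -L 1.8T --thinpool gluster_pool gluster_vg
  lvcreate -V 1.8T --thin -n brick3 gluster_vg/gluster_pool
  mkfs.xfs -i size=512 /dev/gluster_vg/brick3  # 512-byte inodes, per gluster guidance
  mkdir -p /gluster/brick3
  mount /dev/gluster_vg/brick3 /gluster/brick3
}
```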

--Jim

On Fri, Jul 6, 2018 at 10:45 PM, Jim Kusznir  wrote:

> So, I'm still at a loss... It sounds like it's either insufficient ram/swap,
> or insufficient network.  It seems to be neither now.  At this point, it
> appears that gluster is just "broken" and killing my systems for no
> discernible reason.  Here are details, all from the same system (currently
> running 3 VMs):
>
> [root@ovirt3 ~]# w
>  22:26:53 up 36 days,  4:34,  1 user,  load average: 42.78, 55.98, 53.31
> USER TTY  FROM LOGIN@   IDLE   JCPU   PCPU WHAT
> root pts/0192.168.8.90 22:262.00s  0.12s  0.11s w
>
> bwm-ng reports the highest data usage was about 6MB/s during this test
> (and that was combined; I have two different gig networks.  One gluster
> network (primary VM storage) runs on one, the other network handles
> everything else).
>
> [root@ovirt3 ~]# free -m
>   totalusedfree  shared  buff/cache
>  available
> Mem:  31996   13236 232  18   18526
>  18195
> Swap: 163831475   14908
>
> top - 22:32:56 up 36 days,  4:41,  1 user,  load average: 17.99, 39.69,
> 47.66
> Tasks: 407 total,   1 running, 405 sleeping,   1 stopped,   0 zombie
> %Cpu(s):  8.6 us,  2.1 sy,  0.0 ni, 87.6 id,  1.6 wa,  0.0 hi,  0.1 si,
> 0.0 st
> KiB Mem : 32764284 total,   228296 free, 13541952 used, 18994036 buff/cache
> KiB Swap: 16777212 total, 15246200 free,  1531012 used. 18643960 avail Mem
>
>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+
> COMMAND
>
> 30036 qemu  20   0 6872324   5.2g  13532 S 144.6 16.5 216:14.55
> /usr/libexec/qemu-kvm -name guest=BillingWin,debug-threads=on -S -object
> secret,id=masterKey0,format=raw,file=/v+
> 28501 qemu  20   0 5034968   3.6g  12880 S  16.2 11.7  73:44.99
> /usr/libexec/qemu-kvm -name guest=FusionPBX,debug-threads=on -S -object
> secret,id=masterKey0,format=raw,file=/va+
>  2694 root  20   0 2169224  12164   3108 S   5.0  0.0   3290:42
> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
> data.ovirt3.nwfiber.com.gluster-brick2-data -p /var/run/+
> 14293 root  15  -5  944700  13356   4436 S   4.0  0.0  16:32.15
> /usr/sbin/glusterfs --volfile-server=192.168.8.11
> --volfile-server=192.168.8.12 --volfile-server=192.168.8.13 --+
> 25100 vdsm   0 -20 6747440 107868  12836 S   2.3  0.3  21:35.20
> /usr/bin/python2 /usr/share/vdsm/vdsmd
>
> 28971 qemu  20   0 2842592   1.5g  13548 S   1.7  4.7 241:46.49
> /usr/libexec/qemu-kvm -name guest=unifi.palousetech.com,debug-threads=on
> -S -object secret,id=masterKey0,format=+
> 12095 root  20   0  162276   2836   1868 R   1.3  0.0   0:00.25 top
>
>
>  2708 root  20   0 1906040  12404   3080 S   1.0  0.0   1083:33
> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
> engine.ovirt3.nwfiber.com.gluster-brick1-engine -p /var/+
> 28623 qemu  20   0 4749536   1.7g  12896 S   0.7  5.5   4:30.64
> /usr/libexec/qemu-kvm -name guest=billing.nwfiber.com,debug-threads=on -S
> -object secret,id=masterKey0,format=ra+
>10 root  20   0   0  0  0 S   0.3  0.0 215:54.72
> [rcu_sched]
>
>  1030 sanlock   rt   0  773804  27908   2744 S   0.3  0.1  35:55.61
> /usr/sbin/sanlock daemon
>
>  1890 zabbix20   0   83904   1696   1612 S   0.3  0.0  24:30.63
> /usr/sbin/zabbix_agentd: collector [idle 1 sec]
>
>  2722 root  20   0 1298004   6148   2580 S   0.3  0.0  38:10.82
> /usr/sbin/glusterfsd -s ovirt3.nwfiber.com --volfile-id
> iso.ovirt3.nwfiber.com.gluster-brick4-iso -p /var/run/gl+
>  6340 root  20   0   0  0  0 S   0.3  0.0   0:04.30
> [kworker/7:0]
>
> 10652 root  20   0   0  0  0 S   0.3  0.0   0:00.23
> [kworker/u64:2]
>
> 14724 root  20   0 1076344  17400   3200 S   0.3  0.1  10:04.13
> /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p
> /var/run/gluster/glustershd/glustershd.pid -+
> 22011 root  20   0   0  0  0 S   0.3  0.0   0:05.04
> [kworker/10:1]
>
>
> Not sure why the system load dropped other than I was trying to take a
> picture of it :)
>
> In any case, it appears that at this time, I have plenty of swap, ram, and
> network capacity, and yet things are still running ver

[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-06 Thread Jim Kusznir
al
storage for windows boxes) have been very solid; If I could get that kind
of reliability for my ovirt stack, it would be a substantial improvement.
Currently, it seems about every other month I have a gluster-induced outage.

Sometimes I wonder if hyperconverged itself is the issue, but my
infrastructure doesn't justify three servers at the same location...I might
be able to do two, but even that seems like it's pushing it.

Looks like I can upgrade to 10G for about $900.  I can order a dual-Xeon
supermicro 12-disk server, loaded with 2TB WD Enterprise disks and a pair
of SSDs for the OS, 32GB ram, 2.67GHz CPUs for about $720 delivered.  I've
got to do something to improve my reliability; I can't keep going the way I
have been.

--Jim


On Fri, Jul 6, 2018 at 9:13 PM, Johan Bernhardsson  wrote:

> Load like that is mostly IO based: either the machine is swapping or the
> network is too slow. Check I/O wait in top.
>
> And the problem where the OOM killer kills off gluster: that means
> you don't monitor RAM usage on the servers? Either something is eating all
> your RAM and swap becomes really IO intensive before the process is killed
> off, or you have the wrong swap settings in sysctl.conf. (There are tons of
> broken guides that recommend setting swappiness to 0, but that disables
> swap on newer kernels. The proper swappiness for swapping only when
> necessary is 1 or a sufficiently low number like 10; the default is 60.)
>
>
> Moving to nfs will not improve things. You will get more memory since
> gluster isn't running and that is good. But you will have a single node
> that can fail with all your storage and it would still be on 1 gigabit only
> and your three node cluster would easily saturate that link.
>
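Johan's swappiness and I/O-wait points translate to a couple of quick checks on each host; a sketch (read-only as written, with the root-only persistence step and the I/O-wait tools shown as comments):

```shell
# Read the current swappiness (kernel default is 60)
swappiness=$(cat /proc/sys/vm/swappiness)
echo "vm.swappiness is ${swappiness}"

# To persist a lower value, as root (the drop-in file name is arbitrary):
#   echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
#   sysctl -p /etc/sysctl.d/99-swappiness.conf

# I/O wait appears in the "wa" column of top, and per-device in iostat:
#   top -bn1 | head -n 3
#   iostat -x 5 3
```

Per Johan's note, 1 (or a low value like 10) keeps swap available for emergencies without letting the kernel swap eagerly.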

[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-06 Thread Jim Kusznir
So far it does not appear to be helping much. I'm still getting VMs
locking up and all kinds of notices from the ovirt engine about non-responsive
hosts.  I'm still seeing load averages in the 20-30 range.

Jim


[ovirt-users] Re: Ovirt cluster unstable; gluster to blame (again)

2018-07-06 Thread Jim Kusznir
Thank you for the advice and help

I do plan on going 10Gbps networking; haven't quite jumped off that cliff
yet, though.

I did put my data-hdd (main VM storage volume) onto a dedicated 1Gbps
network, and I've watched throughput on that and never seen more than
60MB/s achieved (as reported by bwm-ng).  I have a separate 1Gbps network
for communication and ovirt migration, but I wanted to break that up
further (separating VM traffic from migration/mgmt traffic).  My three
SSD-backed gluster volumes run on the main network too, as I haven't been able
to get them to move to the new network (which I was trying to use as an
all-gluster network).  I tried bonding, but that seemed to reduce performance
rather than improve it.

--Jim

On Fri, Jul 6, 2018 at 2:52 PM, Jamie Lawrence 
wrote:

> Hi Jim,
>
> I don't have any targeted suggestions, because there isn't much to latch
> on to. I can say Gluster replica three (no arbiters) on dedicated servers
> serving a couple of ovirt VM clusters here have not had these sorts of issues.
>
> I suspect your long heal times (and the resultant long periods of high
> load) are at least partly related to 1G networking. That is just a matter
> of IO - heals of VMs involve moving a lot of bits. My cluster uses 10G
> bonded NICs on the gluster and ovirt boxes for storage traffic and separate
> bonded 1G for ovirtmgmt and communication with other machines/people, and
> we're occasionally hitting the bandwidth ceiling on the storage network.
> I'm starting to think about 40/100G, different ways of splitting up
> intensive systems, and considering iSCSI for specific volumes, although I
> really don't want to go there.
>
> I don't run FreeNAS[1], but I do run FreeBSD as storage servers for their
> excellent ZFS implementation, mostly for backups. ZFS will make your `heal`
> problem go away, but not your bandwidth problems, which become worse
> (because of fewer NICs pushing traffic). 10G hardware is not exactly in the
> impulse-buy territory, but if you can, I'd recommend doing some testing
> using it. I think at least some of your problems are related.
>
> If that's not possible, my next stops would be optimizing everything I
> could about sharding and healing, and tuning for the shard size, to
> squeeze as much performance out of 1G as I could, but that will only go so
> far.
>
> -j
>
> [1] FreeNAS is just a storage-tuned FreeBSD with a GUI.
>
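Jamie's closing suggestion about shard and heal tuning corresponds to a handful of volume options; a hedged sketch (these are standard GlusterFS option names, but the values are illustrative starting points only, not tested recommendations for this cluster — review before running against a live volume):

```shell
VOLUME=data-hdd   # the VM storage volume named in this thread

if command -v gluster >/dev/null 2>&1; then
    # See what is currently set before changing anything
    gluster volume info "$VOLUME"

    # Illustrative starting points for a sharded VM-image volume
    gluster volume set "$VOLUME" features.shard on
    gluster volume set "$VOLUME" features.shard-block-size 64MB
    gluster volume set "$VOLUME" cluster.shd-max-threads 2

    # Watch the heal backlog while experimenting
    gluster volume heal "$VOLUME" info
else
    echo "gluster CLI not found; commands listed for reference only"
fi
```

Note that changing the shard block size only affects newly written files, which is one reason to settle on a value before loading the volume with VM images.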

[ovirt-users] Ovirt cluster unstable; gluster to blame (again)

2018-07-06 Thread Jim Kusznir
hi all:

Once again my production ovirt cluster is collapsing in on itself.  My
servers are intermittently unavailable or degrading, customers are noticing
and calling in.  This seems to be yet another gluster failure that I
haven't been able to pin down.

I posted about this a while ago, but didn't get anywhere (no replies that I
found).  The problem started out as a glusterfsd process consuming large
amounts of RAM (up to the point where RAM and swap were exhausted and the
kernel OOM killer killed off the glusterfsd process).  For reasons not
clear to me at this time, that resulted in any VMs running on that host and
that gluster volume being paused with I/O errors (the glusterfs process is
usually unharmed; why it didn't continue I/O with the other servers is
confusing to me).

I have 3 servers and a total of 4 gluster volumes (engine, iso, data, and
data-hdd).  The first 3 are replica 2+arb; the 4th (data-hdd) is replica
3.  The first 3 are backed by an LVM partition (some thin provisioned) on
an SSD; the 4th is on a Seagate hybrid disk (hdd + some internal flash for
acceleration).  data-hdd is the only thing on the disk.  Servers are Dell
R610s with the PERC 6/i raid card, with the disks individually passed through
to the OS (no raid enabled).

The above RAM usage issue came from the data-hdd volume.  Yesterday, I
caught one of the glusterfsd processes at high RAM usage before the OOM
killer had to run.  I was able to migrate the VMs off the machine and, for
good measure, reboot the entire machine (after taking this opportunity to
run the software updates that ovirt said were pending).  Upon booting back
up, the necessary volume healing began.  However, this time, the healing
caused all three servers to go to very, very high load averages (I saw just
under 200 on one server; typically they've been 40-70) with top reporting
IO wait at 7-20%.  Network for this volume is a dedicated gig network.
According to bwm-ng, initially the network bandwidth would hit 50MB/s (yes,
bytes), but tailed off to mostly kB/s for a while.  All machines' load
averages were still 40+ and gluster volume heal data-hdd info reported 5
items needing healing.  Servers were intermittently experiencing IO issues,
even on the 3 gluster volumes that appeared largely unaffected.  Even the
OS activities on the hosts themselves (logging in, running commands) would
often be very delayed.  The ovirt engine was seemingly randomly throwing
engine down / engine up / engine failed notifications.  Responsiveness on
ANY VM was horrific most of the time, with random VMs being inaccessible.

I let the gluster heal run overnight.  By morning, there were still 5 items
needing healing, all three servers were still experiencing high load, and
servers were still largely unstable.
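For future incidents of this shape, it helps to log whether the heal backlog is actually shrinking while the load stays high; a minimal watch-loop sketch (volume name taken from this thread; the gluster call is guarded so the loop still runs where the CLI is absent, and the interval would normally be 60 seconds or more):

```shell
VOLUME=data-hdd   # the volume being healed in this thread
SAMPLES=3
INTERVAL=5        # seconds; something like 60 is more useful in practice

i=1
while [ "$i" -le "$SAMPLES" ]; do
    date
    uptime    # the load averages described above
    if command -v gluster >/dev/null 2>&1; then
        gluster volume heal "$VOLUME" info | grep 'Number of entries'
    fi
    sleep "$INTERVAL"
    i=$((i + 1))
done
```

Capturing a few of these samples alongside bwm-ng output makes it much easier to tell a stalled heal from a slow one.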

I've noticed that all of my ovirt outages (and I've had a lot, way more
than is acceptable for a production cluster) have come from gluster.  I
still have 3 VMs whose hard disk images were corrupted by my last
gluster crash and that I haven't had time to repair / rebuild yet (I believe
this crash was caused by the OOM issue previously mentioned, but I didn't
know it at the time).

Is gluster really ready for production yet?  It seems so unstable to
me...  I'm looking at replacing gluster with a dedicated NFS server, likely
FreeNAS.  Any suggestions?  What is the "right" way to do production
storage on this 3-node cluster?  Can I get this gluster volume stable
enough to get my VMs to run reliably again until I can deploy another
storage solution?

--Jim
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/users@ovirt.org/message/YQX3LQFQQPW4JTCB7B6FY2LLR6NA2CB3/


[ovirt-users] Re: User agent for Ovirt 4.2

2018-06-15 Thread Jim Kusznir
Hmm...I did that, and restarted the ovirt agent, but it doesn't appear to
be working... All my VMs in ovirt still complain about missing / outdated
agents.
I'll do more looking later today.
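When the engine still flags VMs after an install, it is worth confirming from inside each guest that the agent package is actually present and its service is running; a small sketch (the `ovirt-guest-agent` package/service names match what is installed elsewhere in this thread, but paths and service managers vary by distro):

```shell
# Run inside the guest VM; detect whether the oVirt guest agent is installed
if command -v dpkg >/dev/null 2>&1 && dpkg -s ovirt-guest-agent >/dev/null 2>&1; then
    agent_state=installed
    # If installed but the engine still complains, check the service:
    #   systemctl status ovirt-guest-agent
elif command -v rpm >/dev/null 2>&1 && rpm -q ovirt-guest-agent-common >/dev/null 2>&1; then
    agent_state=installed
else
    agent_state=missing
    # Debian:  apt-get install ovirt-guest-agent
    # EL7:     yum -y install epel-release && yum -y install ovirt-guest-agent-common
fi
echo "ovirt guest agent: ${agent_state}"
```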

On Thu, Jun 14, 2018 at 11:11 AM, Alex K  wrote:

> In my case I am using some Debian9 VMs and I can install the guest agents
> with:
>
> apt-get install ovirt-guest-agent
>
> This article also has some reference:
> https://www.ovirt.org/documentation/how-to/guest-agent/install-the-guest-agent-in-debian/
>
>
> Alex
>


[ovirt-users] Re: User agent for Ovirt 4.2

2018-06-14 Thread Jim Kusznir
What about for Debian guests?  I unfortunately have several that must run
Debian (I do have a mix of RedHat-based and Debian-based VMs).

Thanks!
--Jim

On Wed, Jun 13, 2018 at 11:00 PM, Leo David  wrote:

> Hi,
> I've just managed to install the guest agent by installing epel-release first:
> yum -y install epel-release
> yum -y install ovirt-guest-agent-common
>
> At this moment it is pulling in ovirt-guest-agent-common.noarch 0:1.0.14-1.el7
>
>
>
> --
> Best regards, Leo David
>


[ovirt-users] User agent for Ovirt 4.2

2018-06-13 Thread Jim Kusznir
Hi:

I haven't managed to find the new / current repo/source for the ovirt guest
agent for the 4.2 upgrade.  All my VMs now say that they need the agent.
Google keeps referring me to old / broken / non-existent repos.  Where do I
find the 4.2 agent (or does the 4.2 agent even exist)?

Thanks!
--Jim


[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-30 Thread Jim Kusznir
At the moment, it is responding like I would expect.  I do know I have one
failed drive on one brick (hardware failure; the OS removed the drive
completely, and the underlying /dev/sdb is gone).  I have a new disk on
order (overnight), but the failed disk is only one brick of one volume that
is replica 3, so I would hope the system can tolerate a complete failure
like that and remain operational.

Since having the gluster-volume-starting problems, I have performed a test
on the engine volume, writing and removing a file and verifying it
happened on all three hosts; that worked.  The engine volume has all of
its bricks, as do two other volumes; only the one volume is short one
brick.

--Jim

On Tue, May 29, 2018 at 11:41 PM, Johan Bernhardsson  wrote:

> Is storage working as it should?  Does the gluster mount point respond as
> it should? Can you write files to it?  Do the physical drives say that
> they are ok? Can you write to the physical drives (you shouldn't bypass the
> gluster mount point, but you do need to test the drives)?
>
> To me this sounds like broken or almost-broken hardware, or broken
> underlying filesystems.
>
> If one of the drives malfunctions and times out, gluster will be slow and
> time out. It runs writes synchronously, so the slowest node will slow down
> the whole system.
>
> /Johan
>
>
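Johan's checklist above can be turned into a few concrete commands; a sketch (the mount path and device name are examples drawn from this thread, not verified values, and the SMART check needs smartmontools plus root):

```shell
MOUNT=/rhev/data-center/mnt/glusterSD   # example oVirt gluster mount path
DISK=/dev/sdb                           # the drive that disappeared in this thread

# 1. Does the mount point respond, and can we write through it?
if [ -d "$MOUNT" ]; then
    testfile="$MOUNT/.write-test.$$"
    if echo ok > "$testfile" 2>/dev/null; then
        echo "mount is writable"
        rm -f "$testfile"
    else
        echo "mount exists but is NOT writable"
    fi
else
    echo "mount point $MOUNT not present on this host"
fi

# 2. Do the physical drives report healthy? (requires smartmontools, root)
if command -v smartctl >/dev/null 2>&1 && [ -b "$DISK" ]; then
    smartctl -H "$DISK"
else
    echo "skipping SMART check (smartctl or $DISK not available)"
fi
```

Writing through the gluster mount (step 1) and checking the raw drives (step 2) separates a gluster-level stall from a failing disk, which is exactly the distinction Johan is asking about.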

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-30 Thread Jim Kusznir
hosted-engine --deploy failed (it would not come up on my existing gluster
storage).  However, I realized no changes were written to my existing
storage.  So, I went back to trying to get my old engine running.

hosted-engine --vm-status is now taking a very long time (5+ minutes) to
return, and it returns stale information everywhere.  I thought perhaps the
lockspace was corrupt, so I tried to clean that and the metadata, but both
are failing (--clean-metadata has hung and I can't even ctrl-c out of it).

How can I reinitialize all the lockspace/metadata safely?  There is no
engine or VMs running currently.

--Jim
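For anyone hitting this later: the hosted-engine CLI does expose commands for rebuilding the lockspace and clearing stale host metadata. A hedged sketch of the usual sequence (flag names as I understand the 4.2-era tool; verify against the hosted-engine documentation for your version, and run only with the engine VM stopped and all hosts in global maintenance):

```shell
HE_CLI=$(command -v hosted-engine || true)   # empty when the CLI is absent

if [ -n "$HE_CLI" ]; then
    # All HA hosts in global maintenance, HA agents stopped first
    hosted-engine --set-maintenance --mode=global
    systemctl stop ovirt-ha-agent ovirt-ha-broker
    # Rebuild the sanlock lockspace, then clear stale host metadata
    hosted-engine --reinitialize-lockspace --force
    hosted-engine --clean-metadata --force-cleanup
    systemctl start ovirt-ha-broker ovirt-ha-agent
    hosted-engine --set-maintenance --mode=none
else
    echo "hosted-engine CLI not found here; sequence shown for reference only"
fi
```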

> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir  wrote:
>
>> I also finally found the following in my system log on one server:
>>
>> [10679.524491] INFO: task glusterclogro:14933 blocked for more than 120 seconds.
>> [10679.525826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [10679.527144] glusterclogro   D 97209832bf40 0 14933  1 0x0080
>> [10679.527150] Call Trace:
>> [10679.527161]  [] schedule+0x29/0x70
>> [10679.527218]  [] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>> [10679.527225]  [] ? wake_up_state+0x20/0x20
>> [10679.527254]  [] xfs_file_fsync+0x107/0x1e0 [xfs]
>> [10679.527260]  [] do_fsync+0x67/0xb0
>> [10679.527268]  [] ? system_call_after_swapgs+0xbc/0x160
>> [10679.527271]  [] SyS_fsync+0x10/0x20
>> [10679.527275]  [] system_call_fastpath+0x1c/0x21
>> [10679.527279]  [] ? system_call_after_swapgs+0xc8/0x160
>> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more than 120 seconds.
>> [10679.528608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [10679.529956] glusterposixfsy D 972495f84f10 0 14941  1 0x0080
>> [10679.529961] Call Trace:
>> [10679.529966]  [] schedule+0x29/0x70
>> [10679.530003]  [] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
>> [10679.530008]  [] ? wake_up_state+0x20/0x20
>> [10679.530038]  [] xfs_file_fsync+0x107/0x1e0 [xfs]
>> [10679.530042]  [] do_fsync+0x67/0xb0
>> [10679.530046]  [] ? system_call_after_swapgs+0xbc/0x160
>> [10679.530050]  [] SyS_fdatasync+0x13/0x20
>> [10679.530054]  [] system_call_fastpath+0x1c/0x21
>> [10679.530058]  [] ? system_call_after_swapgs+0xc8/0x160
>> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more than 120 seconds.
>> [10679.531805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [10679.533732

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
Well, things went from bad to very, very bad

It appears that during one of the 2 minute lockups, the fencing agents
decided that another node in the cluster was down.  As a result, 2 of the 3
nodes were simultaneously reset with fencing agent reboot.  After the nodes
came back up, the engine would not start.  All running VMs (including VMs
on the 3rd node that was not rebooted) crashed.

I've now been working for about 3 hours trying to get the engine to come
up.  I don't know why it won't start.  hosted-engine --vm-start says it's
starting, but it doesn't start (virsh doesn't show any VMs running).  I'm
currently running --deploy, as I had run out of options for anything else I
can come up with.  I hope this will allow me to re-import all my existing
VMs and allow me to start them back up after everything comes back up.

I do have an unverified geo-rep backup; I don't know if it is a good backup
(there were several prior messages to this list, but I didn't get replies
to my questions.  It was running in what I believe to be a "strange" state,
and the data directories are larger than their source).

I'll see if my --deploy works, and if not, I'll be back with another
message/help request.

When the dust settles and I'm at least minimally functional again, I really
want to understand why all these technologies designed to offer redundancy
conspired to reduce uptime and create failures where there weren't any
otherwise.  I thought with hosted engine, 3 ovirt servers and glusterfs
with minimum replica 2+arb or replica 3 should have offered strong
resilience against server failure or disk failure, and should have
prevented / recovered from data corruption.  Instead, all of the above
happened (once I get my cluster back up, I still have to try and recover my
webserver VM, which won't boot due to XFS journal corruption created
during the gluster crashes).  I think a lot of these issues were rooted
in the upgrade from 4.1 to 4.2.

--Jim

On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir  wrote:

> I also finally found the following in my system log on one server:
>
> [10679.524491] INFO: task glusterclogro:14933 blocked for more than 120
> seconds.
> [10679.525826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [10679.527144] glusterclogro   D 97209832bf40 0 14933  1
> 0x0080
> [10679.527150] Call Trace:
> [10679.527161]  [] schedule+0x29/0x70
> [10679.527218]  [] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
> [10679.527225]  [] ? wake_up_state+0x20/0x20
> [10679.527254]  [] xfs_file_fsync+0x107/0x1e0 [xfs]
> [10679.527260]  [] do_fsync+0x67/0xb0
> [10679.527268]  [] ? system_call_after_swapgs+0xbc/0x160
> [10679.527271]  [] SyS_fsync+0x10/0x20
> [10679.527275]  [] system_call_fastpath+0x1c/0x21
> [10679.527279]  [] ? system_call_after_swapgs+0xc8/0x160
> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more than 120
> seconds.
> [10679.528608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [10679.529956] glusterposixfsy D 972495f84f10 0 14941  1
> 0x0080
> [10679.529961] Call Trace:
> [10679.529966]  [] schedule+0x29/0x70
> [10679.530003]  [] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
> [10679.530008]  [] ? wake_up_state+0x20/0x20
> [10679.530038]  [] xfs_file_fsync+0x107/0x1e0 [xfs]
> [10679.530042]  [] do_fsync+0x67/0xb0
> [10679.530046]  [] ? system_call_after_swapgs+0xbc/0x160
> [10679.530050]  [] SyS_fdatasync+0x13/0x20
> [10679.530054]  [] system_call_fastpath+0x1c/0x21
> [10679.530058]  [] ? system_call_after_swapgs+0xc8/0x160
> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more than 120
> seconds.
> [10679.531805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [10679.533732] glusteriotwr13  D 9720a83f 0 15486  1
> 0x0080
> [10679.533738] Call Trace:
> [10679.533747]  [] schedule+0x29/0x70
> [10679.533799]  [] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
> [10679.533806]  [] ? wake_up_state+0x20/0x20
> [10679.533846]  [] xfs_file_fsync+0x107/0x1e0 [xfs]
> [10679.533852]  [] do_fsync+0x67/0xb0
> [10679.533858]  [] ? system_call_after_swapgs+0xbc/0x160
> [10679.533863]  [] SyS_fdatasync+0x13/0x20
> [10679.533868]  [] system_call_fastpath+0x1c/0x21
> [10679.533873]  [] ? system_call_after_swapgs+0xc8/0x160
> [10919.512757] INFO: task glusterclogro:14933 blocked for more than 120
> seconds.
> [10919.514714] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [10919.516663] glusterclogro   D 97209832bf40 0 14933  1
> 0x0080
> [10919.516677] Call Trace:
> [10919.516690]  [] schedule+0x29/0x70
> [10919.516696]  [] schedule_timeout+0x239/0x2c0
> [10919.516703]  [] ? blk_finis

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
0x0080
[11279.504635] Call Trace:
[11279.504640]  [] schedule+0x29/0x70
[11279.504676]  [] _xfs_log_force_lsn+0x2e8/0x340 [xfs]
[11279.504681]  [] ? wake_up_state+0x20/0x20
[11279.504710]  [] xfs_file_fsync+0x107/0x1e0 [xfs]
[11279.504714]  [] do_fsync+0x67/0xb0
[11279.504718]  [] ? system_call_after_swapgs+0xbc/0x160
[11279.504722]  [] SyS_fsync+0x10/0x20
[11279.504725]  [] system_call_fastpath+0x1c/0x21
[11279.504730]  [] ? system_call_after_swapgs+0xc8/0x160
[12127.466494] perf: interrupt took too long (8263 > 8150), lowering
kernel.perf_event_max_sample_rate to 24000


I think this is the cause of the massive ovirt performance issues
irrespective of gluster volume.  At the time this happened, I was also
ssh'ed into the host, and was doing some rpm query commands.  I had just
run rpm -qa | grep glusterfs (to verify what version was actually
installed), and that command took almost 2 minutes to return!  Normally it
takes less than 2 seconds.  That is all pure local SSD IO, too.
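
A stall like that can be quantified directly by timing a small fsync-backed write; below is a minimal sketch, not from the thread, where the target path is an assumption and should point at whichever filesystem is suspect:

```shell
# Minimal sketch (assumption: TARGET sits on the filesystem under test):
# time a 1 MiB write that ends with an fsync, similar to what the blocked
# gluster threads in the trace above were doing.
TARGET=${TARGET:-/tmp/fsync_probe.$$}
start=$(date +%s%N)
dd if=/dev/zero of="$TARGET" bs=4k count=256 conv=fsync status=none
end=$(date +%s%N)
elapsed_ms=$(( (end - start) / 1000000 ))
echo "1 MiB fsync-backed write took ${elapsed_ms} ms"
rm -f "$TARGET"
```

On a healthy local SSD this should report a few milliseconds; multi-second results point at the same storage stall the hung-task messages describe.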

I'm no expert, but it's my understanding that any time software causes
these kinds of issues, it's a serious bug in the software, even if it's
mis-handled exceptions.  Is this correct?

--Jim

On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir  wrote:

> I think this is the profile information for one of the volumes that lives
> on the SSDs and is fully operational with no down/problem disks:
>
> [root@ovirt2 yum.repos.d]# gluster volume profile data info
> Brick: ovirt2.nwfiber.com:/gluster/brick2/data
> --
> Cumulative Stats:
>Block Size:256b+ 512b+
> 1024b+
>  No. of Reads:  983  2696
> 1059
> No. of Writes:0  1113
>  302
>
>Block Size:   2048b+4096b+
> 8192b+
>  No. of Reads:  852 88608
>  53526
> No. of Writes:  522812340
>  76257
>
>Block Size:  16384b+   32768b+
>  65536b+
>  No. of Reads:54351241901
>  15024
> No. of Writes:21636  8656
> 8976
>
>Block Size: 131072b+
>  No. of Reads:   524156
> No. of Writes:   296071
>  %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls          Fop
>  ---------   -----------   -----------   -----------   ------------         ----
>       0.00       0.00 us       0.00 us        0.00 us          4189      RELEASE
>       0.00       0.00 us       0.00 us        0.00 us          1257   RELEASEDIR
>       0.00      46.19 us      12.00 us      187.00 us            69        FLUSH
>       0.00     147.00 us      78.00 us      367.00 us            86  REMOVEXATTR
>       0.00     223.46 us      24.00 us     1166.00 us           149      READDIR
>       0.00     565.34 us      76.00 us     3639.00 us            88    FTRUNCATE
>       0.00     263.28 us      20.00 us    28385.00 us           228           LK
>       0.00      98.84 us       2.00 us      880.00 us          1198      OPENDIR
>       0.00      91.59 us      26.00 us    10371.00 us          3853       STATFS
>       0.00     494.14 us      17.00 us   193439.00 us          1171     GETXATTR
>       0.00     299.42 us      35.00 us     9799.00 us          2044     READDIRP
>       0.00    1965.31 us     110.00 us   382258.00 us           321      XATTROP
>       0.01     113.40 us      24.00 us    61061.00 us          8134         STAT
>       0.01     755.38 us      57.00 us   607603.00 us          3196      DISCARD
>       0.05    2690.09 us      58.00 us  2704761.00 us          3206         OPEN
>       0.10  119978.25 us      97.00 us  9406684.00 us           154      SETATTR
>       0.18     101.73 us      28.00 us   700477.00 us        313379        FSTAT
>       0.23    1059.84 us      25.00 us  2716124.00 us         38255       LOOKUP
>       0.47    1024.11 us      54.00 us  6197164.00 us         81455     FXATTROP
>       1.72    2984.00 us      15.00 us 37098954.00 us        103020     FINODELK
>       5.92   44315.32 us      51.00 us 24731536.00 us         23957        FSYNC
>      13.27    2399.78 us      25.00 us 22089540.00 us        991005         READ
>      37.00    5980.43 us      52.00 us 22099889.00 us       1108976        WRITE
>      41.04    5452.75 us      13.00 us 22102452.00 us       1349053      INODELK
>
> Duration: 10026 seconds
>Data Read: 80046027759 bytes
> Data Written: 44496632320 bytes
>
> Interval 1 Stats:
>Block Size:256b+ 512b+
> 1024b+
>  No. of Reads:  983  2696
> 1059
> No. of Writes:0   

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
  51.00 us   3595513.00 us   131642    WRITE
 17.71    957.08 us   16.00 us  13700466.00 us  1508160   INODELK
 24.56   2546.42 us   26.00 us   5077347.00 us   786060   READ
 31.54  49651.63 us   47.00 us   3746331.00 us    51777   FSYNC

Duration: 10101 seconds
   Data Read: 101562897361 bytes
Data Written: 4834450432 bytes
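
The dominant fops can be ranked mechanically from profile text like the above; below is a hedged sketch using the four largest rows of the cumulative stats as sample input (on a live system, pipe the real `gluster volume profile ... info` output in instead):

```shell
# Hedged sketch: rank fops by %-latency from gluster profile output.
# The sample rows are copied from the cumulative stats above.
sample='  5.92   44315.32 us   51.00 us   24731536.00 us    23957   FSYNC
 13.27    2399.78 us   25.00 us   22089540.00 us   991005   READ
 37.00    5980.43 us   52.00 us   22099889.00 us  1108976   WRITE
 41.04    5452.75 us   13.00 us   22102452.00 us  1349053   INODELK'
# Sort numerically on the %-latency column and keep the fop name.
top_fop=$(printf '%s\n' "$sample" | sort -rn -k1,1 | head -1 | awk '{print $NF}')
echo "fop consuming the most latency: $top_fop"
```

Here INODELK and WRITE dominate, which is consistent with lock contention plus slow writes rather than a pure read problem.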


On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir  wrote:

> Thank you for your response.
>
> I have 4 gluster volumes.  3 are replica 2 + arbitrator.  replica bricks
> are on ovirt1 and ovirt2, arbitrator on ovirt3.  The 4th volume is replica
> 3, with a brick on all three ovirt machines.
>
> The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD (same
> in all three machines).  On ovirt3, the SSHD has reported hard IO failures,
> and that brick is offline.  However, the other two replicas are fully
> operational (although they still show contents in the heal info command
> that won't go away, but that may be the case until I replace the failed
> disk).
>
> What is bothering me is that ALL 4 gluster volumes are showing horrible
> performance issues.  At this point, as the bad disk has been completely
> offlined, I would expect gluster to perform at normal speed, but that is
> definitely not the case.
>
> I've also noticed that the performance hits seem to come in waves: things
> seem to work acceptably (but slow) for a while, then suddenly, its as if
> all disk IO on all volumes (including non-gluster local OS disk volumes for
> the hosts) pause for about 30 seconds, then IO resumes again.  During those
> times, I start getting VM not responding and host not responding notices as
> well as the applications having major issues.
>
> I've shut down most of my VMs and am down to just my essential core VMs
> (shedded about 75% of my VMs).  I still am experiencing the same issues.
>
> Am I correct in believing that once the failed disk was brought offline
> that performance should return to normal?
>
> On Tue, May 29, 2018 at 1:27 PM, Alex K  wrote:
>
>> I would check disks status and accessibility of mount points where your
>> gluster volumes reside.
>>
>> On Tue, May 29, 2018, 22:28 Jim Kusznir  wrote:
>>
>>> On one ovirt server, I'm now seeing these messages:
>>> [56474.239725] blk_update_request: 63 callbacks suppressed
>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>>>
>>>
>>>
>>>
>>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir 
>>> wrote:
>>>
>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>>
>>>> May 29 11:54:41 ovirt3 ovs-vsctl: 
>>>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
>>>> database connection failed (No such file or directory)
>>>> May 29 11:54:51 ovirt3 ovs-vsctl: 
>>>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
>>>> database connection failed (No such file or directory)
>>>> May 29 11:55:01 ovirt3 ovs-vsctl: 
>>>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
>>>> database connection failed (No such file or directory)
>>>> (appears a lot).
>>>>
>>>> I also found on the ssh session of that, some sysv warnings about the
>>>> backing disk for one of the gluster volumes (straight replica 3).  The
>>>> glusterfs process for that disk on that machine went offline.  Its my
>>>> understanding that it should continue to work with the other two machines
>>>> while I attempt to replace that disk, right?  Attempted writes (touching an
>>>> empty file) can take 15 seconds, repeating it later will be much faster.
>>>>
>>>> Gluster generates a bunch of different log files, I don't know what
>>>> ones you want, or from which machine(s).
>>>>
>>>> How do I do "volume profiling"?

[ovirt-users] Re: Gluster quorum

2018-05-29 Thread Jim Kusznir
I had the same problem when I upgraded to 4.2.  I found that if I went to
the brick in the UI and selected it, there was a "start" button in the
upper-right of the GUI.  Clicking that resolved this problem a few minutes
later.

I had to repeat this for each volume that showed a brick down when the
brick was not actually down.
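
The same fix can be applied from the command line; below is a hedged sketch (the volume name is an assumption, and the command is guarded so it is a no-op on machines without the gluster CLI):

```shell
# Hedged sketch: 'start ... force' restarts only bricks that are down,
# which is roughly what the UI "start" button does.  VOL is an assumption.
VOL=${VOL:-data}
if command -v gluster >/dev/null 2>&1; then
  gluster volume start "$VOL" force
  gluster volume status "$VOL"   # confirm the brick process came back
else
  echo "gluster CLI not found; run this on a gluster node"
fi
```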

--Jim

On Tue, May 29, 2018 at 6:34 AM, Demeter Tibor  wrote:

> Hi,
>
> I've successfully upgraded my hosts and I could raise the cluster level to
> 4.2.
> Everything seems fine, but the monitoring problem is not resolved. My
> bricks on the first node are shown down (red), but glusterfs is working
> fine (I verified in the terminal).
>
> I've attached my engine.log.
>
> Thanks in advance,
>
> R,
> Tibor
>
> - On May 28, 2018, at 14:59, Demeter Tibor wrote:
>
> Hi,
> Ok I will try it.
>
> In this case, is it possible to remove and re-add a host that is a member
> of an HA gluster? This is another task, but I need to separate my gluster
> network from my ovirtmgmt network.
> What is the recommended way to do this?
>
> It is not important now, but I will need to do it in the future.
>
> I will attach my engine.log after upgrade my host.
>
> Thanks,
> Regards.
>
> Tibor
>
>
> - On May 28, 2018, at 14:44, Sahina Bose wrote:
>
>
>
> On Mon, May 28, 2018 at 4:47 PM, Demeter Tibor 
> wrote:
>
>> Dear Sahina,
>>
>> Yes, exactly. I can check that check box, but I don't know how safe that
>> is. Is it safe?
>>
>
> It is safe - if you can ensure that only one host is put into maintenance
> at a time.
>
>>
>> I want to upgrade all of my hosts. Once that is done, will the monitoring
>> work correctly?
>>
>
> If it does not please provide engine.log again once you've upgraded all
> the hosts.
>
>
>> Thanks.
>> R.
>>
>> Tibor
>>
>>
>>
>> - On May 28, 2018, at 10:09, Sahina Bose wrote:
>>
>>
>>
>> On Mon, May 28, 2018 at 1:06 PM, Demeter Tibor 
>> wrote:
>>
>>> Hi,
>>>
>>> Could somebody please answer my question?
>>> It is very important for me; I have not been able to finish my upgrade
>>> process (from 4.1 to 4.2) since 9th May!
>>>
>>
>> Can you explain how the upgrade process is blocked due to the monitoring?
>> If it's because you cannot move the host to maintenance, can you try with
>> the option "Ignore quorum checks" enabled?
>>
>>
>>> Meanwhile - I don't know why - one of my two gluster volume seems UP
>>> (green) on the GUI. So, now only one is down.
>>>
>>> I need help. What can I do?
>>>
>>> Thanks in advance,
>>>
>>> Regards,
>>>
>>> Tibor
>>>
>>>
>>> - On May 23, 2018, at 21:09, Demeter Tibor wrote:
>>>
>>> Hi,
>>>
>>> I've updated again to the latest version, but there are no changes. All
>>> of the bricks on my first node are shown down in the GUI (in the console
>>> they are OK).
>>> An interesting thing: the "Self-Heal info" column shows "OK" for all
>>> hosts and all bricks, but the "Space used" column is zero for all
>>> hosts/bricks.
>>> Can I force remove and re-add my host to the cluster while it is a
>>> gluster member? Is that safe?
>>> What can I do?
>>>
>>> I haven't updated the other hosts while gluster isn't working fine (or
>>> the GUI doesn't detect it), so my other hosts remain on 4.1 :(
>>>
>>> Thanks in advance,
>>>
>>> Regards
>>>
>>> Tibor
>>>
>>> - On May 23, 2018, at 14:45, Denis Chapligin wrote:
>>>
>>> Hello!
>>>
>>> On Tue, May 22, 2018 at 11:10 AM, Demeter Tibor 
>>> wrote:
>>>

 Are there any changes with this bug?

 Still I haven't finished the upgrade process that I started on 9th May :(

 Please help me if you can.


>>>
>>> Looks like all required patches are already merged, so could you please
>>> update your engine again to the latest nightly build?
>>>
>>>
>>> ___
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>>
>>>
>>> ___
>>> Users mailing list -- users@ovirt.org
>>> To unsubscribe send an email to users-le...@ovirt.org
>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-
>>> guidelines/
>>> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/
>>> message/MRAAPZSRIXLAJZBV6TRDXXK7R2ISPSDK/
>>>
>>>
>>
>
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-
> guidelines/
> List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/
> message/OWA2I6AFZPO56Z2N6D25HUHLW6CGOUWL/
>
>
>
> ___
> Users mailing list -- users@ovirt.org
> To unsubscribe send an email to users-le...@ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-
> guidelines/
> List Archives: 

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
Thank you for your response.

I have 4 gluster volumes.  3 are replica 2 + arbiter.  Replica bricks
are on ovirt1 and ovirt2, with the arbiter on ovirt3.  The 4th volume is
replica 3, with a brick on all three ovirt machines.

The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD (same
in all three machines).  On ovirt3, the SSHD has reported hard IO failures,
and that brick is offline.  However, the other two replicas are fully
operational (although they still show contents in the heal info command
that won't go away, but that may be the case until I replace the failed
disk).

What is bothering me is that ALL 4 gluster volumes are showing horrible
performance issues.  At this point, as the bad disk has been completely
offlined, I would expect gluster to perform at normal speed, but that is
definitely not the case.

I've also noticed that the performance hits seem to come in waves: things
seem to work acceptably (but slow) for a while, then suddenly, it's as if
all disk IO on all volumes (including non-gluster local OS disk volumes for
the hosts) pause for about 30 seconds, then IO resumes again.  During those
times, I start getting VM not responding and host not responding notices as
well as the applications having major issues.

I've shut down most of my VMs and am down to just my essential core VMs
(shed about 75% of my VMs).  I still am experiencing the same issues.

Am I correct in believing that once the failed disk was brought offline
that performance should return to normal?
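
One way to answer this directly from the CLI, rather than waiting on the list, is to check per-brick status and pending heals for each volume; below is a hedged sketch in which the volume names are assumptions:

```shell
# Hedged sketch: confirm which bricks are actually offline and what still
# needs healing.  The volume names below are assumptions.
for VOL in engine data data-hdd iso; do
  if command -v gluster >/dev/null 2>&1; then
    gluster volume status "$VOL" detail   # per-brick online/offline state
    gluster volume heal "$VOL" info       # entries still pending heal
  else
    echo "skipping $VOL: gluster CLI not found"
  fi
done
```

If the failed brick shows as offline and heal counts are stable, the remaining slowness is coming from somewhere other than the dead disk.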

On Tue, May 29, 2018 at 1:27 PM, Alex K  wrote:

> I would check disks status and accessibility of mount points where your
> gluster volumes reside.
>
> On Tue, May 29, 2018, 22:28 Jim Kusznir  wrote:
>
>> On one ovirt server, I'm now seeing these messages:
>> [56474.239725] blk_update_request: 63 callbacks suppressed
>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>>
>>
>>
>>
>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir 
>> wrote:
>>
>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>
>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|
>>> unix:/var/run/openvswitch/db.sock: database connection failed (No such
>>> file or directory)
>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|
>>> unix:/var/run/openvswitch/db.sock: database connection failed (No such
>>> file or directory)
>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|
>>> unix:/var/run/openvswitch/db.sock: database connection failed (No such
>>> file or directory)
>>> (appears a lot).
>>>
>>> I also found on the ssh session of that, some sysv warnings about the
>>> backing disk for one of the gluster volumes (straight replica 3).  The
>>> glusterfs process for that disk on that machine went offline.  Its my
>>> understanding that it should continue to work with the other two machines
>>> while I attempt to replace that disk, right?  Attempted writes (touching an
>>> empty file) can take 15 seconds, repeating it later will be much faster.
>>>
>>> Gluster generates a bunch of different log files, I don't know what ones
>>> you want, or from which machine(s).
>>>
>>> How do I do "volume profiling"?
>>>
>>> Thanks!
>>>
>>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose  wrote:
>>>
>>>> Do you see errors reported in the mount logs for the volume? If so,
>>>> could you attach the logs?
>>>> Any issues with your underlying disks. Can you also attach output of
>>>> volume profiling?
>>>>
>>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir 
>>>> wrote:
>>>>
>>>>> Ok, things have gotten MUCH worse this morning.  I'm getting random
>>>>> errors from VMs, right now, about a third of my VMs have been paused due 
>>>>> to
>>>>> storage issues, and most of the remainin

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
Due to the cluster spiraling downward and increasing customer complaints, I
went ahead and finished the upgrade of the nodes to ovirt 4.2 and gluster
3.12.  It didn't seem to help at all.

I DO have one brick down on ONE of my 4 gluster
filesystems/exports/whatever.  The other 3 are fully available.  However, I
still see heavy IO wait, including on the perfectly healthy filesystem.
It's bad enough that I get ovirt e-mails warning of hosts down and back up,
and VMs on the good gluster filesystem are reporting IO Waits of greater
than 60% in top!  I have applications that are crashing due to the IO Wait
issues.

I do think I got glusterfs profiling running, but I don't know how to get a
useful report out (it's in the oVirt GUI).  I did see read and write
operations showing about 30 seconds; I would have expected that to be MUCH
better.  (As I write this, my core VoIP server is now showing 99.1% IOWait
load... and that is customer calls failing/dropping.)

PLEASE...how do I FIX this?

--Jim

On Tue, May 29, 2018 at 12:14 PM, Jim Kusznir  wrote:

> On one ovirt server, I'm now seeing these messages:
> [56474.239725] blk_update_request: 63 callbacks suppressed
> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>
>
>
>
> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir  wrote:
>
>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>
>> May 29 11:54:41 ovirt3 ovs-vsctl: 
>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
>> database connection failed (No such file or directory)
>> May 29 11:54:51 ovirt3 ovs-vsctl: 
>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
>> database connection failed (No such file or directory)
>> May 29 11:55:01 ovirt3 ovs-vsctl: 
>> ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock:
>> database connection failed (No such file or directory)
>> (appears a lot).
>>
>> I also found on the ssh session of that, some sysv warnings about the
>> backing disk for one of the gluster volumes (straight replica 3).  The
>> glusterfs process for that disk on that machine went offline.  Its my
>> understanding that it should continue to work with the other two machines
>> while I attempt to replace that disk, right?  Attempted writes (touching an
>> empty file) can take 15 seconds, repeating it later will be much faster.
>>
>> Gluster generates a bunch of different log files, I don't know what ones
>> you want, or from which machine(s).
>>
>> How do I do "volume profiling"?
>>
>> Thanks!
>>
>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose  wrote:
>>
>>> Do you see errors reported in the mount logs for the volume? If so,
>>> could you attach the logs?
>>> Any issues with your underlying disks. Can you also attach output of
>>> volume profiling?
>>>
>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir 
>>> wrote:
>>>
>>>> Ok, things have gotten MUCH worse this morning.  I'm getting random
>>>> errors from VMs, right now, about a third of my VMs have been paused due to
>>>> storage issues, and most of the remaining VMs are not performing well.
>>>>
>>>> At this point, I am in full EMERGENCY mode, as my production services
>>>> are now impacted, and I'm getting calls coming in with problems...
>>>>
>>>> I'd greatly appreciate help...VMs are running VERY slowly (when they
>>>> run), and they are steadily getting worse.  I don't know why.  I was seeing
>>>> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a
>>>> time (while the VM became unresponsive and any VMs I was logged into that
>>>> were linux were giving me the CPU stuck messages in my origional post).  Is
>>>> all this storage related?
>>>>
>>>> I also have two different gluster volumes for VM storage, and only one
>>>> had the issues, but now VMs in both are being affected at the same time and
>>

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
On one ovirt server, I'm now seeing these messages:
[56474.239725] blk_update_request: 63 callbacks suppressed
[56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
[56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
[56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
[56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
[56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
[56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
[56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
[56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
[56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
[56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
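
Those errors name only "dm-2", so it is worth mapping that back to a real device before replacing hardware; below is a hedged sketch (dm-2 is taken from the log above, and the sysfs paths are standard on any device-mapper host):

```shell
# Hedged sketch: resolve a dm-N name from the kernel log to its LVM or
# multipath name and the underlying disk(s).
DEV=${DEV:-dm-2}
if [ -r "/sys/block/$DEV/dm/name" ]; then
  echo "$DEV is device-mapper volume: $(cat /sys/block/"$DEV"/dm/name)"
  # slaves/ lists the physical device(s) beneath it, e.g. sda
  echo "built on: $(ls /sys/block/"$DEV"/slaves)"
else
  echo "$DEV not present here; run on the server showing the I/O errors"
fi
```

`lsblk` or `dmsetup ls` give the same mapping at a glance if available.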




On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir  wrote:

> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>
> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|
> unix:/var/run/openvswitch/db.sock: database connection failed (No such
> file or directory)
> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|
> unix:/var/run/openvswitch/db.sock: database connection failed (No such
> file or directory)
> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|1|db_ctl_base|ERR|
> unix:/var/run/openvswitch/db.sock: database connection failed (No such
> file or directory)
> (appears a lot).
>
> I also found on the ssh session of that, some sysv warnings about the
> backing disk for one of the gluster volumes (straight replica 3).  The
> glusterfs process for that disk on that machine went offline.  Its my
> understanding that it should continue to work with the other two machines
> while I attempt to replace that disk, right?  Attempted writes (touching an
> empty file) can take 15 seconds, repeating it later will be much faster.
>
> Gluster generates a bunch of different log files, I don't know what ones
> you want, or from which machine(s).
>
> How do I do "volume profiling"?
>
> Thanks!
>
> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose  wrote:
>
>> Do you see errors reported in the mount logs for the volume? If so, could
>> you attach the logs?
>> Any issues with your underlying disks. Can you also attach output of
>> volume profiling?
>>
>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir 
>> wrote:
>>
>>> Ok, things have gotten MUCH worse this morning.  I'm getting random
>>> errors from VMs, right now, about a third of my VMs have been paused due to
>>> storage issues, and most of the remaining VMs are not performing well.
>>>
>>> At this point, I am in full EMERGENCY mode, as my production services
>>> are now impacted, and I'm getting calls coming in with problems...
>>>
>>> I'd greatly appreciate help...VMs are running VERY slowly (when they
>>> run), and they are steadily getting worse.  I don't know why.  I was seeing
>>> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a
>>> time (while the VM became unresponsive and any VMs I was logged into that
>>> were linux were giving me the CPU stuck messages in my origional post).  Is
>>> all this storage related?
>>>
>>> I also have two different gluster volumes for VM storage, and only one
>>> had the issues, but now VMs in both are being affected at the same time and
>>> same way.
>>>
>>> --Jim
>>>
>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose  wrote:
>>>
>>>> [Adding gluster-users to look at the heal issue]
>>>>
>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir 
>>>> wrote:
>>>>
>>>>> Hello:
>>>>>
>>>>> I've been having some cluster and gluster performance issues lately.
>>>>> I also found that my cluster was out of date, and was trying to apply
>>>>> updates (hoping to fix some of these), and discovered the ovirt 4.1 repos
>>>>> were taken completely offline.  So, I was forced to begin an upgrade to
>>>>> 4.2.  According to docs I found/read, I needed only add the new repo, do a
>>>>> yum update, reboot, and be good on my hosts (did the yum update, the
>>>>> engine-setup on my hosted engine).  Things seemed to work relatively well,
>>>>> except for a gluster sync issue that showed up.
>>>>>
>>>>> My cluster is a 3 node hyperconverged cluster.  I upgraded the hosted
>>>>> engine first, then engine 3.  When engine 3 came back up, for some reason
>>>>> one of my gluster volumes would not s

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):

May 29 11:54:41 ovirt3 ovs-vsctl:
ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
connection failed (No such file or directory)
May 29 11:54:51 ovirt3 ovs-vsctl:
ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
connection failed (No such file or directory)
May 29 11:55:01 ovirt3 ovs-vsctl:
ovs|1|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
connection failed (No such file or directory)
(appears a lot).

I also found on the ssh session of that, some sysv warnings about the
backing disk for one of the gluster volumes (straight replica 3).  The
glusterfs process for that disk on that machine went offline.  It's my
understanding that it should continue to work with the other two machines
while I attempt to replace that disk, right?  Attempted writes (touching an
empty file) can take 15 seconds, repeating it later will be much faster.

Gluster generates a bunch of different log files; I don't know which ones
you want, or from which machine(s).

How do I do "volume profiling"?

Thanks!
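
For reference, the gluster CLI profiling workflow being asked about looks roughly like the following hedged sketch; "data" is an assumed volume name, and the commands are guarded so the snippet is harmless off a gluster node:

```shell
# Hedged sketch of gluster volume profiling from the CLI.
VOL=${VOL:-data}
if command -v gluster >/dev/null 2>&1; then
  gluster volume profile "$VOL" start   # begin collecting per-fop stats
  sleep 60                              # let it observe some real load
  gluster volume profile "$VOL" info    # cumulative + interval report
  gluster volume profile "$VOL" stop    # stop collecting when done
else
  echo "gluster CLI not found; run these commands on a gluster node"
fi
```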

On Tue, May 29, 2018 at 11:53 AM, Sahina Bose  wrote:

> Do you see errors reported in the mount logs for the volume? If so, could
> you attach the logs?
> Any issues with your underlying disks. Can you also attach output of
> volume profiling?
>
> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir  wrote:
>
>> Ok, things have gotten MUCH worse this morning.  I'm getting random
>> errors from VMs; right now about a third of my VMs have been paused due to
>> storage issues, and most of the remaining VMs are not performing well.
>>
>> At this point, I am in full EMERGENCY mode, as my production services are
>> now impacted, and I'm getting calls coming in with problems...
>>
>> I'd greatly appreciate help...VMs are running VERY slowly (when they
>> run), and they are steadily getting worse.  I don't know why.  I was seeing
>> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a
>> time (while the VM became unresponsive, and any Linux VMs I was logged into
>> were giving me the CPU-stuck messages from my original post).  Is
>> all this storage related?
>>
>> I also have two different gluster volumes for VM storage, and only one
>> had the issues, but now VMs in both are being affected at the same time and
>> same way.
>>
>> --Jim
>>
>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose  wrote:
>>
>>> [Adding gluster-users to look at the heal issue]
>>>
>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir 
>>> wrote:
>>>
>>>> Hello:
>>>>
>>>> I've been having some cluster and gluster performance issues lately.  I
>>>> also found that my cluster was out of date, and was trying to apply updates
>>>> (hoping to fix some of these), and discovered the ovirt 4.1 repos were
>>>> taken completely offline.  So, I was forced to begin an upgrade to 4.2.
>>>> According to docs I found/read, I needed only add the new repo, do a yum
>>>> update, reboot, and be good on my hosts (did the yum update, the
>>>> engine-setup on my hosted engine).  Things seemed to work relatively well,
>>>> except for a gluster sync issue that showed up.
>>>>
>>>> My cluster is a 3 node hyperconverged cluster.  I upgraded the hosted
>>>> engine first, then engine 3.  When engine 3 came back up, for some reason
>>>> one of my gluster volumes would not sync.  Here's sample output:
>>>>
>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4
>>>> 725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4
>>>> cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-4
>>>> 46b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-4
>>>> 4f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-4
>>>> 2a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4
>>>> ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4
>>>> 810-b45b-185e3ed65727/16f08231

[ovirt-users] Re: Gluster problems, cluster performance issues

2018-05-29 Thread Jim Kusznir
Ok, things have gotten MUCH worse this morning.  I'm getting random errors
from VMs; right now about a third of my VMs have been paused due to
storage issues, and most of the remaining VMs are not performing well.

At this point, I am in full EMERGENCY mode, as my production services are
now impacted, and I'm getting calls coming in with problems...

I'd greatly appreciate help...VMs are running VERY slowly (when they run),
and they are steadily getting worse.  I don't know why.  I was seeing CPU
peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a
time (while the VM became unresponsive, and any Linux VMs I was logged into
were giving me the CPU-stuck messages from my original post).  Is
all this storage related?

I also have two different gluster volumes for VM storage, and only one had
the issues, but now VMs in both are being affected at the same time and
same way.

--Jim

On Mon, May 28, 2018 at 10:50 PM, Sahina Bose  wrote:

> [Adding gluster-users to look at the heal issue]
>
> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir  wrote:
>
>> Hello:
>>
>> I've been having some cluster and gluster performance issues lately.  I
>> also found that my cluster was out of date, and was trying to apply updates
>> (hoping to fix some of these), and discovered the ovirt 4.1 repos were
>> taken completely offline.  So, I was forced to begin an upgrade to 4.2.
>> According to docs I found/read, I needed only add the new repo, do a yum
>> update, reboot, and be good on my hosts (did the yum update, the
>> engine-setup on my hosted engine).  Things seemed to work relatively well,
>> except for a gluster sync issue that showed up.
>>
>> My cluster is a 3 node hyperconverged cluster.  I upgraded the hosted
>> engine first, then engine 3.  When engine 3 came back up, for some reason
>> one of my gluster volumes would not sync.  Here's sample output:
>>
>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-
>> 4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-
>> 4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-
>> 446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-
>> 44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-
>> 42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-
>> 4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-
>> 4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-
>> 4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>> Status: Connected
>> Number of entries: 8
>>
>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-
>> 4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-
>> 4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-
>> 4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-
>> 446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-
>> 44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-
>> 4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-
>> 4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-
>> 42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>> Status: Connected
>> Number of entries: 8
>>
>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-
>> 4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-
>> 4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-
>> 4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>> /cc65f671-3377-494a

[ovirt-users] Gluster problems, cluster performance issues

2018-05-28 Thread Jim Kusznir
Hello:

I've been having some cluster and gluster performance issues lately.  I
also found that my cluster was out of date, and was trying to apply updates
(hoping to fix some of these), and discovered the ovirt 4.1 repos were
taken completely offline.  So, I was forced to begin an upgrade to 4.2.
According to the docs I found/read, I needed only to add the new repo, do a
yum update, and reboot on my hosts (I did the yum update, plus the
engine-setup on my hosted engine).  Things seemed to work relatively well,
except for a gluster sync issue that showed up.

My cluster is a 3 node hyperconverged cluster.  I upgraded the hosted
engine first, then engine 3.  When engine 3 came back up, for some reason
one of my gluster volumes would not sync.  Here's sample output:

[root@ovirt3 ~]# gluster volume heal data-hdd info
Brick 172.172.1.11:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
Status: Connected
Number of entries: 8

Brick 172.172.1.12:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
Status: Connected
Number of entries: 8

Brick 172.172.1.13:/gluster/brick3/data-hdd
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
/cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
Status: Connected
Number of entries: 8

-
It's been in this state for a couple of days now, and bandwidth monitoring
shows no appreciable data moving.  I've repeatedly commanded a full
heal from all three nodes in the cluster.  It's always the same files that
need healing.

When running gluster volume heal data-hdd statistics, I sometimes see
different information, but always some number of "heal failed" entries.  It
shows 0 for split brain.
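For reference, the heal operations described above correspond to these gluster CLI commands (volume name taken from this thread):

```shell
# Kick off a full self-heal crawl across all bricks
gluster volume heal data-hdd full

# List entries still pending heal (the output pasted above)
gluster volume heal data-hdd info

# Per-crawl statistics, including the "heal failed" counts mentioned here
gluster volume heal data-hdd statistics

# Explicitly list any split-brain entries
gluster volume heal data-hdd info split-brain
```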

I'm not quite sure what to do.  I suspect it may be due to nodes 1 and 2
still being on the older ovirt/gluster release, but I'm afraid to upgrade
and reboot them until I have a good gluster sync (don't need to create a
split brain issue).  How do I proceed with this?

Second issue: I've been experiencing VERY POOR performance on most of my
VMs.  To the tune that logging into a Windows 10 VM via Remote Desktop can
take 5 minutes, and launching QuickBooks inside said VM can easily take 10
minutes.  On some Linux VMs, I get random messages like this:
Message from syslogd@unifi at May 28 20:39:23 ...
 kernel:[6171996.308904] NMI watchdog: BUG: soft 

Re: [ovirt-users] Gluster: VM disk stuck in transfer; georep gone wonky

2018-03-20 Thread Jim Kusznir
Thank you for the replies.

While waiting, I found one more Google response that said to run
engine-setup.  I did that, and it fixed the issue; the VM is now running
again.

As to checking the logs, I'm not sure which ones to check...there are so
many in so many different places.

I was not able to detach the disk, as "an operation is currently in
process."  No matter what I did to the disk, it was essentially still
locked, even though it no longer said "locked" after I cleared the lock with
the unlock script.

So, it appears running engine-setup can really fix a bunch of stuff!  An
important tip to remember...

--Jim

On Mon, Mar 19, 2018 at 11:55 PM, Tony Brian Albers <t...@kb.dk> wrote:

> I read somewhere about clearing out wrong stuff from the UI by manually
> editing the database, maybe you can try searching for something like that.
>
> With regards to the VM, I'd probably just delete it, edit the DB and
> remove all sorts of references to it and then recover it from backup.
>
> Is there nothing about all this in the ovirt logs on the engine and the
> host? It might point you in the right direction.
>
> HTH
>
> /tony
>
>
> On 20/03/18 07:48, Jim Kusznir wrote:
> > Unfortunately, I came under heavy pressure to get this VM back up.  So,
> > I did more Googling and attempted to recover myself.  I've gotten
> > closer, but still not quite there.
> >
> > I found this post:
> >
> > http://lists.ovirt.org/pipermail/users/2015-November/035686.html
> >
> > Which gave me the unlock tool, which was successful in unlocking the
> > disk.  Unfortunately, it did not delete the task, nor did ovirt do so on
> > its own after the disk was unlocked.
> >
> > So I found taskcleaner.sh in the same directory and attempted to
> > clean the task out, except it doesn't seem to see the task (none of
> > the show-tasks or delete-all options seemed to work).  I did
> > still have the task UUID from the GUI, so I attempted to use that, but
> > all I got back was a "t" on one line and a "0" on the next, so I have no
> > idea what that was supposed to mean.  In any case, the web UI still
> > shows the task, still won't let me start the VM, and appears convinced
> > it's still copying.  I've tried restarting the engine and vdsm on the
> > SPM; neither has helped.  I can't find any evidence of the task on the
> > command line, only in the UI.
> >
> > I'd create a new VM if i could rescue the image, but I'm not sure I can
> > manage to get this image accepted in another VM
> >
> > How do I recover now?
> >
> > --Jim
> >
> > On Mon, Mar 19, 2018 at 9:38 AM, Jim Kusznir <j...@palousetech.com
> > <mailto:j...@palousetech.com>> wrote:
> >
> > Hi all:
> >
> > Sorry for yet another semi-related message to the list.  In my
> > attempts to troubleshoot and verify some suspicions on the nature of
> > the performance problems I posted under "Major Performance Issues
> > with gluster", I attempted to move one of my problem VM's back to
> > the original storage (SSD-backed).  It appeared to be moving fine,
> > but last night froze at 84%.  This morning (8hrs later), its still
> > at 84%.
> >
> > I need to get that VM back up and running, but I don't know how...It
> > seems to be stuck in limbo.
> >
> > The only thing I explicitly did last night as well that may have
> > caused an issue is finally set up and activated georep to an offsite
> > backup machine.  That too seems to have gone a bit wonky.  On the
> > ovirt server side, it shows normal with all but data-hdd show a last
> > sync'ed time of 3am (which matches my bandwidth graphs for the WAN
> > connections involved).  data-hdd (the new disk-backed storage with
> > most of my data in it) shows not yet synced, but I'm also not
> > currently seeing bandwidth usage anymore.
> >
> > I logged into the georep destination box, and found system load a
> > bit high, a bunch of gluster and rsync processes running, and both
> > data and data-hdd using MORE disk space than the original (data-hdd
> > using 4x more disk space than is on the master node).  Not sure what
> > to do about this; I paused the replication from the cluster, but
> > that hasn't seemed to have an effect on the georep destination.
> >
> > I promise I'll stop trying things until I get guidance from the
> > list!  Please do help; I need the VM HDD unstuck so I can start it.
> >
> > Thanks!
> > --Jim
> >

Re: [ovirt-users] Gluster: VM disk stuck in transfer; georep gone wonky

2018-03-20 Thread Jim Kusznir
Unfortunately, I came under heavy pressure to get this VM back up.  So, I
did more Googling and attempted to recover myself.  I've gotten closer, but
still not quite there.

I found this post:

http://lists.ovirt.org/pipermail/users/2015-November/035686.html

Which gave me the unlock tool, which was successful in unlocking the disk.
Unfortunately, it did not delete the task, nor did ovirt do so on its own
after the disk was unlocked.
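For reference, the unlock tool from that post ships with the engine under its dbutils directory. A hedged sketch of typical usage (exact flags vary by version, so check -h first; the type and ID arguments below are illustrative):

```shell
cd /usr/share/ovirt-engine/setup/dbutils
./unlock_entity.sh -h              # list supported options for this version
./unlock_entity.sh -q -t all       # query which entities are currently locked
./unlock_entity.sh -t disk <uuid>  # unlock a specific disk image (illustrative)
```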

So I found taskcleaner.sh in the same directory and attempted to clean
the task out, except it doesn't seem to see the task (none of the
show-tasks or delete-all options seemed to work).  I did still have
the task UUID from the GUI, so I attempted to use that, but all I got back
was a "t" on one line and a "0" on the next, so I have no idea what that
was supposed to mean.  In any case, the web UI still shows the task, still
won't let me start the VM, and appears convinced it's still copying.  I've
tried restarting the engine and vdsm on the SPM; neither has helped.  I
can't find any evidence of the task on the command line, only in the UI.

I'd create a new VM if i could rescue the image, but I'm not sure I can
manage to get this image accepted in another VM

How do I recover now?

--Jim

On Mon, Mar 19, 2018 at 9:38 AM, Jim Kusznir <j...@palousetech.com> wrote:

> Hi all:
>
> Sorry for yet another semi-related message to the list.  In my attempts to
> troubleshoot and verify some suspicions on the nature of the performance
> problems I posted under "Major Performance Issues with gluster", I
> attempted to move one of my problem VM's back to the original storage
> (SSD-backed).  It appeared to be moving fine, but last night froze at 84%.
> This morning (8 hrs later), it's still at 84%.
>
> I need to get that VM back up and running, but I don't know how...It seems
> to be stuck in limbo.
>
> The only thing I explicitly did last night as well that may have caused an
> issue is finally set up and activated georep to an offsite backup machine.
> That too seems to have gone a bit wonky.  On the ovirt server side, it
> shows normal with all but data-hdd show a last sync'ed time of 3am (which
> matches my bandwidth graphs for the WAN connections involved).  data-hdd
> (the new disk-backed storage with most of my data in it) shows not yet
> synced, but I'm also not currently seeing bandwidth usage anymore.
>
> I logged into the georep destination box, and found system load a bit
> high, a bunch of gluster and rsync processes running, and both data and
> data-hdd using MORE disk space than the origional (data-hdd using 4x more
> disk space than is on the master node).  Not sure what to do about this; I
> paused the replication from the cluster, but that hasn't seem to had an
> effect on the georep destination.
>
> I promise I'll stop trying things until I get guidance from the list!
> Please do help; I need the VM HDD unstuck so I can start it.
>
> Thanks!
> --Jim
>
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Major Performance Issues with gluster

2018-03-19 Thread Jim Kusznir
Here's gluster volume info:

[root@ovirt2 ~]# gluster volume info

Volume Name: data
Type: Replicate
Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
server.allow-insecure: on
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 8
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on

Volume Name: data-hdd
Type: Replicate
Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.172.1.11:/gluster/brick3/data-hdd
Brick2: 172.172.1.12:/gluster/brick3/data-hdd
Brick3: 172.172.1.13:/gluster/brick3/data-hdd
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
transport.address-family: inet
performance.readdir-ahead: on

Volume Name: engine
Type: Replicate
Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on

Volume Name: iso
Type: Replicate
Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: off
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
performance.low-prio-threads: 32
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-wait-qlength: 1
cluster.shd-max-threads: 6
network.ping-timeout: 30
user.cifs: off
nfs.disable: on
performance.strict-o-direct: on

--
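One thing worth noting in the output above: data-hdd carries almost none of the tunables set on data and engine (sharding, remote-dio, eager-lock, and so on). If that difference is unintentional, the usual way to apply the virtualization defaults in one step is gluster's predefined option group. A sketch, assuming the group file exists on this version:

```shell
# Apply the predefined "virt" option group to the volume
gluster volume set data-hdd group virt

# Spot-check a few of the options it sets (illustrative verification)
gluster volume get data-hdd all | grep -E 'shard|remote-dio|eager-lock'
```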

When I try and turn on profiling, I get:

[root@ovirt2 ~]# gluster volume profile data-hdd start
Another transaction is in progress for data-hdd. Please try again after
sometime.

I don't know what that other transaction is, but I am having some "odd
behavior" this morning, like a VM disk move between data and data-hdd that
got stuck at 84% overnight.

I've been asking on IRC how to "un-stick" this transfer, as the VM cannot
be started, and I can't seem to do anything about it.

--Jim

On Mon, Mar 19, 2018 at 2:14 AM, Sahina Bose <sab...@redhat.com> wrote:

>
>
> On Mon, Mar 19, 2018 at 7:39 AM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> Hello:
>>
>> This past week, I created a new gluster store, as I was running out of
>> disk space on my main, SSD-backed storage pool.  I used 2TB Seagate
>> FireCuda drives (hybrid SSD/spinning).  Hardware is Dell R610's with
>> integral PERC/6i cards.  I placed one disk per machine, exported the disk
>> as a single disk volume from the raid controller, formatted it XFS, mounted
>> it, and dedicated it to a new replica 3 gluster volume.
>>
>> Since doing so, I've been having major performance problems.  One of my
>> windows VMs sits at 100% disk utilization nearly continou

[ovirt-users] gluster self-heal takes cluster offline

2018-03-15 Thread Jim Kusznir
Hi all:

I'm trying to understand why/how (and most importantly, how to fix) a
substantial issue I had last night.  This happened one other time, but I
didn't know/understand all the parts associated with it until last night.

I have a 3-node hyperconverged (self-hosted engine, Gluster on each node)
cluster.  Gluster is replica 2 + arbiter.  Current network configuration
is 2x GigE in load balance ("LAG group" on the switch), plus one GigE from
each server on a separate VLAN, intended for Gluster (but not used).  Server
hardware is Dell R610's, each server with an SSD in it.  Servers 1 and 2 have
the full replica; server 3 is the arbiter.

I put server 2 into maintenance so I could work on the hardware, including
turning it off and such.  In the course of the work, I found that I needed to
reconfigure the SSD's partitioning somewhat, and it resulted in wiping the
data partition (storing VM images).  I figured it was no big deal; gluster
would rebuild that in short order.  I did take care of the extended-attribute
settings and the like, and when I booted it up, gluster came up as expected
and began rebuilding the disk.

The problem is that suddenly my entire cluster got very sluggish.  The
engine was marking nodes and VMs as failed and un-failing them throughout the
system, fairly randomly.  It didn't matter which node the engine or VM was
on.  At one point, it power-cycled server 1 for being "non-responsive" (even
though everything was running on it, and the gluster rebuild was working on
it).  As a result of this, about 6 VMs were killed and my entire gluster
system went down hard (suspending all remaining VMs and the engine), as
there were no remaining full copies of the data.  After several minutes
(these are Dell servers, after all...), server 1 came back up, and gluster
resumed the rebuild, and came online on the cluster.  I had to manually
(virtsh command) unpause the engine, and then struggle through trying to
get critical VMs back up.  Everything was super slow, and load averages on
the servers were often seen in excess of 80 (these are 8 core / 16 thread
boxes).  Actual CPU usage (reported by top) was rarely above 40% (inclusive
of all CPUs) for any one server. Glusterfs was often seen using 180%-350%
of a CPU on server 1 and 2.

I ended up putting the cluster in global HA maintenance mode and disabling
power fencing on the nodes until the process finished.  On at least two
occasions a functional node was marked bad, and had fencing
not been disabled, a node would have rebooted, further exacerbating
the problem.
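For reference, global HA maintenance is toggled from any hosted-engine host with the standard CLI:

```shell
hosted-engine --set-maintenance --mode=global  # suspend HA agent actions cluster-wide
hosted-engine --vm-status                      # output should flag global maintenance
hosted-engine --set-maintenance --mode=none    # re-enable HA monitoring afterwards
```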

It's clear that the gluster rebuild overloaded things and caused the
problem.  I don't know why the load was so high (even iowait was low), but
load averages were definitely tied to the glusterfs CPU utilization %.  At
no point did I have any problems pinging any machine (host or VM) unless
the engine decided it was dead and killed it.

Why did my system bite it so hard with the rebuild?  I babied it along
until the rebuild was complete, after which it returned to normal operation.

As of this event, all networking (host/engine management, gluster, and VM
network) were on the same vlan.  I'd love to move things off, but so far
any attempt to do so breaks my cluster.  How can I move my management
interfaces to a separate VLAN/IP Space?  I also want to move Gluster to its
own private space, but it seems if I change anything in the peers file, the
entire gluster cluster goes down.  The dedicated gluster network is listed
as a secondary hostname for all peers already.

Will the above network reconfigurations be enough?  I got the impression
that the issue may not have been purely network based, but possibly server
IO overload.  Is this likely / right?

I appreciate input.  I don't think gluster's recovery is supposed to do as
much damage as it did the last two or three times any healing was required.

Thanks!
--Jim


Re: [ovirt-users] hyperconverged question

2017-09-01 Thread Jim Kusznir
I can confirm that I did set it up manually, and I did specify backupvol,
and in the "manage domain" storage settings, I do have under mount
options, backup-volfile-servers=192.168.8.12:192.168.8.13  (and this was
done at initial install time).

The "used managed gluster" checkbox is NOT checked, and if I check it and
save settings, next time I go in it is not checked.

--Jim
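For anyone replicating this, the two settings described above look roughly like this (the hosted-engine.conf key is the usual one for this era of oVirt, but verify on your install):

```shell
# Storage-domain side: set under Manage Domain -> Mount Options in the UI:
#   backup-volfile-servers=192.168.8.12:192.168.8.13

# Hosted-engine side: the same option in /etc/ovirt-hosted-engine/hosted-engine.conf:
#   mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13
grep mnt_options /etc/ovirt-hosted-engine/hosted-engine.conf  # quick check
```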

On Fri, Sep 1, 2017 at 2:08 PM, Charles Kozler <ckozler...@gmail.com> wrote:

> @ Jim - here is my setup which I will test in a few (brand new cluster)
> and report back what I found in my tests
>
> - 3x servers direct connected via 10Gb
> - 2 of those 3 setup in ovirt as hosts
> - Hosted engine
> - Gluster replica 3 (no arbiter) for all volumes
> - 1x engine volume gluster replica 3 manually configured (not using ovirt
> managed gluster)
> - 1x datatest volume (20gb) replica 3 manually configured (not using ovirt
> managed gluster)
> - 1x nfstest domain served from some other server in my infrastructure
> which, at the time of my original testing, was master domain
>
> I tested this earlier and all VMs stayed online. However, the ovirt cluster
> reported the DC/cluster down; all VMs stayed up.
>
> As I am now typing this, can you confirm you setup your gluster storage
> domain with backupvol? Also, confirm you updated hosted-engine.conf with
> backupvol mount option as well?
>
> On Fri, Sep 1, 2017 at 4:22 PM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> So, after reading the first document twice and the 2nd link thoroughly
>> once, I believe that the arbiter volume should be sufficient and count
>> toward replica/split-brain quorum.  E.g., if any one full replica is down,
>> and the arbiter and the other replica are up, then it should have quorum and all
>> should be good.
>>
>> I think my underlying problem has to do more with config than the replica
>> state.  That said, I did size the drive on my 3rd node planning to have an
>> identical copy of all data on it, so I'm still not opposed to making it a
>> full replica.
>>
>> Did I miss something here?
>>
>> Thanks!
>>
>> On Fri, Sep 1, 2017 at 11:59 AM, Charles Kozler <ckozler...@gmail.com>
>> wrote:
>>
>>> These can get a little confusing but this explains it best:
>>> https://gluster.readthedocs.io/en/latest/Administrator
>>> %20Guide/arbiter-volumes-and-quorum/#replica-2-and-replica-3-volumes
>>>
>>> Basically in the first paragraph they are explaining why you cant have
>>> HA with quorum for 2 nodes. Here is another overview doc that explains some
>>> more
>>>
>>> http://openmymind.net/Does-My-Replica-Set-Need-An-Arbiter/
>>>
>>> From my understanding, the arbiter is good for resolving split brains. Quorum
>>> and arbiter are two different things, though: quorum is a mechanism to help
>>> you **avoid** split brain, and the arbiter is there to help gluster resolve split
>>> brain by voting and other internal mechanics (as outlined in link 1). How
>>> did you create the volume exactly - what command? It looks to me like you
>>> created it with 'gluster volume create replica 2 arbiter 1 {}' per your
>>> earlier mention of "replica 2 arbiter 1". That being said, if you did that
>>> and then set up quorum in the volume configuration, this would cause your
>>> gluster to halt once quorum was lost (as you saw, until you recovered
>>> node 1)
>>>
>>> As you can see from the docs, there is still a corner case for getting
>>> into split brain with replica 3, which again, is where the arbiter would help
>>> gluster resolve it
>>>
>>> I need to amend my previous statement: I was told that the arbiter volume
>>> does not store data, only metadata. I cannot find anything in the docs
>>> backing this up; however, it would make sense for it to be so. That being said,
>>> in my setup, I would not include my arbiter or my third node in my ovirt VM
>>> cluster component. I would keep it completely separate
>>>
>>>
>>> On Fri, Sep 1, 2017 at 2:46 PM, Jim Kusznir <j...@palousetech.com> wrote:
>>>
>>>> I'm now also confused as to what the point of an arbiter is / what it
>>>> does / why one would use it.
>>>>
>>>> On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir <j...@palousetech.com>
>>>> wrote:
>>>>
>>>>> Thanks for the help!
>>>>>
>>>>> Here's my gluster volume info for the data export/brick (I have 3:
>>>>> data, engine, and iso, but they're all configured the same

Re: [ovirt-users] hyperconverged question

2017-09-01 Thread Jim Kusznir
So, after reading the first document twice and the 2nd link thoroughly
once, I believe that the arbiter volume should be sufficient and count
for replica / split-brain purposes.  E.g., if any one full replica is down, and the
arbiter and the other replica are up, then it should have quorum and all
should be good.

I think my underlying problem has to do more with config than the replica
state.  That said, I did size the drive on my 3rd node planning to have an
identical copy of all data on it, so I'm still not opposed to making it a
full replica.

Did I miss something here?

Thanks!

On Fri, Sep 1, 2017 at 11:59 AM, Charles Kozler <ckozler...@gmail.com>
wrote:

> These can get a little confusing but this explains it best:
> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#replica-2-and-replica-3-volumes
>
> Basically, in the first paragraph they are explaining why you can't have HA
> with quorum for 2 nodes. Here is another overview doc that explains some
> more
>
> http://openmymind.net/Does-My-Replica-Set-Need-An-Arbiter/
>
> From my understanding, the arbiter is good for resolving split brains. Quorum
> and arbiter are two different things, though: quorum is a mechanism to help
> you **avoid** split brain, and the arbiter is there to help gluster resolve split
> brain by voting and other internal mechanics (as outlined in link 1). How
> did you create the volume exactly - what command? It looks to me like you
> created it with 'gluster volume create replica 2 arbiter 1 {}' per your
> earlier mention of "replica 2 arbiter 1". That being said, if you did that
> and then set up quorum in the volume configuration, this would cause your
> gluster to halt once quorum was lost (as you saw, until you recovered
> node 1)
>
> As you can see from the docs, there is still a corner case for getting into
> split brain with replica 3, which again, is where the arbiter would help
> gluster resolve it
>
> I need to amend my previous statement: I was told that the arbiter volume does
> not store data, only metadata. I cannot find anything in the docs backing
> this up; however, it would make sense for it to be so. That being said, in my
> setup, I would not include my arbiter or my third node in my ovirt VM
> cluster component. I would keep it completely separate
>
>
> On Fri, Sep 1, 2017 at 2:46 PM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> I'm now also confused as to what the point of an arbiter is / what it
>> does / why one would use it.
>>
>> On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir <j...@palousetech.com> wrote:
>>
>>> Thanks for the help!
>>>
>>> Here's my gluster volume info for the data export/brick (I have 3: data,
>>> engine, and iso, but they're all configured the same):
>>>
>>> Volume Name: data
>>> Type: Replicate
>>> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
>>> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
>>> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
>>> Options Reconfigured:
>>> performance.strict-o-direct: on
>>> nfs.disable: on
>>> user.cifs: off
>>> network.ping-timeout: 30
>>> cluster.shd-max-threads: 8
>>> cluster.shd-wait-qlength: 1
>>> cluster.locking-scheme: granular
>>> cluster.data-self-heal-algorithm: full
>>> performance.low-prio-threads: 32
>>> features.shard-block-size: 512MB
>>> features.shard: on
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> network.remote-dio: enable
>>> cluster.eager-lock: enable
>>> performance.stat-prefetch: off
>>> performance.io-cache: off
>>> performance.read-ahead: off
>>> performance.quick-read: off
>>> performance.readdir-ahead: on
>>> server.allow-insecure: on
>>> [root@ovirt1 ~]#
>>>
>>>
>>> all 3 of my brick nodes ARE also members of the virtualization cluster
>>> (including ovirt3).  How can I convert ovirt3 into a full replica instead of
>>> just an arbiter?
>>>
>>> Thanks!
>>> --Jim
>>>
>>> On Fri, Sep 1, 2017 at 9:09 AM, Charles Kozler <ckozler...@gmail.com>
>>> wrote:
>>>
>>>> @Kasturi - Looks good now. The cluster showed down for a moment but VMs
>>>> stayed up in their appropriate places. Thanks!
>>

Re: [ovirt-users] hyperconverged question

2017-09-01 Thread Jim Kusznir
Thank you!

I created my cluster following these instructions:

https://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/

(I built it about 10 months ago)

I used their recipe for automated gluster node creation.  Originally I
thought I had 3 replicas, then I started realizing that node 3's disk usage
was essentially nothing compared to nodes 1 and 2, and eventually on this
list discovered that I had an arbiter.  Currently I am running on a 1Gbps
backbone, but I can dedicate a gig port (or even do bonded gig -- my
servers have four 1Gbps interfaces, and my switch is only used for this
cluster, so it has the ports to hook them all up).  I am planning on a
10Gbps upgrade once I bring in some more cash to pay for it.

Last night, nodes 2 and 3 were up, and I rebooted node 1 for updates.  As
soon as it shut down, my cluster halted (including the hosted engine), and
everything got messy.  When the node came back up, I still had to recover
the hosted engine via the command line, then could go in and start unpausing
my VMs.  I'm glad it happened at 8pm at night...it would have been very ugly
if it had happened during the day.  I had thought I had enough redundancy in
the cluster that I could take down any 1 node and not have an issue...that
definitely is not what happened.
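Editor's aside: the outage described above is consistent with how majority quorum counts votes. A toy sketch of just the counting rule (my own illustration, not gluster code; gluster's client quorum for a replica-2+arbiter volume additionally requires that at least one of the up bricks be a data brick):

```shell
# Majority quorum: strictly more than half of the voting members up.
has_quorum() {
    # $1 = votes up, $2 = total votes
    [ $(( $1 * 2 )) -gt "$2" ] && echo yes || echo no
}

has_quorum 2 3   # yes -- 3 voting bricks (2 data + 1 arbiter), one down
has_quorum 1 3   # no  -- two of three bricks down halts writes
has_quorum 1 2   # no  -- plain 2-node replica: 1 of 2 is exactly 50%
```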

--Jim

On Fri, Sep 1, 2017 at 11:59 AM, Charles Kozler <ckozler...@gmail.com>
wrote:

> These can get a little confusing but this explains it best:
> https://gluster.readthedocs.io/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#replica-2-and-replica-3-volumes
>
> Basically, in the first paragraph they are explaining why you can't have HA
> with quorum for 2 nodes. Here is another overview doc that explains some
> more
>
> http://openmymind.net/Does-My-Replica-Set-Need-An-Arbiter/
>
> From my understanding, the arbiter is good for resolving split brains. Quorum
> and arbiter are two different things, though: quorum is a mechanism to help
> you **avoid** split brain, and the arbiter is there to help gluster resolve split
> brain by voting and other internal mechanics (as outlined in link 1). How
> did you create the volume exactly - what command? It looks to me like you
> created it with 'gluster volume create replica 2 arbiter 1 {}' per your
> earlier mention of "replica 2 arbiter 1". That being said, if you did that
> and then set up quorum in the volume configuration, this would cause your
> gluster to halt once quorum was lost (as you saw, until you recovered
> node 1)
>
> As you can see from the docs, there is still a corner case for getting into
> split brain with replica 3, which again, is where the arbiter would help
> gluster resolve it
>
> I need to amend my previous statement: I was told that the arbiter volume does
> not store data, only metadata. I cannot find anything in the docs backing
> this up; however, it would make sense for it to be so. That being said, in my
> setup, I would not include my arbiter or my third node in my ovirt VM
> cluster component. I would keep it completely separate
>
>
> On Fri, Sep 1, 2017 at 2:46 PM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> I'm now also confused as to what the point of an arbiter is / what it
>> does / why one would use it.
>>
>> On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir <j...@palousetech.com> wrote:
>>
>>> Thanks for the help!
>>>
>>> Here's my gluster volume info for the data export/brick (I have 3: data,
>>> engine, and iso, but they're all configured the same):
>>>
>>> Volume Name: data
>>> Type: Replicate
>>> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
>>> Status: Started
>>> Snapshot Count: 0
>>> Number of Bricks: 1 x (2 + 1) = 3
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
>>> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
>>> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
>>> Options Reconfigured:
>>> performance.strict-o-direct: on
>>> nfs.disable: on
>>> user.cifs: off
>>> network.ping-timeout: 30
>>> cluster.shd-max-threads: 8
>>> cluster.shd-wait-qlength: 1
>>> cluster.locking-scheme: granular
>>> cluster.data-self-heal-algorithm: full
>>> performance.low-prio-threads: 32
>>> features.shard-block-size: 512MB
>>> features.shard: on
>>> storage.owner-gid: 36
>>> storage.owner-uid: 36
>>> cluster.server-quorum-type: server
>>> cluster.quorum-type: auto
>>> network.remote-dio: enable
>>> cluster.eager-lock: enable
>>> performance.stat-prefetch: off
>>> performance.io-cache: off
>>> performance.read-ahead: o

Re: [ovirt-users] hyperconverged question

2017-09-01 Thread Jim Kusznir
I'm now also confused as to what the point of an arbiter is / what it does
/ why one would use it.

On Fri, Sep 1, 2017 at 11:44 AM, Jim Kusznir <j...@palousetech.com> wrote:

> Thanks for the help!
>
> Here's my gluster volume info for the data export/brick (I have 3: data,
> engine, and iso, but they're all configured the same):
>
> Volume Name: data
> Type: Replicate
> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
> Options Reconfigured:
> performance.strict-o-direct: on
> nfs.disable: on
> user.cifs: off
> network.ping-timeout: 30
> cluster.shd-max-threads: 8
> cluster.shd-wait-qlength: 1
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> performance.low-prio-threads: 32
> features.shard-block-size: 512MB
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> performance.readdir-ahead: on
> server.allow-insecure: on
> [root@ovirt1 ~]#
>
>
> all 3 of my brick nodes ARE also members of the virtualization cluster
> (including ovirt3).  How can I convert ovirt3 into a full replica instead of
> just an arbiter?
>
> Thanks!
> --Jim
>
> On Fri, Sep 1, 2017 at 9:09 AM, Charles Kozler <ckozler...@gmail.com>
> wrote:
>
>> @Kasturi - Looks good now. The cluster showed down for a moment but VMs
>> stayed up in their appropriate places. Thanks!
>>
>> < Anyone on this list please feel free to correct my response to Jim if
>> it's wrong >
>>
>> @ Jim - If you can share your gluster volume info / status I can confirm
>> (to the best of my knowledge). From my understanding, If you setup the
>> volume with something like 'gluster volume set  group virt' this will
>> configure some quorum options as well, Ex: http://i.imgur.com/Mya4N5o.png
>>
>> While, yes, you are configured for an arbiter node, you're still losing
>> quorum by dropping from 2 -> 1. You would need 4 nodes with 1 being an arbiter
>> to configure quorum, which is in effect 3 writable nodes and 1 arbiter. If
>> one gluster node drops, you still have 2 up. Although in this case, you
>> probably wouldn't need the arbiter at all
>>
>> If you are so configured, you can drop the quorum settings and just let the
>> arbiter run, since you're not using the arbiter node in your VM cluster part
>> (I believe), just the storage cluster part. When using quorum, you need > 50%
>> of the cluster being up at one time. Since you have 3 nodes with 1 arbiter,
>> you're actually losing 1/2, which == 50%, which == degraded / hindered gluster
>>
>> Again, this is to the best of my knowledge based on other quorum-backed
>> software... and this is what I understand from testing with gluster and
>> ovirt thus far
>>
>> On Fri, Sep 1, 2017 at 11:53 AM, Jim Kusznir <j...@palousetech.com> wrote:
>>
>>> Huh... OK, how do I convert the arbiter to a full replica, then?  I was
>>> misinformed when I created this setup.  I thought the arbiter held
>>> enough metadata that it could validate or repudiate any one replica (kinda
>>> like the parity drive in a RAID-4 array).  I was also under the impression
>>> that one replica + arbiter is enough to keep the array online and
>>> functional.
>>>
>>> --Jim
>>>
>>> On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler <ckozler...@gmail.com>
>>> wrote:
>>>
>>>> @ Jim - you have only two data bricks and lost quorum. The arbiter only
>>>> stores metadata, no actual files. So yes, you were running in degraded mode,
>>>> so some operations were hindered.
>>>>
>>>> @ Sahina - Yes, this actually worked fine for me once I did that.
>>>> However, the issue I am still facing, is when I go to create a new gluster
>>>> storage domain (replica 3, hyperconverged) and I tell it "Host to use" and
>>>> I select that host. If I fail that host, all VMs halt. I do not recall this
>>>> in 3.6 or early 4.0. This to me makes it seem like this is "pinning" a node
>>>> to a volume and vice versa like you could, for instance, for a singular
>>>

Re: [ovirt-users] hyperconverged question

2017-09-01 Thread Jim Kusznir
Thanks for the help!

Here's my gluster volume info for the data export/brick (I have 3: data,
engine, and iso, but they're all configured the same):

Volume Name: data
Type: Replicate
Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
Options Reconfigured:
performance.strict-o-direct: on
nfs.disable: on
user.cifs: off
network.ping-timeout: 30
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 1
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
performance.low-prio-threads: 32
features.shard-block-size: 512MB
features.shard: on
storage.owner-gid: 36
storage.owner-uid: 36
cluster.server-quorum-type: server
cluster.quorum-type: auto
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.readdir-ahead: on
server.allow-insecure: on
[root@ovirt1 ~]#


all 3 of my brick nodes ARE also members of the virtualization cluster
(including ovirt3).  How can I convert ovirt3 into a full replica instead of
just an arbiter?

Thanks!
--Jim
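Editor's note on the conversion question above, since it recurs in this thread: the usual gluster approach is to remove the arbiter brick (dropping to plain replica 2) and then add a full data brick back as replica 3. The commands below are a hedged sketch only: the brick hostnames are taken from the volume info above, the new brick path is an assumption, and exact syntax varies between gluster releases, so verify against the gluster documentation for your version before running anything on a live cluster.

```shell
# Sketch only -- verify against your gluster version's docs first, and
# run only with all bricks healthy.
# 1) Remove the arbiter brick, reducing the volume to plain replica 2.
gluster volume remove-brick data replica 2 \
    ovirt3.nwfiber.com:/gluster/brick2/data force

# 2) Re-add a brick on ovirt3 as a full data copy (replica 3).  The
#    brick directory must be empty; "data-full" is a placeholder path.
gluster volume add-brick data replica 3 \
    ovirt3.nwfiber.com:/gluster/brick2/data-full

# 3) Trigger and monitor a full self-heal so the new brick gets a
#    complete copy of the data.
gluster volume heal data full
gluster volume heal data info
```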

On Fri, Sep 1, 2017 at 9:09 AM, Charles Kozler <ckozler...@gmail.com> wrote:

> @Kasturi - Looks good now. The cluster showed down for a moment but VMs
> stayed up in their appropriate places. Thanks!
>
> < Anyone on this list please feel free to correct my response to Jim if
> it's wrong >
>
> @ Jim - If you can share your gluster volume info / status I can confirm
> (to the best of my knowledge). From my understanding, If you setup the
> volume with something like 'gluster volume set  group virt' this will
> configure some quorum options as well, Ex: http://i.imgur.com/Mya4N5o.png
>
> While, yes, you are configured for an arbiter node, you're still losing quorum
> by dropping from 2 -> 1. You would need 4 nodes with 1 being an arbiter to
> configure quorum, which is in effect 3 writable nodes and 1 arbiter. If one
> gluster node drops, you still have 2 up. Although in this case, you
> probably wouldn't need the arbiter at all
>
> If you are so configured, you can drop the quorum settings and just let the
> arbiter run, since you're not using the arbiter node in your VM cluster part
> (I believe), just the storage cluster part. When using quorum, you need > 50%
> of the cluster being up at one time. Since you have 3 nodes with 1 arbiter,
> you're actually losing 1/2, which == 50%, which == degraded / hindered gluster
>
> Again, this is to the best of my knowledge based on other quorum-backed
> software... and this is what I understand from testing with gluster and
> ovirt thus far
>
> On Fri, Sep 1, 2017 at 11:53 AM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> Huh... OK, how do I convert the arbiter to a full replica, then?  I was
>> misinformed when I created this setup.  I thought the arbiter held
>> enough metadata that it could validate or repudiate any one replica (kinda
>> like the parity drive in a RAID-4 array).  I was also under the impression
>> that one replica + arbiter is enough to keep the array online and
>> functional.
>>
>> --Jim
>>
>> On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler <ckozler...@gmail.com>
>> wrote:
>>
>>> @ Jim - you have only two data bricks and lost quorum. The arbiter only
>>> stores metadata, no actual files. So yes, you were running in degraded mode,
>>> so some operations were hindered.
>>>
>>> @ Sahina - Yes, this actually worked fine for me once I did that.
>>> However, the issue I am still facing is when I go to create a new gluster
>>> storage domain (replica 3, hyperconverged): I tell it the "Host to use" and
>>> I select that host. If I fail that host, all VMs halt. I do not recall this
>>> in 3.6 or early 4.0. This makes it seem like it is "pinning" a node
>>> to a volume (and vice versa), like you could, for instance, with a singular
>>> hyperconverged setup: export a local disk via NFS and then mount it via an
>>> ovirt domain. But of course, this has its caveats. To that end, I am using
>>> gluster replica 3; when configuring it I say "host to use:" node 1, then
>>> in the connection details I give it node1:/data. I fail node1, all VMs
>>> halt. Did I miss something?
>>>
>>> On Fri, Sep 1, 2017 at 2:13 AM, Sahina Bose <sab...@redhat.com> wrote:
>>>
>>>> To the OP question, when you set up a gluster storage domain, you need
>>>> to specify backup-volfile-serve

Re: [ovirt-users] hyperconverged question

2017-09-01 Thread Jim Kusznir
Speaking of the "use managed gluster" option: I created this gluster setup
under ovirt 4.0, when that option wasn't there.  I've gone into my settings,
checked the box, and saved it at least twice, but when I go back into the
storage settings, it's not checked anymore.

The "about" box in the gui reports that I'm using this version: oVirt
Engine Version: 4.1.1.8-1.el7.centos

I thought I was staying up to date, but I'm not sure if I'm doing
everything right on the upgrades... The documentation says to click through
for the hosted engine upgrade instructions, but that link has taken me to a
"page not found" error for several versions now, and I haven't found those
instructions, so I've been "winging it".

--Jim

On Fri, Sep 1, 2017 at 8:53 AM, Jim Kusznir <j...@palousetech.com> wrote:

> Huh... OK, how do I convert the arbiter to a full replica, then?  I was
> misinformed when I created this setup.  I thought the arbiter held
> enough metadata that it could validate or repudiate any one replica (kinda
> like the parity drive in a RAID-4 array).  I was also under the impression
> that one replica + arbiter is enough to keep the array online and
> functional.
>
> --Jim
>
> On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler <ckozler...@gmail.com>
> wrote:
>
>> @ Jim - you have only two data bricks and lost quorum. The arbiter only
>> stores metadata, no actual files. So yes, you were running in degraded mode,
>> so some operations were hindered.
>>
>> @ Sahina - Yes, this actually worked fine for me once I did that.
>> However, the issue I am still facing is when I go to create a new gluster
>> storage domain (replica 3, hyperconverged): I tell it the "Host to use" and
>> I select that host. If I fail that host, all VMs halt. I do not recall this
>> in 3.6 or early 4.0. This makes it seem like it is "pinning" a node
>> to a volume (and vice versa), like you could, for instance, with a singular
>> hyperconverged setup: export a local disk via NFS and then mount it via an
>> ovirt domain. But of course, this has its caveats. To that end, I am using
>> gluster replica 3; when configuring it I say "host to use:" node 1, then
>> in the connection details I give it node1:/data. I fail node1, all VMs
>> halt. Did I miss something?
>>
>> On Fri, Sep 1, 2017 at 2:13 AM, Sahina Bose <sab...@redhat.com> wrote:
>>
>>> To the OP question, when you set up a gluster storage domain, you need
>>> to specify backup-volfile-servers=: where server2 and
>>> server3 also have bricks running. When server1 is down, and the volume is
>>> mounted again - server2 or server3 are queried to get the gluster volfiles.
>>>
>>> @Jim, if this does not work, are you using 4.1.5 build with libgfapi
>>> access? If not, please provide the vdsm and gluster mount logs to analyse
>>>
>>> If VMs go to paused state - this could mean the storage is not
>>> available. You can check "gluster volume status " to see if
>>> at least 2 bricks are running.
>>>
>>> On Fri, Sep 1, 2017 at 11:31 AM, Johan Bernhardsson <jo...@kafit.se>
>>> wrote:
>>>
>>>> If gluster drops in quorum so that it has fewer votes than it should, it
>>>> will stop file operations until quorum is back to normal.  If I remember
>>>> right, you need two bricks writable for quorum to be met, and the
>>>> arbiter is only a vote, to avoid split brain.
>>>>
>>>>
>>>> Basically what you have is a RAID-5 solution without a spare.  When
>>>> one disk dies it will run in degraded mode, and some RAID systems will stop
>>>> the array until you have removed the disk or forced it to run anyway.
>>>>
>>>> You can read up on it here: https://gluster.readthedocs.io/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/
>>>>
>>>> /Johan
>>>>
>>>> On Thu, 2017-08-31 at 22:33 -0700, Jim Kusznir wrote:
>>>>
>>>> Hi all:
>>>>
>>>> Sorry to hijack the thread, but I was about to start essentially the
>>>> same thread.
>>>>
>>>> I have a 3 node cluster; all three are hosts and gluster nodes (replica
>>>> 2 + arbiter).  I DO have the mnt_options=backup-volfile-servers= set:
>>>>
>>>> storage=192.168.8.11:/engine
>>>> mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13
>>>>
>>>> I had an issue today where 192.168.8.11 went down.  ALL VMs immediately
>>>> paused, including the engine (all VMs were running on ho

Re: [ovirt-users] hyperconverged question

2017-09-01 Thread Jim Kusznir
Huh... OK, how do I convert the arbiter to a full replica, then?  I was
misinformed when I created this setup.  I thought the arbiter held
enough metadata that it could validate or repudiate any one replica (kinda
like the parity drive in a RAID-4 array).  I was also under the impression
that one replica + arbiter is enough to keep the array online and
functional.

--Jim

On Fri, Sep 1, 2017 at 5:22 AM, Charles Kozler <ckozler...@gmail.com> wrote:

> @ Jim - you have only two data bricks and lost quorum. The arbiter only
> stores metadata, no actual files. So yes, you were running in degraded mode,
> so some operations were hindered.
>
> @ Sahina - Yes, this actually worked fine for me once I did that. However,
> the issue I am still facing is when I go to create a new gluster storage
> domain (replica 3, hyperconverged): I tell it the "Host to use" and I select
> that host. If I fail that host, all VMs halt. I do not recall this in 3.6
> or early 4.0. This makes it seem like it is "pinning" a node to a
> volume (and vice versa), like you could, for instance, with a singular
> hyperconverged setup: export a local disk via NFS and then mount it via an
> ovirt domain. But of course, this has its caveats. To that end, I am using
> gluster replica 3; when configuring it I say "host to use:" node 1, then
> in the connection details I give it node1:/data. I fail node1, all VMs
> halt. Did I miss something?
>
> On Fri, Sep 1, 2017 at 2:13 AM, Sahina Bose <sab...@redhat.com> wrote:
>
>> To the OP question, when you set up a gluster storage domain, you need to
>> specify backup-volfile-servers=: where server2 and
>> server3 also have bricks running. When server1 is down, and the volume is
>> mounted again - server2 or server3 are queried to get the gluster volfiles.
>>
>> @Jim, if this does not work, are you using 4.1.5 build with libgfapi
>> access? If not, please provide the vdsm and gluster mount logs to analyse
>>
>> If VMs go to paused state - this could mean the storage is not available.
>> You can check "gluster volume status " to see if at least 2 bricks
>> are running.
>>
>> On Fri, Sep 1, 2017 at 11:31 AM, Johan Bernhardsson <jo...@kafit.se>
>> wrote:
>>
>>> If gluster drops in quorum so that it has fewer votes than it should, it
>>> will stop file operations until quorum is back to normal.  If I remember
>>> right, you need two bricks writable for quorum to be met, and the
>>> arbiter is only a vote, to avoid split brain.
>>>
>>>
>>> Basically what you have is a RAID-5 solution without a spare.  When
>>> one disk dies it will run in degraded mode, and some RAID systems will stop
>>> the array until you have removed the disk or forced it to run anyway.
>>>
>>> You can read up on it here: https://gluster.readthedocs.io/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/
>>>
>>> /Johan
>>>
>>> On Thu, 2017-08-31 at 22:33 -0700, Jim Kusznir wrote:
>>>
>>> Hi all:
>>>
>>> Sorry to hijack the thread, but I was about to start essentially the
>>> same thread.
>>>
>>> I have a 3 node cluster, all three are hosts and gluster nodes (replica
>>> 2 + arbitrar).  I DO have the mnt_options=backup-volfile-servers= set:
>>>
>>> storage=192.168.8.11:/engine
>>> mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13
>>>
>>> I had an issue today where 192.168.8.11 went down.  ALL VMs immediately
>>> paused, including the engine (all VMs were running on host2:192.168.8.12).
>>> I couldn't get any gluster stuff working until host1 (192.168.8.11) was
>>> restored.
>>>
>>> What's wrong / what did I miss?
>>>
>>> (this was set up "manually" through the article on setting up a
>>> self-hosted gluster cluster back when 4.0 was new... I've upgraded it to
>>> 4.1 since).
>>>
>>> Thanks!
>>> --Jim
>>>
>>>
>>> On Thu, Aug 31, 2017 at 12:31 PM, Charles Kozler <ckozler...@gmail.com>
>>> wrote:
>>>
>>> Typo..."Set it up and then failed that **HOST**"
>>>
>>> And upon that host going down, the storage domain went down. I only have
>>> hosted storage domain and this new one - is this why the DC went down and
>>> no SPM could be elected?
>>>
>>> I don't recall this working this way in early 4.0 or 3.6
>>>
>>> On Thu, Aug 31, 2017 at 3:30 PM, Charles Kozler <ckozler...@gmail.com>
>>> 

Re: [ovirt-users] Storage slowly expanding

2017-09-01 Thread Jim Kusznir
Thank you!

I created all the VMs using the sparse allocation method.  I wanted a
method that would create disks that do not immediately occupy their full
declared size (e.g., allow overcommit of disk space, as most VM hard drives
are 30-50% empty for their entire life).

I kinda figured that it would not free space on the underlying storage when
a file is deleted within the disk.  What confuses me is that a disk that is
only 30GB to the OS is using 53GB of space on gluster.  In my understanding,
the actual on-disk usage should be limited to 30GB max if I don't take
snapshots.  (I do like having the ability to take snapshots, and I do use
them from time to time, but I usually don't keep a snapshot for an
extended time...just long enough to verify whatever operation I did was
successful.)

I did find the "sparsify" command within ovirt and ran that; it reclaimed
some space (the above example of the 30GB disk, which is actually using 20GB
inside the VM but was using 53GB on gluster, shrunk to 50GB on gluster)...but
there's still at least 20GB unaccounted for there.

I would love it if there were something I could do to reclaim the space
inside the disk that isn't in use, too (e.g., get that disk down to just the
21GB that the VM is actually using).  If I change to virtio-scsi (it's
currently just "virtio"), will that enable DISCARD support, and is
gluster a supported underlying storage?

Thanks!
--Jim
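Editor's aside: the grow-only behavior Yaniv explains in the quoted reply below can be demonstrated on any Linux host with a sparse file, which behaves analogously to a thin disk image: writes allocate blocks, a guest-side "rm" changes nothing, and only an explicit hole-punch (what DISCARD and virt-sparsify ultimately translate to) releases space. The file path is a placeholder of mine.

```shell
# Thin-image analogy using a sparse file on the host filesystem.
img=/tmp/thin-demo.img
truncate -s 100M "$img"     # "virtual size" 100M; allocates almost nothing
du -h "$img"                # near 0 -- thin

# Writing data allocates real blocks, like a guest filling its disk:
dd if=/dev/zero of="$img" bs=1M count=50 conv=notrunc status=none
du -h "$img"                # ~50M now allocated

# Deleting files inside a guest would change nothing here.  Only an
# explicit hole-punch returns the blocks to the host:
fallocate --punch-hole --offset 0 --length 50M "$img"
du -h "$img"                # back to near 0
rm -f "$img"
```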

On Fri, Sep 1, 2017 at 5:45 AM, Yaniv Kaul <yk...@redhat.com> wrote:

>
>
> On Fri, Sep 1, 2017 at 8:41 AM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> Hi all:
>>
>> I have several VMs, all thin provisioned, on my small storage
>> (self-hosted gluster / hyperconverged cluster).  I'm now noticing that some
>> of my VMs (especially my only Windows VM) are using even MORE disk space
>> than the space they were allocated.
>>
>> Example: windows VM: virtual size created at creation: 30GB (thin
>> provisioned).  Actual disk space in use: 19GB.  According to the storage ->
>> Disks tab, it's currently using 39GB.  How do I get that down?
>>
>> I have two other VMs that are somewhat heavy DB load (Zabbix and Unifi);
>> both of those are also larger than their created max size despite disk in
>> machine not being fully utilized.
>>
>> None of these have snapshots.
>>
>
> How come you have qcow2 and not raw-sparse, if you are not using
> snapshots? is it a VM from a template?
>
> Generally, this is how thin provisioning works. The underlying qcow2
> doesn't know when you delete a file from within the guest - as file
> deletion is merely marking entries in the file system tables as free, not
> really doing any deletion IO.
> You could run virt-sparsify on the disks to sparsify them, which will, if
> the underlying storage supports it, reclaim storage space.
> You could use IDE or virtio-SCSI and enable DISCARD support, which will,
> if the underlying storage supports it, reclaim storage space.
>
> Those are not exclusive, btw.
> Y.
>
>
>> How do I fix this?
>>
>> Thanks!
>> --Jim
>>
>> ___
>> Users mailing list
>> Users@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
>>
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


[ovirt-users] Storage slowly expanding

2017-08-31 Thread Jim Kusznir
Hi all:

I have several VMs, all thin provisioned, on my small storage (self-hosted
gluster / hyperconverged cluster).  I'm now noticing that some of my VMs
(especially my only Windows VM) are using even MORE disk space than the
space they were allocated.

Example: windows VM: virtual size created at creation: 30GB (thin
provisioned).  Actual disk space in use: 19GB.  According to the storage ->
Disks tab, it's currently using 39GB.  How do I get that down?

I have two other VMs with somewhat heavy DB load (Zabbix and Unifi); both of
those are also larger than their created max size, despite the disk inside
the guest not being fully utilized.

None of these have snapshots.

How do I fix this?

Thanks!
--Jim
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] hyperconverged question

2017-08-31 Thread Jim Kusznir
Hi all:

Sorry to hijack the thread, but I was about to start essentially the same
thread.

I have a 3 node cluster; all three are hosts and gluster nodes (replica 2 +
arbiter).  I DO have the mnt_options=backup-volfile-servers= set:

storage=192.168.8.11:/engine
mnt_options=backup-volfile-servers=192.168.8.12:192.168.8.13

I had an issue today where 192.168.8.11 went down.  ALL VMs immediately
paused, including the engine (all VMs were running on host2:192.168.8.12).
I couldn't get any gluster stuff working until host1 (192.168.8.11) was
restored.

What's wrong / what did I miss?

(this was set up "manually" through the article on setting up a self-hosted
gluster cluster back when 4.0 was new... I've upgraded it to 4.1 since).

Thanks!
--Jim


On Thu, Aug 31, 2017 at 12:31 PM, Charles Kozler 
wrote:

> Typo..."Set it up and then failed that **HOST**"
>
> And upon that host going down, the storage domain went down. I only have
> hosted storage domain and this new one - is this why the DC went down and
> no SPM could be elected?
>
> I don't recall it working this way in early 4.0 or 3.6
>
> On Thu, Aug 31, 2017 at 3:30 PM, Charles Kozler 
> wrote:
>
>> So I've tested this today and I failed a node. Specifically, I setup a
>> glusterfs domain and selected "host to use: node1". Set it up and then
>> failed that VM
>>
>> However, this did not work and the datacenter went down. My engine stayed
>> up; however, it seems configuring a domain pinned to a "host to use" will
>> obviously cause it to fail
>>
>> This seems counter-intuitive to the point of glusterfs or any redundant
>> storage. If a single host has to be tied to its function, this introduces a
>> single point of failure
>>
>> Am I missing something obvious?
>>
>> On Thu, Aug 31, 2017 at 9:43 AM, Kasturi Narra  wrote:
>>
>>> yes, right.  What you can do is edit the hosted-engine.conf file, where
>>> there is a parameter as shown below [1]; replace h2 and h3 with your
>>> second and third storage servers. Then you will need to restart the
>>> ovirt-ha-agent and ovirt-ha-broker services on all the nodes.
>>>
>>> [1] 'mnt_options=backup-volfile-servers=h2:h3'
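Concretely, the change described above might look like the following (hostnames are placeholders; the conf file path is the usual one on recent hosted-engine setups, so verify it on your install):

```shell
# /etc/ovirt-hosted-engine/hosted-engine.conf -- edit on every host:
#   storage=h1.example.com:/engine
#   mnt_options=backup-volfile-servers=h2.example.com:h3.example.com

# then restart the HA services on each node so the change takes effect
systemctl restart ovirt-ha-agent ovirt-ha-broker
```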
>>>
>>> On Thu, Aug 31, 2017 at 5:54 PM, Charles Kozler 
>>> wrote:
>>>
 Hi Kasturi -

 Thanks for feedback

 > If cockpit+gdeploy plugin would be have been used then that would
 have automatically detected glusterfs replica 3 volume created during
 Hosted Engine deployment and this question would not have been asked

 Actually, hosted-engine --deploy also auto-detects glusterfs.  I know the
 glusterfs fuse client has the ability to fail over between all nodes in the
 cluster, but I am still curious, given that I see node1:/engine in the ovirt
 config (node1 being what I set it to in hosted-engine --deploy). So my
 concern was to find out exactly how the engine behaves when one node goes
 away and the fuse client moves over to another node in the gluster cluster

 But you did somewhat answer my question: the answer seems to be no (by
 default), and I will have to change the parameter in hosted-engine.conf as
 you list

 So I need to do something manually to create HA for the engine on gluster,
 yes?

 Thanks so much!

 On Thu, Aug 31, 2017 at 3:03 AM, Kasturi Narra 
 wrote:

> Hi,
>
> During Hosted Engine setup, the question about the glusterfs volume is
> asked because you set up the volumes yourself. If the cockpit+gdeploy
> plugin had been used, it would have automatically detected the glusterfs
> replica 3 volume created during Hosted Engine deployment, and this
> question would not have been asked.
>
> During new storage domain creation, when glusterfs is selected, there
> is a feature called 'use managed gluster volumes'; upon checking this,
> all managed glusterfs volumes will be listed and you can choose the
> volume of your choice from the dropdown list.
>
> There is a conf file called /etc/hosted-engine/hosted-engine.conf
> with a parameter called backup-volfile-servers="h1:h2"; if one
> of the gluster nodes goes down, the engine uses this parameter to provide
> HA / failover.
>
>  Hope this helps !!
>
> Thanks
> kasturi
>
>
>
> On Wed, Aug 30, 2017 at 8:09 PM, Charles Kozler 
> wrote:
>
>> Hello -
>>
>> I have successfully created a hyperconverged hosted engine setup
>> consisting of 3 nodes: 2 for VMs and the third purely for storage. I
>> manually configured it all, did not use ovirt node or anything. Built the
>> gluster volumes myself
>>
>> However, I noticed that when setting up the hosted engine and even
>> when adding a new storage domain with glusterfs type, it 

Re: [ovirt-users] Recovering from a multi-node failure

2017-08-18 Thread Jim Kusznir
The heal info command shows perfect consistency between nodes; that's what
confused me.  At the moment, the physical partitions (LVM partitions) that
gluster is using are different sizes, but I expected to see the "least
common denominator" for the total size, and I expected it to be consistent
across the cluster.

As this issue was from a couple weeks ago, I don't know what logs to give
you anymore.  Since the original issue, the entire cluster has been
rebooted (not all nodes down at the same time, but every node has been
rebooted).  Now things look a bit different:
[root@ovirt1 ~]# df -h
Filesystem Size  Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root   20G  5.1G   15G  26% /
devtmpfs16G 0   16G   0% /dev
tmpfs   16G 0   16G   0% /dev/shm
tmpfs   16G   34M   16G   1% /run
tmpfs   16G 0   16G   0% /sys/fs/cgroup
/dev/mapper/gluster-iso 25G  7.3G   18G  29% /gluster/brick4
/dev/sda1  497M  315M  183M  64% /boot
/dev/mapper/gluster-engine  25G   13G   13G  49% /gluster/brick1
/dev/mapper/gluster-data   136G  126G   11G  93% /gluster/brick2
192.168.8.11:/engine15G   10G  5.1G  67%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data 136G  126G   11G  93%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso   13G  7.3G  5.8G  56%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
tmpfs  3.2G 0  3.2G   0% /run/user/0

[root@ovirt2 ~]# df -h
Filesystem Size  Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root  8.0G  3.1G  5.0G  39% /
devtmpfs16G 0   16G   0% /dev
tmpfs   16G   16K   16G   1% /dev/shm
tmpfs   16G   90M   16G   1% /run
tmpfs   16G 0   16G   0% /sys/fs/cgroup
/dev/mapper/gluster-engine  15G   10G  5.1G  67% /gluster/brick1
/dev/sda1  497M  307M  191M  62% /boot
/dev/mapper/gluster-iso 13G  7.3G  5.8G  56% /gluster/brick4
/dev/mapper/gluster-data   174G  121G   54G  70% /gluster/brick2
192.168.8.11:/engine15G   10G  5.1G  67%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data 136G  126G   11G  93%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso   13G  7.3G  5.8G  56%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_iso
tmpfs  3.2G 0  3.2G   0% /run/user/0


The thing that still bothers me is that for engine (brick1), ovirt1's
physical disk space used is still higher than ovirt2's, yet the smaller
number is reported on the gluster fs.  For data (brick2), ovirt1 and
ovirt2 physical usage are still different, but the larger number is
reported by glusterfs.

The main question is still:
Is there cause for concern that physical usage for the bricks is not
consistent between replicas that heal info shows completely healed?
(Again, I was so concerned that on ovirt2 I re-deleted everything and let
gluster re-heal the volume, and it came back to the exact same (smaller)
disk usage and claimed to be fully healed.)

--Jim
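One way to line up what gluster believes is healed against what each brick actually holds is a sketch along these lines (volume and brick names taken from this thread; run the du on each node):

```shell
# any pending heals?  (should list zero entries per brick when consistent)
gluster volume heal engine info

# brick-level view as gluster reports it
gluster volume status engine detail

# actual on-disk usage of the brick directory on this node
du -sh /gluster/brick1/engine
```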


On Wed, Aug 16, 2017 at 5:22 AM, Sahina Bose <sab...@redhat.com> wrote:

>
>
> On Sun, Aug 6, 2017 at 4:42 AM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> Well, after a very stressful weekend, I think I have things largely
>> working.  Turns out that most of the above issues were caused by the linux
>> permissions of the exports for all three volumes (they had been reset to
>> 600; setting them to 774 or 770 fixed many of the issues).  Of course, I
>> didn't find that until a much more harrowing outage, and hours and hours of
>> work, including beginning to look at rebuilding my cluster.
>>
>> So, now my cluster is operating again, and everything looks good EXCEPT
>> for one major Gluster issue/question that I haven't found any references or
>> info on.
>>
>> my host ovirt2, one of the replica gluster servers, is the one that lost
>> its storage and had to reinitialize it from the cluster.  the iso volume is
>> perfectly fine and complete, but the engine and data volumes are smaller on
>> disk on this node than on the other node (and this node before the crash).
>> On the engine store, the entire cluster reports the smaller utilization on
>> mounted gluster filesystems; on the data partition, it reports the larger
>> size (rest of cluster).  Here's some df statments to help clarify:
>>
>> (brick1 = engine; brick2=data, brick4=iso):
>> Filesystem Size  Used Avail Use% Mounted on
>> /dev/mapper/gluster

Re: [ovirt-users] Recovering from a multi-node failure

2017-08-05 Thread Jim Kusznir
Well, after a very stressful weekend, I think I have things largely
working.  Turns out that most of the above issues were caused by the linux
permissions of the exports for all three volumes (they had been reset to
600; setting them to 774 or 770 fixed many of the issues).  Of course, I
didn't find that until a much more harrowing outage, and hours and hours of
work, including beginning to look at rebuilding my cluster.

So, now my cluster is operating again, and everything looks good EXCEPT for
one major Gluster issue/question that I haven't found any references or
info on.

My host ovirt2, one of the replica gluster servers, is the one that lost
its storage and had to reinitialize it from the cluster.  The iso volume is
perfectly fine and complete, but the engine and data volumes are smaller on
disk on this node than on the other node (and this node before the crash).
On the engine store, the entire cluster reports the smaller utilization on
mounted gluster filesystems; on the data partition, it reports the larger
size (rest of cluster).  Here's some df statments to help clarify:

(brick1 = engine; brick2=data, brick4=iso):
Filesystem Size  Used Avail Use% Mounted on
/dev/mapper/gluster-engine  25G   12G   14G  47% /gluster/brick1
/dev/mapper/gluster-data   136G  125G   12G  92% /gluster/brick2
/dev/mapper/gluster-iso 25G  7.3G   18G  29% /gluster/brick4
192.168.8.11:/engine15G  9.7G  5.4G  65%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data 136G  125G   12G  92%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso   13G  7.3G  5.8G  56%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_iso

View from ovirt2:
Filesystem Size  Used Avail Use% Mounted on
/dev/mapper/gluster-engine  15G  9.7G  5.4G  65% /gluster/brick1
/dev/mapper/gluster-data   174G  119G   56G  69% /gluster/brick2
/dev/mapper/gluster-iso 13G  7.3G  5.8G  56% /gluster/brick4
192.168.8.11:/engine15G  9.7G  5.4G  65%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data 136G  125G   12G  92%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_data
192.168.8.11:/iso   13G  7.3G  5.8G  56%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_iso

As you can see, in the process of rebuilding the hard drive for ovirt2, I
did resize some things to give more space to data, where I desperately need
it.  If this goes well and the storage is given a clean bill of health at
this time, then I will take ovirt1 down and resize to match ovirt2, and
thus score a decent increase in storage for data.  I fully realize that
right now the gluster mounted volumes should have the total size as the
least common denominator.

So, is this size reduction appropriate?  A big part of me thinks data is
missing, but I even went through and shut down ovirt2's gluster daemons,
wiped all the gluster data, and restarted gluster to allow it a fresh heal
attempt, and it again came back to the exact same size.  This cluster was
originally built about the time ovirt 4.0 came out, and has been upgraded
to 'current', so perhaps some new gluster features are making more
efficient use of space (dedupe or something)?

Thank  you for your assistance!
--Jim

On Fri, Aug 4, 2017 at 7:49 PM, Jim Kusznir <j...@palousetech.com> wrote:

> Hi all:
>
> Today has been rough.  Two of my three nodes went down today, and self
> heal has not been healing well.  Four hours later, VMs are running, but the
> engine is not happy.  It claims the storage domain is down (even though it
> is up on all hosts and VMs are running).  I'm getting a ton of these
> messages logging:
>
> VDSM engine3 command HSMGetAllTasksStatusesVDS failed: Not SPM
>
> Aug 4, 2017 7:23:00 PM
>
> VDSM engine3 command SpmStatusVDS failed: Error validating master storage
> domain: ('MD read error',)
>
> Aug 4, 2017 7:22:49 PM
>
> VDSM engine3 command ConnectStoragePoolVDS failed: Cannot find master
> domain: u'spUUID=5868392a-0148-02cf-014d-0121,
> msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
>
> Aug 4, 2017 7:22:47 PM
>
> VDSM engine1 command ConnectStoragePoolVDS failed: Cannot find master
> domain: u'spUUID=5868392a-0148-02cf-014d-0121,
> msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
>
> Aug 4, 2017 7:22:46 PM
>
> VDSM engine2 command SpmStatusVDS failed: Error validating master storage
> domain: ('MD read error',)
>
> Aug 4, 2017 7:22:44 PM
>
> VDSM engine2 command ConnectStoragePoolVDS failed: Cannot find master
> domain: u'spUUID=5868392a-0148-02cf-014d-0121,
> msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'
>
> Aug 4, 2017 7:22:42 PM
>
> VDSM engine1 command HSMGetAllTasksStatusesVDS failed: Not SPM: ()
>
>
> 
> I cannot set an SPM as it claims the storage domain is 

[ovirt-users] Recovering from a multi-node failure

2017-08-04 Thread Jim Kusznir
Hi all:

Today has been rough.  Two of my three nodes went down today, and self heal
has not been healing well.  Four hours later, VMs are running, but the engine
is not happy.  It claims the storage domain is down (even though it is up
on all hosts and VMs are running).  I'm getting a ton of these messages
logging:

VDSM engine3 command HSMGetAllTasksStatusesVDS failed: Not SPM

Aug 4, 2017 7:23:00 PM

VDSM engine3 command SpmStatusVDS failed: Error validating master storage
domain: ('MD read error',)

Aug 4, 2017 7:22:49 PM

VDSM engine3 command ConnectStoragePoolVDS failed: Cannot find master
domain: u'spUUID=5868392a-0148-02cf-014d-0121,
msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'

Aug 4, 2017 7:22:47 PM

VDSM engine1 command ConnectStoragePoolVDS failed: Cannot find master
domain: u'spUUID=5868392a-0148-02cf-014d-0121,
msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'

Aug 4, 2017 7:22:46 PM

VDSM engine2 command SpmStatusVDS failed: Error validating master storage
domain: ('MD read error',)

Aug 4, 2017 7:22:44 PM

VDSM engine2 command ConnectStoragePoolVDS failed: Cannot find master
domain: u'spUUID=5868392a-0148-02cf-014d-0121,
msdUUID=cdaf180c-fde6-4cb3-b6e5-b6bd869c8770'

Aug 4, 2017 7:22:42 PM

VDSM engine1 command HSMGetAllTasksStatusesVDS failed: Not SPM: ()



I cannot set an SPM as it claims the storage domain is down; I cannot set
the storage domain up.

Also in the storage realm, one of my exports shows substantially less data
than is actually there.

Here's what happened, as best as I understood them:
I went to do maintenance on ovirt2 (needed to replace a faulty RAM stick and
rework the disk).  I put it in maintenance mode, then shut it down and did my
work.  In the process, much of the disk contents were lost (all the gluster
data).  I figured: no big deal, the gluster data is redundant on the
network; it will heal when the node comes back up.

While I was doing maintenance, all but one of the VMs were running on
engine1.  When I turned on engine2, all of a sudden all VMs, including
the main engine, stopped and went non-responsive.  As far as I can tell,
this should not have happened, as I turned ON one host; nonetheless, I
waited for recovery to occur (while customers started calling, asking why
everything stopped working).  As I waited, I was checking, and gluster
volume status only showed ovirt1 and ovirt2.  Apparently gluster had
stopped/failed at some point on ovirt3.  I assume that was the cause of the
outage; still, if everything was working fine with ovirt1's gluster, and
ovirt2 powers on with a very broken gluster (the volume status was showing
N/A in the port fields for the gluster volumes), I would not expect a
working gluster to go stupid like that.

After starting glusterd on ovirt3 and checking the status, all three showed
ovirt1 and ovirt3 as operational, and ovirt2 as N/A.  Unfortunately,
recovery was still not happening, so I did some googling and found the
commands to inquire about the hosted-engine status.  It appeared to be
stuck "paused" and I couldn't find a way to unpause it, so I powered it
off, then started it manually on engine1, and the cluster came back up.  It
showed all VMs paused.  I was able to unpause them and they worked again.
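For reference, the hosted-engine state can be inspected, and a stuck engine VM bounced, with commands along these lines (run on any hosted-engine host):

```shell
hosted-engine --vm-status     # per-host agent state, engine health, scores
hosted-engine --vm-poweroff   # hard-stop a stuck/paused engine VM
hosted-engine --vm-start      # start it on the current host
```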

So now I began to work on the ovirt2 gluster healing problem.  It didn't
appear to be self-healing, but eventually I found this document:
https://support.rackspace.com/how-to/recover-from-a-failed-server-in-a-glusterfs-array/
and from that found the magic xattr commands.  After setting them, the
gluster volumes on ovirt2 came online.  I told iso to heal, and it did, but
it only came up with about half as much data as it should have.  I told it
to heal full, and it did finish off the remaining data and came up to full.
I then told engine to do a full heal (gluster volume heal engine full), and
it transferred its data from the other gluster hosts too.  However, it said
it was done when it hit 9.7GB while there was 15GB on disk!  It is still
stuck that way; the ovirt GUI and gluster volume heal engine info both show
the volume fully healed, but it is not:
[root@ovirt1 ~]# df -h
Filesystem Size  Used Avail Use% Mounted on
/dev/mapper/centos_ovirt-root   20G  4.2G   16G  21% /
devtmpfs16G 0   16G   0% /dev
tmpfs   16G   16K   16G   1% /dev/shm
tmpfs   16G   26M   16G   1% /run
tmpfs   16G 0   16G   0% /sys/fs/cgroup
/dev/mapper/gluster-engine  25G   12G   14G  47% /gluster/brick1
/dev/sda1  497M  315M  183M  64% /boot
/dev/mapper/gluster-data   136G  124G   13G  92% /gluster/brick2
/dev/mapper/gluster-iso 25G  7.3G   18G  29% /gluster/brick4
tmpfs  3.2G 0  3.2G   0% /run/user/0
192.168.8.11:/engine15G  9.7G  5.4G  65%
/rhev/data-center/mnt/glusterSD/192.168.8.11:_engine
192.168.8.11:/data 136G  124G   13G  92%

Re: [ovirt-users] ovirt-hosted-engine state transition messages

2017-07-17 Thread Jim Kusznir
rators::88::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check)
Timeout cleared while transitioning  -> 
MainThread::INFO::2017-07-17
08:16:04,710::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1500304564.71 type=state_transition
detail=EngineUpBadHealth-EngineUp hostname='ovirt1.nwfiber.com'
MainThread::INFO::2017-07-17
08:16:04,798::brokerlink::121::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Success, was notification of state_transition (EngineUpBadHealth-EngineUp)
sent? sent
MainThread::INFO::2017-07-17
08:16:04,799::hosted_engine::604::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm)
Initializing VDSM
MainThread::INFO::2017-07-17
08:16:07,435::hosted_engine::630::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images)
Connecting the storage
MainThread::INFO::2017-07-17
08:16:07,491::storage_server::219::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Connecting storage server
MainThread::INFO::2017-07-17
08:16:13,906::storage_server::226::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Connecting storage server
MainThread::INFO::2017-07-17
08:16:14,131::storage_server::233::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server)
Refreshing the storage domain
MainThread::INFO::2017-07-17
08:16:14,437::hosted_engine::657::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images)
Preparing images
MainThread::INFO::2017-07-17
08:16:14,438::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images)
Preparing images


On Thu, Mar 30, 2017 at 5:58 AM, Simone Tiraboschi <stira...@redhat.com>
wrote:

> Could you please check your /var/log/ovirt-hosted-engine-ha/agent.log ?
>
> On Thu, Mar 30, 2017 at 3:10 AM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> Hello:
>>
>> I find that I often get random-seeming messages.  A lot of them mention
>> "ReinitializeFSM", but I also get engine down, engine start, etc.
>>  messages.  All the time, nothing appears to be happening on the cluster,
>> and I rarely can find anything wrong or any trigger/cause.  Is this
>> normal?  What causes this (beyond obvious hardware issues / hosts
>> rebooting)?  Most of the time when I get these, my cluster is going along
>> smoothly, and nothing (not even administrative access) is interrupted.
>>
>> Could ISP issues cause these messages to be generated?
>>
>> Thanks!
>> --Jim
>>
>> ___
>> Users mailing list
>> Users@ovirt.org
>> http://lists.ovirt.org/mailman/listinfo/users
>>
>>
>
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Setting up GeoReplication

2017-05-15 Thread Jim Kusznir
I tried to create a gluster volume on the georep node by running:

gluster volume create engine-rep replica 1 \
    georep.nwfiber.com:/mnt/gluster/engine-rep

I got back an error saying replica must be > 1.  So I tried to create it
again:

gluster volume create engine-rep replica 2
georep.nwfiber.com:/mnt/gluster/engine-rep
server2.nwfiber.com:/mnt/gluster/engine-rep

where server2 did not exist.  That failed too, but I don't recall the error
message.

gluster is installed, but when I try to start it with the init script, it
fails to start with a complaint about reading the block file; my googling
indicated that's the error you get until you've created a gluster volume,
and that was the first clue that maybe I needed to create one first.

So, how do I create a replica 1 volume?
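For a geo-replication slave, a plain single-brick (distribute) volume is usually what's wanted; simply omit the replica keyword entirely (hostname and path from this thread):

```shell
gluster volume create engine-rep georep.nwfiber.com:/mnt/gluster/engine-rep
gluster volume start engine-rep
```

With no replica count given, gluster creates a distribute volume, which is valid with a single brick.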


Thinking way ahead, I have a related replica question:  Currently my ovirt
nodes are also my gluster nodes (replica 2 + arbiter 1).  Eventually I'll
want to pull gluster off onto dedicated hardware, I suspect.  If I do so,
do I need 3 servers, or is replica 2 sufficient?  I guess I could have an
ovirt node continue to be an arbiter...  I would eventually like to
distribute my ovirt cluster across multiple locations with the option for
remote failover (say location A loses all its network and/or power; have
important VMs started at location B in addition to location B's normal
VMs).  I assume at this point the recommended arch would be:

2 Gluster servers at each location
Each location has a gluster volume for that location and is a georep target
for the other location (so all my data will physically exist on 4 gluster
servers).  I probably won't have more than 2 or 3 ovirt hosts at each
location, so I don't expect this to be a "heavy use" system.

Am I on track?  I'd be interested to learn what others suggest for this
deployment model.

On Sun, May 14, 2017 at 11:09 PM, Sahina Bose <sab...@redhat.com> wrote:

> Adding Aravinda
>
> On Sat, May 13, 2017 at 11:21 PM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> Hi All:
>>
>> I've been trying to set up georeplication for a while now, but can't seem
>> to make it work.  I've found documentation on the web (mostly
>> https://gluster.readthedocs.io/en/refactor/Administr
>> ator%20Guide/Geo%20Replication/), and I found http://blog.gluster.org/
>> 2015/09/introducing-georepsetup-gluster-geo-replication-setup-tool/
>>
>> Unfortunately, it seems that some critical steps are missing from both,
>> and I can't figure out for sure what they are.
>>
>> My environment:
>>
>> Production: replica 2 + arbitrator running on my 3-node oVirt cluster, 3
>> volumes (engine, data, iso).
>>
>> New geo-replication: Raspberry Pi3 with USB hard drive shoved in some
>> other data closet off-site.
>>
>> I've installed raspbian-lite, and after much fighting, got
>> glusterfs-*-3.8.11 installed.  I've created my mountpoint (USB hard drive,
>> much larger than my gluster volumes), and then ran the command.  I get this
>> far:
>>
>> [OK] georep.nwfiber.com is Reachable(Port 22)
>> [OK] SSH Connection established r...@georep.nwfiber.com
>> [OK] Master Volume and Slave Volume are compatible (Version: 3.8.11)
>> [NOT OK] Unable to Mount Gluster Volume georep.nwfiber.com:engine-rep
>>
>> Trying it with the steps in the gluster docs also has the same problem.
>> No log files are generated on the slave.  Log files on the master include:
>>
>> [root@ovirt1 geo-replication]# more georepsetup.mount.log
>> [2017-05-13 17:26:27.318599] I [MSGID: 100030] [glusterfsd.c:2454:main]
>> 0-glusterfs: Started running glusterfs version 3.8.11 (args:
>>  glusterfs --xlator-option="*dht.lookup-unhashed=off" --volfile-server
>> localhost --volfile-id engine -l /var/log/glusterfs/geo-repli
>> cation/georepsetup.mount.log --client-pid=-1 /tmp/georepsetup_wZtfkN)
>> [2017-05-13 17:26:27.341170] I [MSGID: 101190]
>> [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread
>> with index 1
>> [2017-05-13 17:26:27.341260] E [socket.c:2309:socket_connect_finish]
>> 0-glusterfs: connection to ::1:24007 failed (Connection refused
>> )
>> [2017-05-13 17:26:27.341846] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify]
>> 0-glusterfsd-mgmt: failed to connect with remote-host: local
>> host (Transport endpoint is not connected)
>> [2017-05-13 17:26:31.335849] I [MSGID: 101190]
>> [event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread
>> with index 2
>> [2017-05-13 17:26:31.337545] I [MSGID: 114020] [client.c:2356:notify]
>> 0-engine-client-0: parent translators are ready, attempting co
>> nnect on transport
>> [2017-05-13 17:26:31.344485] I [MS

[ovirt-users] Setting up GeoReplication

2017-05-13 Thread Jim Kusznir
Hi All:

I've been trying to set up georeplication for a while now, but can't seem
to make it work.  I've found documentation on the web (mostly
https://gluster.readthedocs.io/en/refactor/Administrator%20Guide/Geo%20Replication/),
and I found
http://blog.gluster.org/2015/09/introducing-georepsetup-gluster-geo-replication-setup-tool/

Unfortunately, it seems that some critical steps are missing from both, and
I can't figure out for sure what they are.

My environment:

Production: replica 2 + arbitrator running on my 3-node oVirt cluster, 3
volumes (engine, data, iso).

New geo-replication: Raspberry Pi3 with USB hard drive shoved in some other
data closet off-site.

I've installed raspbian-lite, and after much fighting, got
glusterfs-*-3.8.11 installed.  I've created my mountpoint (USB hard drive,
much larger than my gluster volumes), and then ran the command.  I get this
far:

[OK] georep.nwfiber.com is Reachable(Port 22)
[OK] SSH Connection established r...@georep.nwfiber.com
[OK] Master Volume and Slave Volume are compatible (Version: 3.8.11)
[NOT OK] Unable to Mount Gluster Volume georep.nwfiber.com:engine-rep

Trying it with the steps in the gluster docs also has the same problem.  No
log files are generated on the slave.  Log files on the master include:

[root@ovirt1 geo-replication]# more georepsetup.mount.log
[2017-05-13 17:26:27.318599] I [MSGID: 100030] [glusterfsd.c:2454:main]
0-glusterfs: Started running glusterfs version 3.8.11 (args:
 glusterfs --xlator-option="*dht.lookup-unhashed=off" --volfile-server
localhost --volfile-id engine -l /var/log/glusterfs/geo-repli
cation/georepsetup.mount.log --client-pid=-1 /tmp/georepsetup_wZtfkN)
[2017-05-13 17:26:27.341170] I [MSGID: 101190]
[event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2017-05-13 17:26:27.341260] E [socket.c:2309:socket_connect_finish]
0-glusterfs: connection to ::1:24007 failed (Connection refused
)
[2017-05-13 17:26:27.341846] E [glusterfsd-mgmt.c:1908:mgmt_rpc_notify]
0-glusterfsd-mgmt: failed to connect with remote-host: local
host (Transport endpoint is not connected)
[2017-05-13 17:26:31.335849] I [MSGID: 101190]
[event-epoll.c:628:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 2
[2017-05-13 17:26:31.337545] I [MSGID: 114020] [client.c:2356:notify]
0-engine-client-0: parent translators are ready, attempting co
nnect on transport
[2017-05-13 17:26:31.344485] I [MSGID: 114020] [client.c:2356:notify]
0-engine-client-1: parent translators are ready, attempting co
nnect on transport
[2017-05-13 17:26:31.345146] I [rpc-clnt.c:1965:rpc_clnt_reconfig]
0-engine-client-0: changing port to 49157 (from 0)
[2017-05-13 17:26:31.350868] I [MSGID: 114020] [client.c:2356:notify]
0-engine-client-2: parent translators are ready, attempting co
nnect on transport
[2017-05-13 17:26:31.355946] I [MSGID: 114057]
[client-handshake.c:1440:select_server_supported_programs]
0-engine-client-0: Using P
rogram GlusterFS 3.3, Num (1298437), Version (330)
[2017-05-13 17:26:31.356280] I [rpc-clnt.c:1965:rpc_clnt_reconfig]
0-engine-client-1: changing port to 49157 (from 0)
Final graph:
+--+
  1: volume engine-client-0
  2: type protocol/client
  3: option clnt-lk-version 1
  4: option volfile-checksum 0
  5: option volfile-key engine
  6: option client-version 3.8.11
  7: option process-uuid
ovirt1.nwfiber.com-25660-2017/05/13-17:26:27:311929-engine-client-0-0-0
  8: option fops-version 1298437
  9: option ping-timeout 30
 10: option remote-host ovirt1.nwfiber.com
 11: option remote-subvolume /gluster/brick1/engine
 12: option transport-type socket
 13: option username 028984cf-0399-42e6-b04b-bb9b1685c536
 14: option password eae737cc-9659-405f-865e-9a7ef97a3307
 15: option filter-O_DIRECT off
 16: option send-gids true
 17: end-volume
 18:
 19: volume engine-client-1
 20: type protocol/client
 21: option ping-timeout 30
 22: option remote-host ovirt2.nwfiber.com
 23: option remote-subvolume /gluster/brick1/engine
 24: option transport-type socket
 25: option username 028984cf-0399-42e6-b04b-bb9b1685c536
 26: option password eae737cc-9659-405f-865e-9a7ef97a3307
 27: option filter-O_DIRECT off
 28: option send-gids true
 29: end-volume
 30:
 31: volume engine-client-2
 32: type protocol/client
 33: option ping-timeout 30
 34: option remote-host ovirt3.nwfiber.com
 35: option remote-subvolume /gluster/brick1/engine
 36: option transport-type socket
 37: option username 028984cf-0399-42e6-b04b-bb9b1685c536
 38: option password eae737cc-9659-405f-865e-9a7ef97a3307
 39: option filter-O_DIRECT off
 40: option send-gids true
 41: end-volume
 42:
 43: volume engine-replicate-0
 44: type cluster/replicate
 45: option arbiter-count 1
 46: option data-self-heal-algorithm full
 47: option 

Re: [ovirt-users] Ovirt tasks "stuck"

2017-04-25 Thread Jim Kusznir
(sorry. e-mail client sent message prematurely)

Ok, I figured out that this needs to be run on the engine, I figured out
that PGPASSWORD is the postgres password, and I finally figured out that
the db password is stored in:
/etc/ovirt-engine/engine.conf.d/10-setup-database.conf

Unfortunately, when I run the command provided, I get just an empty line
back, no UUIDs.

I looked in the gui, under the disks tab and found the ID there.  I ran the
command on the two UUIDs for the two disks in question:

[root@ovirt ~]# PGPASSWORD=
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -q -t disk -u engine

[root@ovirt ~]# PGPASSWORD=
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot -u
engine 405fabe0-873c-4e8e-ae10-9990debf96c0
Caution, this operation may lead to data corruption and should be used with
care. Please contact support prior to running this command
Are you sure you want to proceed? [y/n]
y
select fn_db_unlock_snapshot('405fabe0-873c-4e8e-ae10-9990debf96c0');


INSERT 0 1
unlock snapshot 405fabe0-873c-4e8e-ae10-9990debf96c0 completed successfully.
[root@ovirt ~]# PGPASSWORD=
/usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh -t snapshot -u
engine eada2c1c-1d99-4391-9be3-352c411a0a91
Caution, this operation may lead to data corruption and should be used with
care. Please contact support prior to running this command
Are you sure you want to proceed? [y/n]
y
select fn_db_unlock_snapshot('eada2c1c-1d99-4391-9be3-352c411a0a91');


INSERT 0 1
unlock snapshot eada2c1c-1d99-4391-9be3-352c411a0a91 completed successfully.

Unfortunately, this doesn't appear to have accomplished anything.  In the
web UI, the disks are still shown as locked, and the tasks are still shown
as pending.

I logged into a host node and found the directory by the same UUID:

[root@ovirt1 images]# cd 405fabe0-873c-4e8e-ae10-9990debf96c0/
[root@ovirt1 405fabe0-873c-4e8e-ae10-9990debf96c0]# ls
8e4a02a7-760b-478c-a694-81466d601356
 8e4a02a7-760b-478c-a694-81466d601356.lease
 8e4a02a7-760b-478c-a694-81466d601356.meta
[root@ovirt1 405fabe0-873c-4e8e-ae10-9990debf96c0]# du -sh
514M .


I'm assuming I should NOT just rm these files and the containing
directory...

Suggestions moving forward?
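For what it's worth, when unlock_entity.sh clears the DB flag but the UI still shows locked disks and pending tasks, the same dbutils directory also ships a task cleanup script, and restarting the engine makes it re-read task/lock state. A hedged sketch, not verified against this exact version (check ./taskcleaner.sh -h first; like unlock_entity.sh, it can corrupt state if misused):

```shell
# Run on the engine host. taskcleaner.sh lives alongside unlock_entity.sh;
# the ENGINE_DB_PASSWORD variable name is from 10-setup-database.conf as
# mentioned above -- verify it on your engine before relying on it.
cd /usr/share/ovirt-engine/setup/dbutils
source /etc/ovirt-engine/engine.conf.d/10-setup-database.conf
PGPASSWORD="${ENGINE_DB_PASSWORD}" ./taskcleaner.sh -u engine   # inspect/clean zombie tasks
systemctl restart ovirt-engine   # engine rebuilds its task view on startup
```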


On Tue, Apr 25, 2017 at 8:50 AM, Jim Kusznir <j...@palousetech.com> wrote:

> Ok, I figured out that this needs to be run on the engine, I figured out
> that PGPASSWORD
>
> On Tue, Apr 4, 2017 at 2:02 AM, Nathanaël Blanchet <blanc...@abes.fr>
> wrote:
>
>> For instance
>>
>> PGPASSWORD=X /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh
>> -q -t disk -u engine
>> 296c010e-3c1d-4008-84b3-5cd39cff6aa1 | 525a4dda-dbbb-4872-a5f1-8ac2ae
>> d48392
>>
>> PGPASSWORD=X /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh
>> -t snapshot -u engine 525a4dda-dbbb-4872-a5f1-8ac2aed48392
>>
>> Le 01/04/2017 à 19:55, Jim Kusznir a écrit :
>>
>> Hi:
>>
>> A few days ago I attempted to create a new VM from one of the
>> ovirt-image-repository images.  I haven't really figured out how to use
>> this reliably yet, and in this case, while trying to import an image, one
>> of my nodes spontaneously rebooted (or at least, it looked like that to
>> ovirt...Not sure if it had an OOM issue or something else).  I assume it
>> was the node that got the task of importing those images, as ever since
>> then (several days now), on my management screen under "Tasks" it shows the
>> attempted imports, still stuck in "processing".  I'm quite certain it's not
>> actually processing.  I do believe it used some of my storage up in the
>> partially downloaded images, though (they do show up as
>> GlanceDisk-, with a status of "Locked" under the main Disks tab.
>>
>> How do I "properly" recover from this (abort the task and delete the
>> partial download)?
>>
>> Thanks!
>>
>> --Jim
>>
>>
>>
>>
>> --
>> Nathanaël Blanchet
>>
>> Supervision réseau
>> Pôle Infrastrutures Informatiques
>> 227 avenue Professeur-Jean-Louis-Viala
>> 34193 MONTPELLIER CEDEX 5
>> Tél. 33 (0)4 67 54 84 55
>> Fax  33 (0)4 67 54 84 14
>> blanc...@abes.fr
>>
>>
>


Re: [ovirt-users] Ovirt tasks "stuck"

2017-04-25 Thread Jim Kusznir
Ok, I figured out that this needs to be run on the engine, I figured out
that PGPASSWORD

On Tue, Apr 4, 2017 at 2:02 AM, Nathanaël Blanchet <blanc...@abes.fr> wrote:

> For instance
>
> PGPASSWORD=X /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh
> -q -t disk -u engine
> 296c010e-3c1d-4008-84b3-5cd39cff6aa1 | 525a4dda-dbbb-4872-a5f1-
> 8ac2aed48392
>
> PGPASSWORD=X /usr/share/ovirt-engine/setup/dbutils/unlock_entity.sh
> -t snapshot -u engine 525a4dda-dbbb-4872-a5f1-8ac2aed48392
>
> Le 01/04/2017 à 19:55, Jim Kusznir a écrit :
>
> Hi:
>
> A few days ago I attempted to create a new VM from one of the
> ovirt-image-repository images.  I haven't really figured out how to use
> this reliably yet, and in this case, while trying to import an image, one
> of my nodes spontaneously rebooted (or at least, it looked like that to
> ovirt...Not sure if it had an OOM issue or something else).  I assume it
> was the node that got the task of importing those images, as ever since
> then (several days now), on my management screen under "Tasks" it shows the
> attempted imports, still stuck in "processing".  I'm quite certain it's not
> actually processing.  I do believe it used some of my storage up in the
> partially downloaded images, though (they do show up as
> GlanceDisk-, with a status of "Locked" under the main Disks tab.
>
> How do I "properly" recover from this (abort the task and delete the
> partial download)?
>
> Thanks!
>
> --Jim
>
>
>
>
> --
> Nathanaël Blanchet
>
> Supervision réseau
> Pôle Infrastrutures Informatiques
> 227 avenue Professeur-Jean-Louis-Viala
> 34193 MONTPELLIER CEDEX 5 
> Tél. 33 (0)4 67 54 84 55
> Fax  33 (0)4 67 54 84 14
> blanc...@abes.fr
>
>


[ovirt-users] Ovirt tasks "stuck"

2017-04-01 Thread Jim Kusznir
Hi:

A few days ago I attempted to create a new VM from one of the
ovirt-image-repository images.  I haven't really figured out how to use
this reliably yet, and in this case, while trying to import an image, one
of my nodes spontaneously rebooted (or at least, it looked like that to
ovirt...Not sure if it had an OOM issue or something else).  I assume it
was the node that got the task of importing those images, as ever since
then (several days now), on my management screen under "Tasks" it shows the
attempted imports, still stuck in "processing".  I'm quite certain it's not
actually processing.  I do believe it used some of my storage up in the
partially downloaded images, though (they do show up as
GlanceDisk-, with a status of "Locked" under the main Disks tab.

How do I "properly" recover from this (abort the task and delete the
partial download)?

Thanks!

--Jim


Re: [ovirt-users] Gluster and oVirt 4.0 questions

2017-04-01 Thread Jim Kusznir
Based on the suggestions here, I did successfully remove the unused export
gluster brick and allocate all otherwise unassigned space to my data
export, then used xfs_growfs to realize the new size.  This should hold me
for a while longer before building a "proper" storage solution.
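For anyone following along, the steps described above look roughly like this. This is a sketch: the VG/LV names and brick mount points are assumed from the `gluster volume info` and `lvdisplay` output quoted elsewhere in this thread, so double-check against your own layout before running anything destructive:

```shell
# Assumed names: VG "gluster", thin LV "export" (unused volume),
# data brick mounted at /gluster/brick2/data.
gluster volume stop export               # the export volume was already unused
gluster volume delete export
umount /gluster/brick3/export
lvremove /dev/gluster/export             # returns its space to the thin pool
lvextend -L +25G /dev/gluster/data       # grow the data LV's (virtual) size
xfs_growfs /gluster/brick2/data          # XFS grows online, while mounted
```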

--Jim

On Sat, Apr 1, 2017 at 10:02 AM, Jim Kusznir <j...@palousetech.com> wrote:

> Thank you!
>
> Here's the output of gluster volume info:
> [root@ovirt1 ~]# gluster volume info
>
> Volume Name: data
> Type: Replicate
> Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59
> Status: Started
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick2/data
> Brick2: ovirt2.nwfiber.com:/gluster/brick2/data
> Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter)
> Options Reconfigured:
> performance.strict-o-direct: on
> nfs.disable: on
> user.cifs: off
> network.ping-timeout: 30
> cluster.shd-max-threads: 6
> cluster.shd-wait-qlength: 1
> cluster.locking-scheme: granular
> cluster.data-self-heal-algorithm: full
> performance.low-prio-threads: 32
> features.shard-block-size: 512MB
> features.shard: on
> storage.owner-gid: 36
> storage.owner-uid: 36
> cluster.server-quorum-type: server
> cluster.quorum-type: auto
> network.remote-dio: enable
> cluster.eager-lock: enable
> performance.stat-prefetch: off
> performance.io-cache: off
> performance.read-ahead: off
> performance.quick-read: off
> performance.readdir-ahead: on
> server.allow-insecure: on
>
> Volume Name: engine
> Type: Replicate
> Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de
> Status: Started
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine
> Brick2: ovirt2.nwfiber.com:/gluster/brick1/engine
> Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter)
> Options Reconfigured:
> performance.readdir-ahead: on
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: off
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> storage.owner-uid: 36
> storage.owner-gid: 36
> features.shard: on
> features.shard-block-size: 512MB
> performance.low-prio-threads: 32
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-wait-qlength: 1
> cluster.shd-max-threads: 6
> network.ping-timeout: 30
> user.cifs: off
> nfs.disable: on
> performance.strict-o-direct: on
>
> Volume Name: export
> Type: Replicate
> Volume ID: 04ee58c7-2ba1-454f-be99-26ac75a352b4
> Status: Stopped
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick3/export
> Brick2: ovirt2.nwfiber.com:/gluster/brick3/export
> Brick3: ovirt3.nwfiber.com:/gluster/brick3/export (arbiter)
> Options Reconfigured:
> performance.readdir-ahead: on
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: off
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> storage.owner-uid: 36
> storage.owner-gid: 36
> features.shard: on
> features.shard-block-size: 512MB
> performance.low-prio-threads: 32
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-wait-qlength: 1
> cluster.shd-max-threads: 6
> network.ping-timeout: 30
> user.cifs: off
> nfs.disable: on
> performance.strict-o-direct: on
>
> Volume Name: iso
> Type: Replicate
> Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92
> Status: Started
> Number of Bricks: 1 x (2 + 1) = 3
> Transport-type: tcp
> Bricks:
> Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso
> Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso
> Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter)
> Options Reconfigured:
> performance.readdir-ahead: on
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: off
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> storage.owner-uid: 36
> storage.owner-gid: 36
> features.shard: on
> features.shard-block-size: 512MB
> performance.low-prio-threads: 32
> cluster.data-self-heal-algorithm: full
> cluster.locking-scheme: granular
> cluster.shd-wait-qlength: 1
> cluster.shd-max-threads: 6
> network.ping-timeout: 30
> user.cifs: off
> nfs.disable: on
> performance.strict-o-direct: on
>
>
> The no

Re: [ovirt-users] Gluster and oVirt 4.0 questions

2017-04-01 Thread Jim Kusznir
 data   lvthinpool_tdata
  LV Status  available
  # open 4
  LV Size150.00 GiB
  Allocated pool data65.02%
  Allocated metadata 14.92%
  Current LE 38400
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:5

  --- Logical volume ---
  LV Path/dev/gluster/data
  LV Namedata
  VG Namegluster
  LV UUIDNBxLOJ-vp48-GM4I-D9ON-4OcB-hZrh-MrDacn
  LV Write Accessread/write
  LV Creation host, time ovirt1.nwfiber.com, 2016-12-31 14:40:11 -0800
  LV Pool name   lvthinpool
  LV Status  available
  # open 1
  LV Size100.00 GiB
  Mapped size90.28%
  Current LE 25600
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:7

  --- Logical volume ---
  LV Path/dev/gluster/export
  LV Nameexport
  VG Namegluster
  LV UUIDbih4nU-1QfI-tE12-ZLp0-fSR5-dlKt-YHkhx8
  LV Write Accessread/write
  LV Creation host, time ovirt1.nwfiber.com, 2016-12-31 14:40:20 -0800
  LV Pool name   lvthinpool
  LV Status  available
  # open 1
  LV Size25.00 GiB
  Mapped size0.12%
  Current LE 6400
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:8

  --- Logical volume ---
  LV Path/dev/gluster/iso
  LV Nameiso
  VG Namegluster
  LV UUIDl8l1JU-ViD3-IFiZ-TucN-tGPE-Toqc-Q3R6uX
  LV Write Accessread/write
  LV Creation host, time ovirt1.nwfiber.com, 2016-12-31 14:40:29 -0800
  LV Pool name   lvthinpool
  LV Status  available
  # open 1
  LV Size25.00 GiB
  Mapped size28.86%
  Current LE 6400
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:9

  --- Logical volume ---
  LV Path/dev/centos_ovirt/swap
  LV Nameswap
  VG Namecentos_ovirt
  LV UUIDPcVQ11-hQ9U-9KZT-QPuM-HwT6-8o49-2hzNkQ
  LV Write Accessread/write
  LV Creation host, time localhost, 2016-12-31 13:56:36 -0800
  LV Status  available
  # open 2
  LV Size16.00 GiB
  Current LE 4096
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:1

  --- Logical volume ---
  LV Path/dev/centos_ovirt/root
  LV Nameroot
  VG Namecentos_ovirt
  LV UUIDg2h2fn-sF0r-Peos-hAE1-WEo9-WENO-MlO3ly
  LV Write Accessread/write
  LV Creation host, time localhost, 2016-12-31 13:56:36 -0800
  LV Status  available
  # open 1
  LV Size20.00 GiB
  Current LE 5120
  Segments   1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device   253:0



I don't use the export gluster volume, and I've never used lvthinpool-type
allocations before, so I'm not sure if there's anything special there.

I followed the setup instructions from an ovirt contributed documentation
that I can't find now that talked about how to install ovirt with gluster
on a 3-node cluster.

Thank you for your assistance!
--Jim

On Thu, Mar 30, 2017 at 1:27 AM, Sahina Bose <sab...@redhat.com> wrote:

>
>
> On Thu, Mar 30, 2017 at 1:23 PM, Liron Aravot <lara...@redhat.com> wrote:
>
>> Hi Jim, please see inline
>>
>> On Thu, Mar 30, 2017 at 4:08 AM, Jim Kusznir <j...@palousetech.com> wrote:
>>
>>> hello:
>>>
>>> I've been running my ovirt Version 4.0.5.5-1.el7.centos cluster for a
>>> while now, and am now revisiting some aspects of it for ensuring that I
>>> have good reliability.
>>>
>>> My cluster is a 3 node cluster, with gluster nodes running on each
>>> node.  After running my cluster a bit, I'm realizing I didn't do a very
>>> optimal job of allocating the space on my disk to the different gluster
>>> mount points.  Fortunately, they were created with LVM, so I'm hoping that
>>> I can resize them without much trouble.
>>>
>>> I have a domain for iso, domain for export, and domain for storage, all
>>> thin provisioned; then a domain for the engine, not thin provisioned.  I

[ovirt-users] ovirt-hosted-engine state transition messages

2017-03-29 Thread Jim Kusznir
Hello:

I find that I often get random-seeming messages.  A lot of them mention
"ReinitializeFSM", but I also get engine down, engine start, etc.
 messages.  All the time, nothing appears to be happening on the cluster,
and I rarely can find anything wrong or any trigger/cause.  Is this
normal?  What causes this (beyond obvious hardware issues / hosts
rebooting)?  Most of the time when I get these, my cluster is going along
smoothly, and nothing (not even administrative access) is interrupted.

Could ISP issues cause these messages to be generated?
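These notifications come from the hosted-engine HA agent's state machine, so the agent and broker logs on each host are the first place to correlate them. A sketch, using the log paths shipped by ovirt-hosted-engine-ha (adjust if yours differ):

```shell
# On each host: the HA agent logs the transitions it emails about.
grep -i reinitialize /var/log/ovirt-hosted-engine-ha/agent.log
less /var/log/ovirt-hosted-engine-ha/broker.log   # monitor/network check results
hosted-engine --vm-status                         # current score and state per host
```

Network-monitor flaps (e.g. a ping gateway check failing briefly) show up in the broker log, which is why upstream connectivity problems can plausibly trigger these messages.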

Thanks!
--Jim


[ovirt-users] Gluster and oVirt 4.0 questions

2017-03-29 Thread Jim Kusznir
hello:

I've been running my ovirt Version 4.0.5.5-1.el7.centos cluster for a while
now, and am now revisiting some aspects of it for ensuring that I have good
reliability.

My cluster is a 3 node cluster, with gluster nodes running on each node.
After running my cluster a bit, I'm realizing I didn't do a very optimal
job of allocating the space on my disk to the different gluster mount
points.  Fortunately, they were created with LVM, so I'm hoping that I can
resize them without much trouble.

I have a domain for iso, domain for export, and domain for storage, all
thin provisioned; then a domain for the engine, not thin provisioned.  I'd
like to expand the storage domain, and possibly shrink the engine domain
and make that space also available to the main storage domain.  Is it as
simple as expanding the LVM partition, or are there more steps involved?
Do I need to take the node offline?
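On the growing side, this is usually just an LV extend plus an online filesystem grow; shrinking is the hard part. A sketch, with VG/LV names assumed from the `lvdisplay` output quoted later in this thread:

```shell
# Growing: XFS supports online grow, so the node should not need to go offline.
lvextend -L +20G /dev/gluster/data     # assumed VG "gluster", LV "data"
xfs_growfs /gluster/brick2/data        # assumed brick mount point
# Shrinking: XFS cannot shrink. Reclaiming space from the engine LV would
# mean backing it up, recreating the filesystem smaller, and restoring.
```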

second, I've noticed that the first two nodes seem to have a full copy of
the data (the disks are in use), but the 3rd node appears to not be using
any of its storage space...It is participating in the gluster cluster,
though.

Third, currently gluster shares the same network as the VM networks.  I'd
like to put it on its own network.  I'm not sure how to do this, as when I
tried to do it at install time, I never got the cluster to come online; I
had to make them share the same network to make that work.


Ovirt questions:
I've noticed that recently, I don't appear to be getting software updates
anymore.  I used to get update available notifications on my nodes every
few days; I haven't seen one for a couple weeks now.  is something wrong?

I have a windows 10 x64 VM.  I get a warning that my VM type does not match
the installed OS.  All works fine, but I've quadruple-checked that it does
match.  Is this a known bug?

I have a UPS that all three nodes and the networking are on.  It is a USB
UPS.  How should I best integrate monitoring in?  I could put a raspberry
pi up and then run NUT or similar on it, but is there a "better" way with
oVirt?

Thanks!
--Jim


[ovirt-users] Hosted Engine migration problems

2017-02-10 Thread Jim Kusznir
Hi again:

I thought I had fixed the hosted engine migration that was preventing me
from updating the host the engine was running on.  Today it let me migrate
it from ovirt1 to ovirt2, and perform needed updates on ovirt1.  When I
tried to migrate it back to ovirt1 after the updates, I got errors that it
failed migration.  I tried an auto-migrate, and it claimed that the other
two nodes (including the node it was running on) do not meet minimum
requirements, specifically that they are not HA nodes... But I did
explicitly set them up as HA nodes.

Here's the engine.log output from the command:

2017-02-11 06:12:03,078 INFO
 [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-41)
[252e1f97] Candidate host 'engine1'
('1e182fb9-8057-42ed-abd6-bc5bc343ccc6') was filtered out by
'VAR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)
2017-02-11 06:12:03,078 INFO
 [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-41)
[252e1f97] Candidate host 'engine3'
('bac8ace2-cf7e-48ea-9113-b82343cd87f7') was filtered out by
'VAR__FILTERTYPE__INTERNAL' filter 'HA' (correlation id: null)
2017-02-11 06:12:03,081 INFO
 [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (default task-41)
[252e1f97] Candidate host 'engine2'
('76c075fc-1dfb-479d-98ef-57575ec11787') was filtered out by
'VAR__FILTERTYPE__INTERNAL' filter 'Migration' (correlation id: null)
2017-02-11 06:12:03,081 WARN  [org.ovirt.engine.core.bll.MigrateVmCommand]
(default task-41) [252e1f97] Validation of action 'MigrateVm' failed for
user admin@internal-authz. Reasons:
VAR__ACTION__MIGRATE,VAR__TYPE__VM,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
engine1,$filterName
HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL,VAR__FILTERTYPE__INTERNAL,$hostName
engine3,$filterName
HA,VAR__DETAIL__NOT_HE_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL,VAR__FILTERTYPE__INTERNAL,$hostName
engine2,$filterName
Migration,VAR__DETAIL__SAME_HOST,SCHEDULING_HOST_FILTERED_REASON_WITH_DETAIL

I'm a bit confused by this... I followed the ovirt+gluster howto referenced
from the contributed documentation page.
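The 'HA' filter lines above (VAR__DETAIL__NOT_HE_HOST) mean the engine does not consider those hosts hosted-engine capable at all, regardless of how they were intended to be set up. At the time, the usual fix was to (re)deploy hosted-engine support on each additional host; a sketch, with details varying by oVirt version:

```shell
hosted-engine --vm-status    # shows which hosts actually participate in HE
# For each host missing from that list, either use the web UI
# (maintenance -> Reinstall -> Hosted Engine: DEPLOY -> activate),
# or on the host itself:
yum install -y ovirt-hosted-engine-setup
hosted-engine --deploy       # answer that this host joins the existing HE setup
```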

--Jim


[ovirt-users] oVirt maintence "best practices"

2017-02-10 Thread Jim Kusznir
hello:

Now that I've had my ovirt cluster running for about a month and a half, I
am realizing I don't necessarily know the best practices for keeping it up.

I've been seeing the notices in the ovirt hosts screen showing that there
are updates waiting for the hosts, and I'll put them in maintenance mode one
at a time and apply the updates.

What about the engine itself?  Is it recommended/safe to log into the
engine and just run "yum update"?  Are there other procedures I should be
doing?
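For reference, a plain "yum update" on the engine is not the whole story: the engine packages are meant to be upgraded through engine-setup. The flow documented for oVirt 4.x looked roughly like this (a sketch; verify against the upgrade guide for your exact version):

```shell
engine-upgrade-check            # reports whether an engine upgrade is available
yum update "ovirt*setup*"       # update only the setup/installer packages first
engine-setup                    # performs the actual engine upgrade (takes a backup)
yum update                      # then apply the remaining OS package updates
```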

My oVirt cluster is a 3 node cluster with gluster, running on each node.

Thanks!
--Jim


Re: [ovirt-users] Optimizations for VoIP VM

2017-02-10 Thread Jim Kusznir
Sorry for the delayed response, I finally found where gmail hid this
response... :(

So the application is FusionPBX, a FreeSwitch-based VoIP system, running on
a very unloaded (1% cpu load, 2-4 VMs running) system.  I've been
experiencing intermittent call breakup, which external support immediately
blamed on the virtualization solution, claiming that "You can't
virtualize VoIP systems without causing voice breakup and other call
quality issues".  Previously, I had attempted to run FreePBX
(asterisk-based) on a Hyper-V system, and I did find that to be the case;
moving over to very weak, but dedicated hardware, fixed the problem
immediately.

Since I sent this message, I did extensive testing with my system, and it
appears that the breakup is in fact network related.  I've been able to do
phone to phone calls on the local network for extended durations without
issue, and even have phone to phone calls on external networks without
issue.  However, calls going to my VoIP provider do break up, so it appears
to be the network route to my provider.

So, oVirt does not appear to be to blame (which I didn't think so, but was
hoping for some "expert information" to support this...It appears that I
got that and more with my tests).

Thank you again for your work on such a great product!

--Jim

On Wed, Jan 4, 2017 at 10:08 AM, Chris Adams  wrote:

> Once upon a time, Yaniv Dary  said:
> > Can you please describe the application network requirements?
> > Does it relay on low latency? Pass-through or SR-IOV could help with
> > reducing that.
>
> For VoIP, latency can be an issue, but the amount of latency from adding
> VM networking overhead isn't a big deal (because other network latency
> will have a larger impact).  10ms isn't really a problem for VoIP for
> example.
>
> The bigger network concern for VoIP is jitter; for that, the only
> solution is to not over-provision hardware CPUs or total network
> bandwidth.
>
> --
> Chris Adams 
>


Re: [ovirt-users] Guest agent for CentOS

2017-01-08 Thread Jim Kusznir
Also, on the debian instructions, I found one error:

Under "starting the service", the 2nd line (after su - ) states:
service ovirt-guest-agent enable &&  service ovirt-guest-agent start

However, on debian systems, the first part won't work.  The working version
would read:
update-rc.d ovirt-guest-agent enable &&  service ovirt-guest-agent start
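Worth noting: that fix applies to sysvinit-style Debian setups. On systemd-based guests (including newer Debian releases and the CentOS 7 VMs discussed in this thread) the equivalent would be:

```shell
# Enable at boot and start now, via systemd instead of update-rc.d/service:
systemctl enable ovirt-guest-agent && systemctl start ovirt-guest-agent
```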

--Jim

On Sun, Jan 8, 2017 at 9:14 PM, Jim Kusznir <j...@palousetech.com> wrote:

> Hello:
>
> I'm wanting to install the guest agent in one of my CentOS VMs.  I looked
> on the documentation, and found this page:
>
> http://www.ovirt.org/documentation/internal/guest-
> agent/understanding-guest-agents-and-other-tools/
>
> It mentions that guest agent is available for CentOS, but does not provide
> a link to instructions.  Due to similarities with Fedora, I followed the
> link there.  It says to install "ovirt-guest-agent-common", but the package
> isn't found.  I tried adding the ovirt SIG repos for my CentOS (7), and
> re-searched, but still it is not found.  No links were provided to the RPM.
>
> Where do I find the CentOS guest agent?
>
> Perhaps the documentation online should be updated with this information,
> too...I'm sure I'm not the only one looking...
>
> --Jim
>


[ovirt-users] Guest agent for CentOS

2017-01-08 Thread Jim Kusznir
Hello:

I'm wanting to install the guest agent in one of my CentOS VMs.  I looked
on the documentation, and found this page:

http://www.ovirt.org/documentation/internal/guest-agent/understanding-guest-agents-and-other-tools/

It mentions that guest agent is available for CentOS, but does not provide
a link to instructions.  Due to similarities with Fedora, I followed the
link there.  It says to install "ovirt-guest-agent-common", but the package
isn't found.  I tried adding the ovirt SIG repos for my CentOS (7), and
re-searched, but still it is not found.  No links were provided to the RPM.

Where do I find the CentOS guest agent?

Perhaps the documentation online should be updated with this information,
too...I'm sure I'm not the only one looking...

--Jim


Re: [ovirt-users] unable to start VMs after upgrade

2017-01-08 Thread Jim Kusznir
Well, it turned out it was 100% of one core, percentage reported took into
account how many cores the VM had assigned.  Rebooting the node did fix the
problem.

Just to be clear, the "proper" procedure for rebooting a host in oVirt is
to put it in maintenance mode, ssh to the node, issue the reboot, then after
confirming its back up, right click on the node in the web UI and select
"confirm node reboot", then take it out of maintenance mode?

--Jim

On Sun, Jan 8, 2017 at 9:10 AM, Robert Story  wrote:

> On Sat, 7 Jan 2017 15:02:10 -0800 Jim wrote:
> JK> I went on about the work I came in to do, and tried to start up a VM.
> It
> JK> appeared to start, but it never booted.  It did  raise the CPU usage
> for
> JK> that VM, but console was all black, no resize or anything.  Tried
> several
> JK> settings.  This was on a VM I had just powered down.  I noticed it was
> JK> starting the VM on engine3, so I did a runonce specifying the vm start
> on
> JK> engine2.  Booted up just fine.  After booting, I could migrate to
> engine3,
> JK> and all was good.
> JK>
> JK> What happened?  I get no error messages, starting any vm on engine3,
> start
> JK> paused, attaching display, then running it, I always get the same
> thing:
> JK> blank console, about 50% cpu usage reported by the web interface, no
> JK> response on any network, and by all signs available to me, no actual
> JK> booting (reminds me of a PC that doesn't POST).  Simply changing the
> engine
> JK> it starts on to one that has not been upgraded fixes the problem.
>
> I had this issue too, except I had 100% cpu usage reported on the web
> interface. have you rebooted the troublesome host since it was upgraded? I
> think that was what solved it for me.
>
>
> Robert
>
> --
> Senior Software Engineer @ Parsons
>


[ovirt-users] ReinitializeFSM-EngineDown -- what does this mean?

2017-01-07 Thread Jim Kusznir
Hello:

I've been getting a bunch of e-mails from my ovirt system stating that a
"state transition" has occurred, first: StartState-ReinitializeFSM, then a
2nd e-mail ReinitializeFSM-EngineDown.

These are all for my host2 system, my hosted engine is running on host1.
Host2 appears to be working just fine, and has the majority of my VMs on it
at the moment.

Timing is also a bit weird:  Got my first one at 12:05AM this morning, then
2:40, 2:55am, 4:20, 4:25, and 4:40am, then 7:11, 7:40, 9:51am and 12:26PM.

I'd appreciate any insight!

--Jim


[ovirt-users] unable to start VMs after upgrade

2017-01-07 Thread Jim Kusznir
Hello:

I'm still fairly new to ovirt.  I'm running a 3-node cluster largely built
by Jason Brooks' howto for ovirt+gluster on the contributed docs section of
the ovirt webpage.

I had everything mostly working, and this morning when I logged in, I saw a
new symbol attached to all three of my hosts indicating an upgrade is
available.  So I clicked on engine3 and told it to upgrade.  It migrated my
VMs off, did its upgrade, and everything looked good.  I was able to
migrate a vm or two back, and they continued to function just fine.

Then I tried to upgrade engine1, which was running my hosted engine.  In
theory, all three engines/hosts were set up to be able to run the engine,
per Jason's instructions.  However, it failed to migrate the engine off
host1, and I realized that I still have the same issue I had on an earlier
incarnation of this cluster: inability to migrate the engine around.  Ok,
I'll deal with that later (with help from this list, hopefully).

I went on about the work I came in to do, and tried to start up a VM.  It
appeared to start, but it never booted.  It did raise the CPU usage for
that VM, but console was all black, no resize or anything.  Tried several
settings.  This was on a VM I had just powered down.  I noticed it was
starting the VM on engine3, so I did a runonce specifying the vm start on
engine2.  Booted up just fine.  After booting, I could migrate to engine3,
and all was good.

What happened?  I get no error messages, starting any vm on engine3, start
paused, attaching display, then running it, I always get the same thing:
blank console, about 50% cpu usage reported by the web interface, no
response on any network, and by all signs available to me, no actual
booting (reminds me of a PC that doesn't POST).  Simply changing the engine
it starts on to one that has not been upgraded fixes the problem.

I'd greatly appreciate your help:

1) how to fix it so the upgraded engine can start VMs again
2) How to fix the cluster so the HostedEngine can migrate between hosts
(and I'm able to put host1 in maintenance mode).

Ovirt 4 series, latest in repos as of last weekend (Jan1).

--Jim


[ovirt-users] Optimizations for VoIP VM

2017-01-03 Thread Jim Kusznir
Hello:

I set up a FreeSwitch-based VoIP server as a host on my cluster, and am
having audio problems.  I'm not 100% sure if its virtualization related or
network related yet.  But I would like to optimize my VM for VoIP (or
rather, tell Ovirt all the "right settings" to optimize that VM to VoIP).
Does anyone have any specific suggestions?

Are there known issues with VoIP on Ovirt-managed clusters?  (I know well
reputed companies that sell VoIP server virtual hosting and guarantee the
performance, so I know VoIP Virtualization is possible, just need to know
if its recommended with Ovirt, and if so what do I need to do to give it
the best chance of success?)

Thanks!
--Jim


Re: [ovirt-users] New install: can't install engine

2017-01-02 Thread Jim Kusznir
I did eventually figure out the issue: I misunderstood the question about
"cloud-init" as "engine-init", once I answered yes to cloud-init, I was
allowed to set a root password and run engine-setup.
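For anyone else hitting the same confusion: the cloud-init answer matters
because the appliance boots locked down until cloud-init injects root
credentials.  As a hedged illustration (this is not the setup's actual
code, and every value is a placeholder), the kind of cloud-config document
involved can be sketched like this:

```python
def build_user_data(root_password, ssh_key=None):
    """Sketch of a minimal cloud-config document that sets the root
    password and optionally installs an SSH key -- similar in spirit to
    what hosted-engine-setup feeds the appliance when you answer yes to
    cloud-init.  All values here are illustrative placeholders."""
    lines = [
        "#cloud-config",
        "ssh_pwauth: true",        # allow password logins over SSH
        "chpasswd:",
        "  expire: false",         # don't force a change on first login
        "  list: |",
        "    root:%s" % root_password,
    ]
    if ssh_key:
        lines += [
            "users:",
            "  - name: root",
            "    ssh_authorized_keys:",
            "      - %s" % ssh_key,
        ]
    return "\n".join(lines) + "\n"

print(build_user_data("MySecretPass", "ssh-rsa AAAA... admin@laptop"))
```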

I am curious about some of your questions too, actually.

I was following the instructions from here:
http://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/

In that, he said that in order to enable the gluster support in the engine,
one has to do it manually.  That seemed odd, or perhaps I misunderstood the
procedure.

My understanding was that after deploying my three hosts and gluster, I had
to deploy the hosted-appliance manually on host 1 so that after the initial
deploy, but before it was taken over by HA, I could go in and set the
gluster service to on, then finish the setup with the reboot, etc.

The other thing I was wondering about was setting up the additional hosts.
The instructions say to ssh to each host and deploy the engine there
through the ssh command line.  The tool itself says that I shouldn't be
doing it, I should be running through the website.  However, when I did
that, I had a number of issues that I wasn't able to correct, and when I
did it through ssh, I had the best functioning ovirt build yet.

One of the issues I had was when I tried to create a private network for
gluster sync'ing, the web interface saw the gluster IPs for hosts 2 and 3
(although host 1 was added with its ovirt/real management IP, and the two
networks were not routed).  The web interface host add ended up adding
hosts 2 and 3 with the gluster IP, and thus things broke a lot.  I had no
means of overriding those settings that I saw through the web interface.
When adding through SSH, I had a lot more control over adding the hosts,
and was able to add them "correctly".

I suspect this means the overall procedure was in error, and I'd like to
learn the "better" way to do it.

--Jim

On Mon, Jan 2, 2017 at 3:38 AM, Simone Tiraboschi <stira...@redhat.com>
wrote:

>
>
> On Mon, Jan 2, 2017 at 9:39 AM, Sandro Bonazzola <sbona...@redhat.com>
> wrote:
>
>>
>>
>> On Fri, Dec 30, 2016 at 7:31 PM, Jim Kusznir <j...@palousetech.com> wrote:
>>
>>> Hi all:
>>>
>>> I'm trying to set up a new ovirt cluster.  I got it "mostly working"
>>> earlier, but wanted to change some physical networking stuff, and so I
>>> thought I'd blow away my machines and rebuild.  I followed the same recipe
>>> to build it all, but now I'm failing at a point that previously worked.
>>>
>>> I've built a 3 node cluster with glusterfs backing (3 brick replica),
>>> and all that is good and well.  I run the engine-setup --deploy, and it
>>> does its stuff, asks me (among other things) the admin password, I type in
>>> the password I want it to use (just like last time), then it says to log
>>> into the new VM and run engine-setup.  Here's the problem: I try to ssh in
>>> as root, and it will NOT accept my password.  It worked a couple days ago,
>>> doing it the exact same way, but it will not work now.
>>>
>>
> Hi Jim,
> sorry but why do you need to manually run engine-setup in the engine VM?
> If you are deploying with the ovirt-engine-appliance, hosted-engine-setup
> will run it for you with the right parameters.
>
>
>> I've destroyed and re-deployed several times, I've even done a low level
>>> wipe of all three nodes and rebuilt everything, and again, it doesn't work.
>>>
>>> My only guess is that one of the packages the gdeploy script changed,
>>> and it has a bug or "new feature" that breaks this for some reason.
>>> Unfortunately, I do not have the package versions that worked or the
>>> current list to compare to, so I cannot confirm this.
>>>
>>> In any case, I'm completely stuck here...I can't log in to run
>>> engine-deploy, and I don't know enough of the console/low level stuff to
>>> try and hack my way into the VM (eg, to manually mount the disk image and
>>> replace the password or put my SSH key in).
>>>
>>> Suggestions?  Can anyone else replicate this?
>>>
>>
>> Can you please provide logs?
>>
>>
>>
>>>
>>> --Jim
>>>
>>>
>>>
>>
>>
>> --
>> Sandro Bonazzola
>> Better technology. Faster innovation. Powered by community collaboration.
>> See how it works at redhat.com
>>
>>
>>
>


Re: [ovirt-users] creating a vlan-tagged network

2017-01-02 Thread Jim Kusznir
Actually, I finally was able to identify the issue and fix it...Turns out
(as you probably expected), it wasn't ovirt...

My upstream provider had some weird security left over: it limited the MAC
addresses permitted to exit the building, and my ovirt host made the list
somehow while my VMs did not.

I now have two VMs on two different nodes that are online!

Thank you for your help!

--Jim

On Sun, Jan 1, 2017 at 11:57 PM, Edward Haas <eh...@redhat.com> wrote:

>
>
> On Sun, Jan 1, 2017 at 7:16 PM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> I pinged both the router on the subnet and a host IP in-between the two
>> ip's.
>>
>> [root@ovirt3 ~]# ping -I 162.248.147.33 162.248.147.1
>> PING 162.248.147.1 (162.248.147.1) from 162.248.147.33 : 56(84) bytes of
>> data.
>> 64 bytes from 162.248.147.1: icmp_seq=1 ttl=255 time=8.17 ms
>> 64 bytes from 162.248.147.1: icmp_seq=2 ttl=255 time=7.47 ms
>> 64 bytes from 162.248.147.1: icmp_seq=3 ttl=255 time=7.53 ms
>> 64 bytes from 162.248.147.1: icmp_seq=4 ttl=255 time=8.42 ms
>> ^C
>> --- 162.248.147.1 ping statistics ---
>> 4 packets transmitted, 4 received, 0% packet loss, time 3004ms
>> rtt min/avg/max/mdev = 7.475/7.901/8.424/0.420 ms
>> [root@ovirt3 ~]#
>>
>> The VM only has its public IP.
>>
>> --Jim
>>
>
> Very strange, all looks good to me.
>
> I can try to help you debug using tcpdump, just send me the details for
> remote connection on private.
> It will also help if you join the vdsm or ovirt IRC channels.
>
>
>>
>> On Jan 1, 2017 01:26, "Edward Haas" <eh...@redhat.com> wrote:
>>
>>>
>>>
>>> On Sun, Jan 1, 2017 at 10:50 AM, Jim Kusznir <j...@palousetech.com>
>>> wrote:
>>>
>>>> I currently only have two IPs assigned to me...I can try and take
>>>> another, but that may not route out of the rack.  I've got the VM on one of
>>>> the IPs and the host on the other currently.
>>>>
>>>> The switch is a "web-managed" basic 8-port switch (thrown in for
>>>> testing while the "real" switch is in transit).  It has the 3 ports the
>>>> hosts are plugged in configured with vlan 1 untagged, set as PVID, and vlan
>>>> 2 tagged.  Another port on the switch is untagged on vlan 1 connected to
>>>> the router for the ovirtmgmt network (protected by a VPN, but not "burning"
>>>> public IPs for mgmt purposes), another couple ports are untagged on vlan
>>>> 2.  One of those ports goes out of the rack, another goes to the router's
>>>> internet port.  Router gets to the internet just fine.
>>>>
>>>> VM:
>>>> kusznir@FusionPBX:~$ ip address
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>> group default
>>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> inet 127.0.0.1/8 scope host lo
>>>>valid_lft forever preferred_lft forever
>>>> inet6 ::1/128 scope host
>>>>valid_lft forever preferred_lft forever
>>>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>>>> state UP group default qlen 1000
>>>> link/ether 00:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff
>>>> inet 162.248.147.31/24 brd 162.248.147.255 scope global eth0
>>>>valid_lft forever preferred_lft forever
>>>> inet6 fe80::21a:4aff:fe16:151/64 scope link
>>>>valid_lft forever preferred_lft forever
>>>> kusznir@FusionPBX:~$ ip route
>>>> default via 162.248.147.1 dev eth0
>>>> 162.248.147.0/24 dev eth0  proto kernel  scope link  src
>>>> 162.248.147.31
>>>> kusznir@FusionPBX:~$
>>>>
>>>> Host:
>>>> [root@ovirt3 ~]# ip address
>>>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>>>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>>>> inet 127.0.0.1/8 scope host lo
>>>>valid_lft forever preferred_lft forever
>>>> inet6 ::1/128 scope host
>>>>valid_lft forever preferred_lft forever
>>>> 2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master
>>>> ovirtmgmt state UP qlen 1000
>>>> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
>>>> 3: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN qlen 1000
>>>> link/ether 00:21:9b:98:2f:46 brd ff:ff:ff:ff:ff:ff
>>>

Re: [ovirt-users] creating a vlan-tagged network

2017-01-01 Thread Jim Kusznir
I pinged both the router on the subnet and a host IP in-between the two
ip's.

[root@ovirt3 ~]# ping -I 162.248.147.33 162.248.147.1
PING 162.248.147.1 (162.248.147.1) from 162.248.147.33 : 56(84) bytes of
data.
64 bytes from 162.248.147.1: icmp_seq=1 ttl=255 time=8.17 ms
64 bytes from 162.248.147.1: icmp_seq=2 ttl=255 time=7.47 ms
64 bytes from 162.248.147.1: icmp_seq=3 ttl=255 time=7.53 ms
64 bytes from 162.248.147.1: icmp_seq=4 ttl=255 time=8.42 ms
^C
--- 162.248.147.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 7.475/7.901/8.424/0.420 ms
[root@ovirt3 ~]#
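When checking several addresses from several machines, it can help to
script the comparison.  A small sketch (mine, not from the thread) that
pulls packet loss and average RTT out of iputils `ping` output like the
transcript above:

```python
import re

def parse_ping_summary(output):
    """Return (packet_loss_percent, avg_rtt_ms) from iputils `ping`
    output, using None for a field that isn't present (e.g. the rtt
    line is omitted entirely at 100% loss)."""
    loss_m = re.search(r"([\d.]+)% packet loss", output)
    rtt_m = re.search(r"rtt min/avg/max/mdev = [\d.]+/([\d.]+)/", output)
    loss = float(loss_m.group(1)) if loss_m else None
    rtt = float(rtt_m.group(1)) if rtt_m else None
    return loss, rtt

sample = """4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 7.475/7.901/8.424/0.420 ms"""
print(parse_ping_summary(sample))  # -> (0.0, 7.901)
```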

The VM only has its public IP.

--Jim

On Jan 1, 2017 01:26, "Edward Haas" <eh...@redhat.com> wrote:

>
>
> On Sun, Jan 1, 2017 at 10:50 AM, Jim Kusznir <j...@palousetech.com> wrote:
>
>> I currently only have two IPs assigned to me...I can try and take
>> another, but that may not route out of the rack.  I've got the VM on one of
>> the IPs and the host on the other currently.
>>
>> The switch is a "web-managed" basic 8-port switch (thrown in for testing
>> while the "real" switch is in transit).  It has the 3 ports the hosts are
>> plugged in configured with vlan 1 untagged, set as PVID, and vlan 2
>> tagged.  Another port on the switch is untagged on vlan 1 connected to the
>> router for the ovirtmgmt network (protected by a VPN, but not "burning"
>> public IPs for mgmt purposes), another couple ports are untagged on vlan
>> 2.  One of those ports goes out of the rack, another goes to the router's
>> internet port.  Router gets to the internet just fine.
>>
>> VM:
>> kusznir@FusionPBX:~$ ip address
>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
>> default
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> inet 127.0.0.1/8 scope host lo
>>valid_lft forever preferred_lft forever
>> inet6 ::1/128 scope host
>>valid_lft forever preferred_lft forever
>> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>> state UP group default qlen 1000
>> link/ether 00:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff
>> inet 162.248.147.31/24 brd 162.248.147.255 scope global eth0
>>valid_lft forever preferred_lft forever
>> inet6 fe80::21a:4aff:fe16:151/64 scope link
>>valid_lft forever preferred_lft forever
>> kusznir@FusionPBX:~$ ip route
>> default via 162.248.147.1 dev eth0
>> 162.248.147.0/24 dev eth0  proto kernel  scope link  src 162.248.147.31
>> kusznir@FusionPBX:~$
>>
>> Host:
>> [root@ovirt3 ~]# ip address
>> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
>> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>> inet 127.0.0.1/8 scope host lo
>>valid_lft forever preferred_lft forever
>> inet6 ::1/128 scope host
>>valid_lft forever preferred_lft forever
>> 2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master
>> ovirtmgmt state UP qlen 1000
>> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
>> 3: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN qlen 1000
>> link/ether 00:21:9b:98:2f:46 brd ff:ff:ff:ff:ff:ff
>> 4: em3: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN qlen 1000
>> link/ether 00:21:9b:98:2f:48 brd ff:ff:ff:ff:ff:ff
>> 5: em4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
>> qlen 1000
>> link/ether 00:21:9b:98:2f:4a brd ff:ff:ff:ff:ff:ff
>> 6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
>> link/ether 8e:1b:51:60:87:55 brd ff:ff:ff:ff:ff:ff
>> 7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>> state UP
>> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
>> inet 192.168.8.13/24 brd 192.168.8.255 scope global dynamic ovirtmgmt
>>valid_lft 54830sec preferred_lft 54830sec
>> 11: em1.2@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
>> master Public_Cable state UP
>> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
>> 12: Public_Cable: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
>> noqueue state UP
>> link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
>> inet 162.248.147.33/24 brd 162.248.147.255 scope global Public_Cable
>>valid_lft forever preferred_lft forever
>> 14: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
>> master ovirtmgmt state UNKNOWN qlen 500
>> link/ether fe:1a:4a:16:01:54 b

Re: [ovirt-users] creating a vlan-tagged network

2017-01-01 Thread Jim Kusznir
I currently only have two IPs assigned to me...I can try and take another,
but that may not route out of the rack.  I've got the VM on one of the IPs
and the host on the other currently.

The switch is a "web-managed" basic 8-port switch (thrown in for testing
while the "real" switch is in transit).  It has the 3 ports the hosts are
plugged in configured with vlan 1 untagged, set as PVID, and vlan 2
tagged.  Another port on the switch is untagged on vlan 1 connected to the
router for the ovirtmgmt network (protected by a VPN, but not "burning"
public IPs for mgmt purposes), another couple ports are untagged on vlan
2.  One of those ports goes out of the rack, another goes to the router's
internet port.  Router gets to the internet just fine.

VM:
kusznir@FusionPBX:~$ ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group
default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state
UP group default qlen 1000
link/ether 00:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff
inet 162.248.147.31/24 brd 162.248.147.255 scope global eth0
   valid_lft forever preferred_lft forever
inet6 fe80::21a:4aff:fe16:151/64 scope link
   valid_lft forever preferred_lft forever
kusznir@FusionPBX:~$ ip route
default via 162.248.147.1 dev eth0
162.248.147.0/24 dev eth0  proto kernel  scope link  src 162.248.147.31
kusznir@FusionPBX:~$

Host:
[root@ovirt3 ~]# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master
ovirtmgmt state UP qlen 1000
link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
3: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN qlen 1000
link/ether 00:21:9b:98:2f:46 brd ff:ff:ff:ff:ff:ff
4: em3: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN qlen 1000
link/ether 00:21:9b:98:2f:48 brd ff:ff:ff:ff:ff:ff
5: em4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN
qlen 1000
link/ether 00:21:9b:98:2f:4a brd ff:ff:ff:ff:ff:ff
6: ;vdsmdummy;: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
link/ether 8e:1b:51:60:87:55 brd ff:ff:ff:ff:ff:ff
7: ovirtmgmt: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
state UP
link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
inet 192.168.8.13/24 brd 192.168.8.255 scope global dynamic ovirtmgmt
   valid_lft 54830sec preferred_lft 54830sec
11: em1.2@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
master Public_Cable state UP
link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
12: Public_Cable: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP
link/ether 00:21:9b:98:2f:44 brd ff:ff:ff:ff:ff:ff
inet 162.248.147.33/24 brd 162.248.147.255 scope global Public_Cable
   valid_lft forever preferred_lft forever
14: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
master ovirtmgmt state UNKNOWN qlen 500
link/ether fe:1a:4a:16:01:54 brd ff:ff:ff:ff:ff:ff
inet6 fe80::fc1a:4aff:fe16:154/64 scope link
   valid_lft forever preferred_lft forever
15: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
master ovirtmgmt state UNKNOWN qlen 500
link/ether fe:1a:4a:16:01:52 brd ff:ff:ff:ff:ff:ff
inet6 fe80::fc1a:4aff:fe16:152/64 scope link
   valid_lft forever preferred_lft forever
16: vnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
master ovirtmgmt state UNKNOWN qlen 500
link/ether fe:1a:4a:16:01:53 brd ff:ff:ff:ff:ff:ff
inet6 fe80::fc1a:4aff:fe16:153/64 scope link
   valid_lft forever preferred_lft forever
17: vnet3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
master Public_Cable state UNKNOWN qlen 500
link/ether fe:1a:4a:16:01:51 brd ff:ff:ff:ff:ff:ff
inet6 fe80::fc1a:4aff:fe16:151/64 scope link
   valid_lft forever preferred_lft forever
[root@ovirt3 ~]# ip route
default via 192.168.8.1 dev ovirtmgmt
162.248.147.0/24 dev Public_Cable  proto kernel  scope link  src
162.248.147.33
169.254.0.0/16 dev ovirtmgmt  scope link  metric 1007
169.254.0.0/16 dev Public_Cable  scope link  metric 1012
192.168.8.0/24 dev ovirtmgmt  proto kernel  scope link  src 192.168.8.13
[root@ovirt3 ~]# brctl show
bridge name bridge id STP enabled interfaces
;vdsmdummy; 8000. no
Public_Cable 8000.00219b982f44 no em1.2
vnet3
ovirtmgmt 8000.00219b982f44 no em1
vnet0
vnet1
vnet2
[root@ovirt3 ~]#
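The `brctl show` output above is the key evidence that the host side is
wired correctly: em1.2 and the VM's vnet3 sit on the same bridge.  For a
scripted health check across several hosts, here is a sketch (mine) of a
parser for that flattened output:

```python
def parse_brctl_show(text):
    """Map bridge name -> member interfaces from `brctl show` output.
    Rows with several fields start a new bridge; bare single-field lines
    are continuation rows naming one more member of the previous bridge."""
    bridges, current = {}, None
    for line in text.splitlines()[1:]:   # skip the header row
        parts = line.split()
        if not parts:
            continue
        if len(parts) >= 2:              # bridge row: name, id, stp, [member]
            current = parts[0]
            bridges[current] = parts[3:]
        elif current is not None:        # continuation row: one member
            bridges[current].append(parts[0])
    return bridges

sample = """bridge name bridge id STP enabled interfaces
;vdsmdummy; 8000. no
Public_Cable 8000.00219b982f44 no em1.2
vnet3
ovirtmgmt 8000.00219b982f44 no em1
vnet0
vnet1
vnet2"""
print(parse_brctl_show(sample)["Public_Cable"])  # -> ['em1.2', 'vnet3']
```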

I did see that the cluster settings have a switch type setting; currently
at the default "LEGACY", it also has "OVS" as an option.  Not 

[ovirt-users] creating a vlan-tagged network

2016-12-31 Thread Jim Kusznir
Hi all:

I've got my ovirt cluster up, but am facing an odd situation that I haven't
pinned down.  I've also run into someone on the IRC channel with the same
bug, no solutions as of yet.  Google also hasn't helped.

My goal is this:

1 physical NIC; two networks:
ovirtmgmt (untagged)
Public (vlan 2)

ovirtmgmt works great.  a VM on Public cannot talk to anything off the host.

Steps to set up:

Datacenter -> networks: created network, checked vm network, checked vlan,
put 2 in the tag box.  Set required.  Save.

I only have one cluster (default), and it automatically added it there.  I
went to the hosts in the cluster, and dragged the unassigned Public network
onto the nic (which already has ovirtmgmt on it).  After completing on all
three of my hosts, the network shows online.

Create VM, assign to Public, inside VM assign its IP, and it cannot talk to
the world.
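For the record, the same definition those UI steps produce can also be
created over the engine's REST API (a POST to /ovirt-engine/api/networks).
A hedged sketch of building the request body follows; the element names
reflect the oVirt 4 REST schema as I understand it, so verify them against
your engine's /api before relying on this:

```python
import xml.etree.ElementTree as ET

def vlan_network_xml(name, vlan_id, required=True):
    """Build a request body for POST /ovirt-engine/api/networks mirroring
    the UI steps above (VM network, VLAN tag, required).  A sketch only --
    check the element names against your engine's published schema."""
    net = ET.Element("network")
    ET.SubElement(net, "name").text = name
    ET.SubElement(net, "vlan", id=str(vlan_id))        # the VLAN tag box
    usages = ET.SubElement(net, "usages")
    ET.SubElement(usages, "usage").text = "vm"         # "VM network" checkbox
    ET.SubElement(net, "required").text = str(required).lower()
    return ET.tostring(net, encoding="unicode")

print(vlan_network_xml("Public", 2))
```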

In troubleshooting, I assigned another IP to the host itself (click pencil
in host network settings).  VM can ping host.  SSH into host, host CAN ping
other machines on the net and the router for the net.  VM cannot ping
anything but host (only have one VM on that host currently).  VM is
isolated until I move it to ovirtmgmt network, then it can get off the host
to the world, etc.

I tried disabling iptables just in case, but that had no effect.

How do I troubleshoot this further?

--Jim


[ovirt-users] New install: can't install engine

2016-12-30 Thread Jim Kusznir
Hi all:

I'm trying to set up a new ovirt cluster.  I got it "mostly working"
earlier, but wanted to change some physical networking stuff, and so I
thought I'd blow away my machines and rebuild.  I followed the same recipe
to build it all, but now I'm failing at a point that previously worked.

I've built a 3 node cluster with glusterfs backing (3 brick replica), and
all that is good and well.  I run the engine-setup --deploy, and it does
its stuff, asks me (among other things) the admin password, I type in the
password I want it to use (just like last time), then it says to log into
the new VM and run engine-setup.  Here's the problem: I try to ssh in as
root, and it will NOT accept my password.  It worked a couple days ago,
doing it the exact same way, but it will not work now.

I've destroyed and re-deployed several times; I've even done a low-level
wipe of all three nodes and rebuilt everything, and again, it doesn't work.

My only guess is that one of the packages the gdeploy script changed, and
it has a bug or "new feature" that breaks this for some reason.
Unfortunately, I do not have the package versions that worked or the
current list to compare to, so I cannot confirm this.

In any case, I'm completely stuck here...I can't log in to run
engine-deploy, and I don't know enough of the console/low level stuff to
try and hack my way into the VM (eg, to manually mount the disk image and
replace the password or put my SSH key in).

Suggestions?  Can anyone else replicate this?

--Jim


[ovirt-users] New oVirt user

2016-12-28 Thread Jim Kusznir
Hello:

I've been involved in virtualization from its very early days, and been
running linux virtualization solutions off and on for a decade.
Previously, I was always frustrated with the long feature list offered by
many linux virtualization systems but with no reasonable way to manage
that.  It seemed that I had to spend an inordinate amount of time doing
everything by hand.  Thus, when I found oVirt, I was ecstatic!
Unfortunately, at that time I changed employment (or rather left employment
and became self-employed), and didn't have any reason to build my own virt
cluster..until now!

So I'm back with oVirt, and actually deploying a small 3-node cluster.  I
intend to run on it:
VoIP Server
Web Server
Business backend server
UniFi management server
Monitoring server (zabbix)

Not a heavy load, and 3 servers is probably overkill, but I need this to
work, and it sounds like 3 is the magic entry level for all the
cluster/failover stuff to work.  For now, my intent is to use a single SSD
on each node with gluster for the storage backend.  I figure if all the
failover stuff actually working, if I loose a node due to disk failure, its
not the end of the world.  I can rebuild it, reconnect gluster, and restart
everything.  As this is for a startup business, funds are thin at the
moment, so I'm trying to cut a couple corners that don't affect overall
reliability.  If this side of the business grows more, I would likely
invest in some dedicated servers.

So far, I've based my efforts around this guide on oVirt's website:
http://www.ovirt.org/blog/2016/08/up-and-running-with-ovirt-4-0-and-gluster-storage/

My cluster is currently functioning, but not entirely correctly.  Some of
it is gut feel, some of it is specific test cases (more to follow).  First,
some areas that lacked clarity and the choices I made in them:

Early on, Jason talks about using a dedicated gluster network for the
gluster storage sync'ing.  I liked that idea, and as I had 4 nics on each
machine, I thought dedicating one or two to gluster would be fine.  So, on
my clean, bare machines, I setup another network with private NiCs and put
it on a standalone switch.  I added hostnames with a designator (-g on the
end) for the IPs for all three nodes into /etc/hosts on all three nodes so
now each node can resolve itself and the other nodes on the -g name (and
private IP) as well as their main host name and "more public" (but not
public) IP.
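To make that dual-name scheme reproducible across rebuilds, the /etc/hosts
fragment can be generated rather than hand-typed.  A sketch follows; the
10.0.0.x gluster-side addresses are placeholders I invented, while the
192.168.8.x range matches the management network shown elsewhere in these
threads:

```python
def gluster_hosts_entries(nodes):
    """Given {short_hostname: (mgmt_ip, gluster_ip)}, emit the /etc/hosts
    lines described above: each node resolvable by its normal name and by
    a '-g' name on the private gluster network."""
    lines = []
    for name, (mgmt_ip, gluster_ip) in sorted(nodes.items()):
        lines.append("%s\t%s" % (mgmt_ip, name))
        lines.append("%s\t%s-g" % (gluster_ip, name))
    return "\n".join(lines)

nodes = {
    "ovirt1": ("192.168.8.11", "10.0.0.11"),   # placeholder addressing
    "ovirt2": ("192.168.8.12", "10.0.0.12"),
    "ovirt3": ("192.168.8.13", "10.0.0.13"),
}
print(gluster_hosts_entries(nodes))
```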

Then, for gdeploy, I put the hostnames in as the -g hostnames, as I didn't
see anywhere to tell gluster to use the private network.  I think this is a
place I went wrong, but didn't realize it until the end.

I set up the gdeploy script (it took a few times, and a few OS rebuilds to
get it just right...), and ran it, and it was successful!  When complete, I
had a working gluster cluster and the right software installed on each node!

I set up the engine on node1, and that worked, and I was able to log in to
the web gui.  I mistakenly skipped the web gui enable gluster service
before doing the engine vm reboot to complete the engine setup process, but
I did go back in after the reboot and do that.  After doing that, I was
notified in the gui that there were additional nodes, did I want to add
them.  Initially, I skipped that and went back to the command line as Jason
suggests.  Unfortunately, it could not find any other nodes through his
method, and it didn't work.  Combined with the warnings that I should not
be using the command line method and that it would be removed in the next
release, I went back to the gui and attempted to add the nodes that way.

Here's where things appeared to go wrong...It showed me two additional
nodes, but ONLY by their -g (private gluster) hostname.  And the ssh
fingerprints were not populated, so it would not let me proceed.  After
messing with this for a bit, I realized that the engine cannot get to the
nodes via the gluster interface (and as far as I knew, it shouldn't).
Working late at night, I let myself "hack it up" a bit, and on the engine
VM, I added /etc/hosts entries for the -g hostnames pointing to the main
IPs.  It then populated the ssh host keys and let me add them in.  Ok, so
things appear to be working... kinda.  I noticed at this point that ALL
aspects of the gui became VERY slow.  Clicking in and typing in any field
felt like I was on ssh over a satellite link.  Everything felt a bit worse
than the early days of vSphere... painfully slow.  But it was still
working, so I pressed on.

I configured gluster storage.  Eventually I was successful, but initially
it would only let me add a "Data" storage domain; the drop-down menu did
NOT contain iso, export, or anything else.  Somehow, after leaving and
re-entering that tab a few times, iso and export materialized in the menu,
so I was able to finish that setup.

Ok, all looks good.  I wanted to try out his little tip on adding a VM,
too.  I saw "ovirt-image-repository" in the "external