With the cluster spiraling downward and customer complaints increasing, I went ahead and finished upgrading the nodes to ovirt 4.2 and gluster 3.12. It didn't seem to help at all.
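One way to get a raw performance report is straight from the gluster CLI rather than the GUI. A minimal sketch (data-hdd is the volume name that appears elsewhere in this thread; substitute your own, and the `command -v` guard just makes the script safe to run on a machine without the gluster CLI):

```shell
# Sketch: collect and print a gluster profile report from the CLI.
# VOL is assumed to be the data-hdd volume from this thread; substitute yours.
VOL=data-hdd
if command -v gluster >/dev/null 2>&1; then
  gluster volume profile "$VOL" start   # begin collecting per-brick stats
  sleep 60                              # let some I/O accumulate
  gluster volume profile "$VOL" info    # per-FOP latency and call counts, per brick
else
  echo "gluster CLI not found; run this on one of the gluster nodes"
fi
```

The `info` output lists average/min/max latency per file operation for each brick, which should show whether one brick (or one node) is dragging the whole volume down.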
I DO have one brick down on ONE of my 4 gluster filesystems/exports/whatever; the other 3 are fully available. However, I still see heavy I/O wait, including on the perfectly healthy filesystems. It's bad enough that I get ovirt e-mails warning of hosts going down and coming back up, and VMs on the good gluster filesystem are reporting I/O waits of greater than 60% in top! I have applications crashing because of the I/O wait. I do think I got glusterfs profiling running, but I don't know how to get a useful report out (it's in the ovirt GUI). I did see read and write operations showing about 30 seconds; I would have expected that to be MUCH better. (As I write this, my core VoIP server is showing a 99.1% I/O wait load, and that is customer calls failing/dropping.) PLEASE, how do I FIX this?

--Jim

On Tue, May 29, 2018 at 12:14 PM, Jim Kusznir <[email protected]> wrote:

> On one ovirt server, I'm now seeing these messages:
>
> [56474.239725] blk_update_request: 63 callbacks suppressed
> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>
> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <[email protected]> wrote:
>
>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>
>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>
>> (This appears a lot.)
>>
>> I also found, in the ssh session on that machine, some syslog warnings about the backing disk for one of the gluster volumes (straight replica 3). The glusterfs process for that disk on that machine went offline. It's my understanding that the volume should continue to work with the other two machines while I attempt to replace that disk, right? Attempted writes (touching an empty file) can take 15 seconds, though repeating the write later is much faster.
>>
>> Gluster generates a bunch of different log files; I don't know which ones you want, or from which machine(s).
>>
>> How do I do "volume profiling"?
>>
>> Thanks!
>>
>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <[email protected]> wrote:
>>
>>> Do you see errors reported in the mount logs for the volume? If so, could you attach the logs?
>>> Any issues with your underlying disks? Can you also attach the output of volume profiling?
>>>
>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <[email protected]> wrote:
>>>
>>>> Ok, things have gotten MUCH worse this morning. I'm getting random errors from VMs; right now about a third of my VMs have been paused due to storage issues, and most of the remaining VMs are not performing well.
>>>>
>>>> At this point I am in full EMERGENCY mode, as my production services are now impacted, and I'm getting calls coming in with problems...
>>>>
>>>> I'd greatly appreciate help. VMs are running VERY slowly (when they run), and they are steadily getting worse. I don't know why.
>>>> I was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a time (while the VM became unresponsive and any Linux VMs I was logged into were giving me the CPU-stuck messages from my original post). Is all this storage related?
>>>>
>>>> I also have two different gluster volumes for VM storage, and only one had the issues, but now VMs in both are being affected at the same time and in the same way.
>>>>
>>>> --Jim
>>>>
>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <[email protected]> wrote:
>>>>
>>>>> [Adding gluster-users to look at the heal issue]
>>>>>
>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <[email protected]> wrote:
>>>>>
>>>>>> Hello:
>>>>>>
>>>>>> I've been having some cluster and gluster performance issues lately. I also found that my cluster was out of date and was trying to apply updates (hoping to fix some of these), and discovered the ovirt 4.1 repos were taken completely offline. So I was forced to begin an upgrade to 4.2. According to the docs I found/read, I needed only to add the new repo, do a yum update, reboot, and be good on my hosts (I did the yum update, then engine-setup on my hosted engine). Things seemed to work relatively well, except for a gluster sync issue that showed up.
>>>>>>
>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the hosted engine first, then engine 3. When engine 3 came back up, for some reason one of my gluster volumes would not sync.
>>>>>> Here's sample output:
>>>>>>
>>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> ---------
>>>>>> It's been in this state for a couple of days now, and bandwidth monitoring shows no appreciable data moving. I've tried repeatedly commanding a full heal from all three nodes in the cluster. It's always the same files that need healing.
>>>>>>
>>>>>> When running "gluster volume heal data-hdd statistics", I sometimes see different information, but always some number of "heal failed" entries. It shows 0 for split brain.
>>>>>>
>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1 and 2 still being on the older ovirt/gluster release, but I'm afraid to upgrade and reboot them until I have a good gluster sync (I don't need to create a split-brain issue). How do I proceed with this?
>>>>>>
>>>>>> Second issue: I've been experiencing VERY POOR performance on most of my VMs, to the tune that logging into a Windows 10 VM via remote desktop can take 5 minutes, and launching QuickBooks inside said VM can easily take 10 minutes. On some Linux VMs, I get random messages like this:
>>>>>>
>>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]
>>>>>>
>>>>>> (The process and PID are often different.)
>>>>>>
>>>>>> I'm not quite sure what to do about this either. My initial thought was to upgrade everything to current and see if it's still there, but I cannot move forward with that until my gluster is healed...
>>>>>>
>>>>>> Thanks!
>>>>>> --Jim
>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list -- [email protected]
>>>>>> To unsubscribe send an email to [email protected]
>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
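For the heal backlog described above, the standard CLI checks look like this. A sketch only: data-hdd is the volume named in the thread, and the `command -v` guard just keeps the script harmless on a machine without the gluster CLI.

```shell
# Sketch: confirm brick/self-heal-daemon status, queue a full heal,
# and watch the backlog; the volume name is taken from the thread above.
VOL=data-hdd
if command -v gluster >/dev/null 2>&1; then
  gluster volume status "$VOL"                      # are all bricks and self-heal daemons online?
  gluster volume heal "$VOL" full                   # queue a full heal
  gluster volume heal "$VOL" statistics heal-count  # per-brick count of entries still pending
  gluster volume heal "$VOL" info split-brain       # entries needing manual resolution
else
  echo "gluster CLI not found; run this on one of the gluster nodes"
fi
```

If `heal-count` never decreases while `volume status` shows every brick and self-heal daemon online, the self-heal daemon logs (glustershd.log on each node) are the next place to look.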
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/44KH457EQ5QQJR2WOFRU3WWNM2TWHM3R/

