On one ovirt server, I'm now seeing these messages:

[56474.239725] blk_update_request: 63 callbacks suppressed
[56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
[56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
[56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
[56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
[56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
[56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
[56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
[56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
[56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
[56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
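I'm trying to map dm-2 back to a physical disk first. As far as I know, the standard device-mapper tooling is enough for that (a sketch; /dev/sdX at the end is a placeholder for whatever backing disk it turns out to be):

# ls -l /dev/mapper/      <- the symlinks show which mapper name points at ../dm-2
# lsblk                   <- full device tree, shows what sits underneath that mapping
# smartctl -a /dev/sdX    <- then check the health of the backing disk (needs smartmontools)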
On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <[email protected]> wrote:

> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>
> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
> (appears a lot)
>
> I also found, in the ssh session on that machine, some sysv warnings about the backing disk for one of the gluster volumes (straight replica 3). The glusterfs process for that disk on that machine went offline. It's my understanding that it should continue to work with the other two machines while I attempt to replace that disk, right? Attempted writes (touching an empty file) can take 15 seconds, though repeating the same write later is much faster.
>
> Gluster generates a bunch of different log files; I don't know which ones you want, or from which machine(s).
>
> How do I do "volume profiling"?
>
> Thanks!
>
> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <[email protected]> wrote:
>
>> Do you see errors reported in the mount logs for the volume? If so, could you attach the logs?
>> Any issues with your underlying disks? Can you also attach the output of volume profiling?
>>
>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <[email protected]> wrote:
>>
>>> Ok, things have gotten MUCH worse this morning. I'm getting random errors from VMs; right now about a third of my VMs have been paused due to storage issues, and most of the remaining VMs are not performing well.
>>>
>>> At this point, I am in full EMERGENCY mode, as my production services are now impacted, and I'm getting calls coming in with problems...
>>>
>>> I'd greatly appreciate help... VMs are running VERY slowly (when they run), and they are steadily getting worse. I don't know why. I was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a time (while the VM became unresponsive, and any Linux VMs I was logged into were giving me the "CPU stuck" messages from my original post). Is all this storage related?
>>>
>>> I also have two different gluster volumes for VM storage, and only one had the issues, but now VMs on both are being affected at the same time and in the same way.
>>>
>>> --Jim
>>>
>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <[email protected]> wrote:
>>>
>>>> [Adding gluster-users to look at the heal issue]
>>>>
>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <[email protected]> wrote:
>>>>
>>>>> Hello:
>>>>>
>>>>> I've been having some cluster and gluster performance issues lately. I also found that my cluster was out of date and was trying to apply updates (hoping to fix some of these), and discovered the ovirt 4.1 repos were taken completely offline. So, I was forced to begin an upgrade to 4.2. According to the docs I found/read, I needed only to add the new repo, do a yum update, and reboot to be good on my hosts (I did the yum update, then ran engine-setup on my hosted engine). Things seemed to work relatively well, except for a gluster sync issue that showed up.
>>>>>
>>>>> My cluster is a 3 node hyperconverged cluster. I upgraded the hosted engine first, then engine 3.
>>>>> When engine 3 came back up, for some reason one of my gluster volumes would not sync. Here's sample output:
>>>>>
>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>> Status: Connected
>>>>> Number of entries: 8
>>>>>
>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>> Status: Connected
>>>>> Number of entries: 8
>>>>>
>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>> Status: Connected
>>>>> Number of entries: 8
>>>>>
>>>>> ---------
>>>>> It's been in this state for a couple of days now, and bandwidth monitoring shows no appreciable data moving. I've tried repeatedly commanding a full heal from all three nodes in the cluster. It's always the same files that need healing.
>>>>>
>>>>> When running gluster volume heal data-hdd statistics, I sometimes see different information, but always some number of "heal failed" entries. It shows 0 for split brain.
>>>>>
>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1 and 2 still being on the older ovirt/gluster release, but I'm afraid to upgrade and reboot them until I have a good gluster sync (I don't want to create a split brain issue). How do I proceed with this?
>>>>>
>>>>> Second issue: I've been experiencing VERY POOR performance on most of my VMs. To the tune that logging into a Windows 10 VM via remote desktop can take 5 minutes, and launching QuickBooks inside said VM can easily take 10 minutes. On some Linux VMs, I get random messages like this:
>>>>>
>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]
>>>>>
>>>>> (the process and PID are often different)
>>>>>
>>>>> I'm not quite sure what to do about this either. My initial thought was to upgrade everything to current and see if it's still there, but I cannot move forward with that until my gluster is healed...
>>>>>
>>>>> Thanks!
>>>>> --Jim
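Following up on the questions above: for the "volume profiling" Sahina asked for, my reading of the gluster CLI is the profile subcommand (a sketch; data-hdd is the volume from the output above, and I'd let it run through a slow period before pulling the info):

# gluster volume profile data-hdd start
# gluster volume profile data-hdd info    <- capture this after the slowness reproduces
# gluster volume profile data-hdd stop

and for re-kicking the heal and watching whether the queue actually drains:

# gluster volume heal data-hdd full
# gluster volume heal data-hdd statistics heal-count

For the mount logs, I believe the client log on each host lives under /var/log/glusterfs/, named after the mount point (on oVirt hosts that should be something like rhev-data-center-mnt-glusterSD-<server>:<volume>.log, but that filename is my guess). And for the ovs-vsctl "database connection failed" spam on ovirt3, I'm assuming the ovsdb server simply isn't running; checking would be something like:

# systemctl status ovsdb-server openvswitch

(unit names assume the stock openvswitch package).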
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/ACO7RFSLBSRBAIONIC2HQ6Z24ZDES5MF/

