At the moment, it is responding like I would expect. I do know I have one failed drive on one brick (hardware failure; the OS removed the drive completely, so the underlying /dev/sdb is gone). I have a new disk on order (overnight), but that is also only one brick of one volume that is replica 3, so I would hope the system can survive a complete failure like that and stay operational.
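For my own notes, my understanding is that once the replacement disk arrives, bringing the brick back should look roughly like this -- a sketch only, not tested: the volume name, brick path, and hostname here are my guesses based on the profile output elsewhere in this thread, and reset-brick assumes a reasonably recent gluster (3.9+):

```shell
# Sketch: rebuilding the failed brick after the new disk is installed.
# All names below are assumptions taken from this thread, not verified.
NEW_DISK=/dev/sdb
BRICK_HOST=ovirt3.nwfiber.com      # host that lost the drive (assumed)
BRICK=/gluster/brick3/data-hdd     # brick path (from the profile output)
VOL=data-hdd                       # affected volume (assumed)

mkfs.xfs -f "$NEW_DISK"            # or recreate the same LVM stack as before
mount "$NEW_DISK" /gluster/brick3

# reset-brick reuses the same brick path on the replacement disk:
gluster volume reset-brick "$VOL" "$BRICK_HOST:$BRICK" start
gluster volume reset-brick "$VOL" "$BRICK_HOST:$BRICK" "$BRICK_HOST:$BRICK" commit force

gluster volume heal "$VOL" full    # kick off self-heal onto the new brick
gluster volume heal "$VOL" info    # watch until pending entries drain to zero
```

If that's right, the lingering entries in heal info should drain away once self-heal finishes copying onto the new brick.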
Since having the gluster-volume-starting problems, I have performed a test on the engine volume: writing and removing a file and verifying it happened from all three hosts; that worked. The engine volume has all of its bricks, as do two other volumes; only one volume is shy one brick. --Jim On Tue, May 29, 2018 at 11:41 PM, Johan Bernhardsson <[email protected]> wrote: > Is storage working as it should? Does the gluster mount point respond as > it should? Can you write files to it? Do the physical drives say that > they are ok? Can you write (you shouldn't bypass the gluster mount point, but > you need to test the drives) to the physical drives? > > To me this sounds like broken or almost-broken hardware or broken > underlying filesystems. > > If one of the drives malfunctions and times out, gluster will be slow and > time out. It runs writes in sync, so the slowest node will slow down the whole > system. > > /Johan > > > On May 30, 2018 08:29:46 Jim Kusznir <[email protected]> wrote: > >> hosted-engine --deploy failed (it would not come up on my existing gluster >> storage). However, I realized no changes were written to my existing >> storage. So, I went back to trying to get my old engine running. >> >> hosted-engine --vm-status is now taking a very long time (5+ minutes) to >> return, and it returns stale information everywhere. I thought perhaps the >> lockspace was corrupt, so I tried to clean that and the metadata, but both are >> failing (--clean-metadata has hung and I can't even ctrl-c out of it). >> >> How can I reinitialize all the lockspace/metadata safely? There is no >> engine or VMs running currently.... >> >> --Jim >> >> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir <[email protected]> wrote: >> >>> Well, things went from bad to very, very bad.... >>> >>> It appears that during one of the 2-minute lockups, the fencing agents >>> decided that another node in the cluster was down.
As a result, 2 of the 3 >>> nodes were simultaneously reset with a fencing-agent reboot. After the nodes >>> came back up, the engine would not start. All running VMs (including VMs >>> on the 3rd node, which was not rebooted) crashed. >>> >>> I've now been working for about 3 hours trying to get the engine to come >>> up. I don't know why it won't start. hosted-engine --vm-start says it's >>> starting, but it doesn't start (virsh doesn't show any VMs running). I'm >>> currently running --deploy, as I had run out of other options I >>> could come up with. I hope this will allow me to re-import all my existing >>> VMs and start them back up after everything comes back up. >>> >>> I do have an unverified geo-rep backup; I don't know if it is a good >>> backup (there were several prior messages to this list, but I didn't get >>> replies to my questions. It was behaving in what I believe to be a "strange" way, >>> and the data directories are larger than their source). >>> >>> I'll see if my --deploy works, and if not, I'll be back with another >>> message/help request. >>> >>> When the dust settles and I'm at least minimally functional again, I >>> really want to understand why all these technologies designed to offer >>> redundancy conspired to reduce uptime and create failures where there >>> weren't any otherwise. I thought that with hosted engine, 3 oVirt servers, and >>> glusterfs with a minimum of replica 2+arbiter or replica 3, I should have had >>> strong resilience against server failure or disk failure, and should have >>> prevented / recovered from data corruption. Instead, all of the above >>> happened (once I get my cluster back up, I still have to try to recover my >>> webserver VM, which won't boot due to XFS corrupt-journal issues created >>> during the gluster crashes). I think a lot of these issues were rooted >>> in the upgrade from 4.1 to 4.2.
>>> >>> --Jim >>> >>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir <[email protected]> >>> wrote: >>> >>>> I also finally found the following in my system log on one server: >>>> >>>> [10679.524491] INFO: task glusterclogro:14933 blocked for more than 120 >>>> seconds. >>>> [10679.525826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [10679.527144] glusterclogro D ffff97209832bf40 0 14933 1 >>>> 0x00000080 >>>> [10679.527150] Call Trace: >>>> [10679.527161] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [10679.527218] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>> [xfs] >>>> [10679.527225] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [10679.527254] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>> [10679.527260] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>> [10679.527268] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [10679.527271] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>> [10679.527275] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [10679.527279] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more than >>>> 120 seconds. >>>> [10679.528608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [10679.529956] glusterposixfsy D ffff972495f84f10 0 14941 1 >>>> 0x00000080 >>>> [10679.529961] Call Trace: >>>> [10679.529966] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [10679.530003] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>> [xfs] >>>> [10679.530008] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [10679.530038] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>> [10679.530042] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>> [10679.530046] [<ffffffffb992076f>] ? 
system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [10679.530050] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>> [10679.530054] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [10679.530058] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more than >>>> 120 seconds. >>>> [10679.531805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000 0 15486 1 >>>> 0x00000080 >>>> [10679.533738] Call Trace: >>>> [10679.533747] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [10679.533799] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>> [xfs] >>>> [10679.533806] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [10679.533846] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>> [10679.533852] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>> [10679.533858] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [10679.533863] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>> [10679.533868] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [10679.533873] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [10919.512757] INFO: task glusterclogro:14933 blocked for more than 120 >>>> seconds. >>>> [10919.514714] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [10919.516663] glusterclogro D ffff97209832bf40 0 14933 1 >>>> 0x00000080 >>>> [10919.516677] Call Trace: >>>> [10919.516690] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [10919.516696] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>> [10919.516703] [<ffffffffb951cc04>] ? blk_finish_plug+0x14/0x40 >>>> [10919.516768] [<ffffffffc05e9224>] ? _xfs_buf_ioapply+0x334/0x460 >>>> [xfs] >>>> [10919.516774] [<ffffffffb991432d>] wait_for_completion+0xfd/0x140 >>>> [10919.516782] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [10919.516821] [<ffffffffc05eb0a3>] ? 
_xfs_buf_read+0x23/0x40 [xfs] >>>> [10919.516859] [<ffffffffc05eafa9>] xfs_buf_submit_wait+0xf9/0x1d0 >>>> [xfs] >>>> [10919.516902] [<ffffffffc061b279>] ? xfs_trans_read_buf_map+0x199/0x400 >>>> [xfs] >>>> [10919.516940] [<ffffffffc05eb0a3>] _xfs_buf_read+0x23/0x40 [xfs] >>>> [10919.516977] [<ffffffffc05eb1b9>] xfs_buf_read_map+0xf9/0x160 [xfs] >>>> [10919.517022] [<ffffffffc061b279>] xfs_trans_read_buf_map+0x199/0x400 >>>> [xfs] >>>> [10919.517057] [<ffffffffc05c8d04>] xfs_da_read_buf+0xd4/0x100 [xfs] >>>> [10919.517091] [<ffffffffc05c8d53>] xfs_da3_node_read+0x23/0xd0 [xfs] >>>> [10919.517126] [<ffffffffc05c9fee>] xfs_da3_node_lookup_int+0x6e/0x2f0 >>>> [xfs] >>>> [10919.517160] [<ffffffffc05d5a1d>] xfs_dir2_node_lookup+0x4d/0x170 >>>> [xfs] >>>> [10919.517194] [<ffffffffc05ccf5d>] xfs_dir_lookup+0x1bd/0x1e0 [xfs] >>>> [10919.517233] [<ffffffffc05fd8d9>] xfs_lookup+0x69/0x140 [xfs] >>>> [10919.517271] [<ffffffffc05fa018>] xfs_vn_lookup+0x78/0xc0 [xfs] >>>> [10919.517278] [<ffffffffb9425cf3>] lookup_real+0x23/0x60 >>>> [10919.517283] [<ffffffffb9426702>] __lookup_hash+0x42/0x60 >>>> [10919.517288] [<ffffffffb942d519>] SYSC_renameat2+0x3a9/0x5a0 >>>> [10919.517296] [<ffffffffb94d3753>] ? selinux_file_free_security+0x2 >>>> 3/0x30 >>>> [10919.517304] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [10919.517309] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [10919.517313] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [10919.517318] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [10919.517323] [<ffffffffb942e58e>] SyS_renameat2+0xe/0x10 >>>> [10919.517328] [<ffffffffb942e5ce>] SyS_rename+0x1e/0x20 >>>> [10919.517333] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [10919.517339] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [11159.496095] INFO: task glusteriotwr9:15482 blocked for more than 120 >>>> seconds. 
>>>> [11159.497546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0 0 15482 1 >>>> 0x00000080 >>>> [11159.498984] Call Trace: >>>> [11159.498995] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50 >>>> [11159.498999] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [11159.499003] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>> [11159.499056] [<ffffffffc05dd9b7>] ? xfs_iext_bno_to_ext+0xa7/0x1a0 >>>> [xfs] >>>> [11159.499082] [<ffffffffc05dd43e>] ? xfs_iext_bno_to_irec+0x8e/0xd0 >>>> [xfs] >>>> [11159.499090] [<ffffffffb92f7a12>] ? ktime_get_ts64+0x52/0xf0 >>>> [11159.499093] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50 >>>> [11159.499097] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130 >>>> [11159.499101] [<ffffffffb9913528>] io_schedule+0x18/0x20 >>>> [11159.499104] [<ffffffffb9911f11>] bit_wait_io+0x11/0x50 >>>> [11159.499107] [<ffffffffb9911ac1>] __wait_on_bit_lock+0x61/0xc0 >>>> [11159.499113] [<ffffffffb9393634>] __lock_page+0x74/0x90 >>>> [11159.499118] [<ffffffffb92bc210>] ? wake_bit_function+0x40/0x40 >>>> [11159.499121] [<ffffffffb9394154>] __find_lock_page+0x54/0x70 >>>> [11159.499125] [<ffffffffb9394e85>] grab_cache_page_write_begin+0x >>>> 55/0xc0 >>>> [11159.499130] [<ffffffffb9484b76>] iomap_write_begin+0x66/0x100 >>>> [11159.499135] [<ffffffffb9484edf>] iomap_write_actor+0xcf/0x1d0 >>>> [11159.499140] [<ffffffffb9484e10>] ? iomap_write_end+0x80/0x80 >>>> [11159.499144] [<ffffffffb94854e7>] iomap_apply+0xb7/0x150 >>>> [11159.499149] [<ffffffffb9485621>] iomap_file_buffered_write+0xa1 >>>> /0xe0 >>>> [11159.499153] [<ffffffffb9484e10>] ? 
iomap_write_end+0x80/0x80 >>>> [11159.499182] [<ffffffffc05f025d>] >>>> xfs_file_buffered_aio_write+0x12d/0x2c0 >>>> [xfs] >>>> [11159.499213] [<ffffffffc05f057d>] xfs_file_aio_write+0x18d/0x1b0 >>>> [xfs] >>>> [11159.499217] [<ffffffffb941a533>] do_sync_write+0x93/0xe0 >>>> [11159.499222] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0 >>>> [11159.499225] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0 >>>> [11159.499230] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [11159.499234] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [11159.499238] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [11279.488720] INFO: task xfsaild/dm-10:1134 blocked for more than 120 >>>> seconds. >>>> [11279.490197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [11279.491665] xfsaild/dm-10 D ffff9720a8660fd0 0 1134 2 >>>> 0x00000000 >>>> [11279.491671] Call Trace: >>>> [11279.491682] [<ffffffffb92a3a2e>] ? try_to_del_timer_sync+0x5e/0x90 >>>> [11279.491688] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [11279.491744] [<ffffffffc060de36>] _xfs_log_force+0x1c6/0x2c0 [xfs] >>>> [11279.491750] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [11279.491783] [<ffffffffc0619fec>] ? xfsaild+0x16c/0x6f0 [xfs] >>>> [11279.491817] [<ffffffffc060df5c>] xfs_log_force+0x2c/0x70 [xfs] >>>> [11279.491849] [<ffffffffc0619e80>] ? xfs_trans_ail_cursor_first+0x90/0x90 >>>> [xfs] >>>> [11279.491880] [<ffffffffc0619fec>] xfsaild+0x16c/0x6f0 [xfs] >>>> [11279.491913] [<ffffffffc0619e80>] ? xfs_trans_ail_cursor_first+0x90/0x90 >>>> [xfs] >>>> [11279.491919] [<ffffffffb92bb161>] kthread+0xd1/0xe0 >>>> [11279.491926] [<ffffffffb92bb090>] ? insert_kthread_work+0x40/0x40 >>>> [11279.491932] [<ffffffffb9920677>] ret_from_fork_nospec_begin+0x2 >>>> 1/0x21 >>>> [11279.491936] [<ffffffffb92bb090>] ? insert_kthread_work+0x40/0x40 >>>> [11279.491976] INFO: task glusterclogfsyn:14934 blocked for more than >>>> 120 seconds. 
>>>> [11279.493466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [11279.494952] glusterclogfsyn D ffff97209832af70 0 14934 1 >>>> 0x00000080 >>>> [11279.494957] Call Trace: >>>> [11279.494979] [<ffffffffc0309839>] ? __split_and_process_bio+0x2e9/0x520 >>>> [dm_mod] >>>> [11279.494983] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [11279.494987] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>> [11279.494997] [<ffffffffc0309d98>] ? dm_make_request+0x128/0x1a0 >>>> [dm_mod] >>>> [11279.495001] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130 >>>> [11279.495005] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140 >>>> [11279.495010] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [11279.495016] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110 >>>> [11279.495049] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 >>>> [xfs] >>>> [11279.495079] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs] >>>> [11279.495086] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>> [11279.495090] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [11279.495094] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>> [11279.495098] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [11279.495102] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for more than >>>> 120 seconds. >>>> [11279.496606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [11279.498114] glusterposixfsy D ffff972495f84f10 0 14941 1 >>>> 0x00000080 >>>> [11279.498118] Call Trace: >>>> [11279.498134] [<ffffffffc0309839>] ? __split_and_process_bio+0x2e9/0x520 >>>> [dm_mod] >>>> [11279.498138] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [11279.498142] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>> [11279.498152] [<ffffffffc0309d98>] ? 
dm_make_request+0x128/0x1a0 >>>> [dm_mod] >>>> [11279.498156] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130 >>>> [11279.498160] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140 >>>> [11279.498165] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [11279.498169] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110 >>>> [11279.498202] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 >>>> [xfs] >>>> [11279.498231] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs] >>>> [11279.498238] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>> [11279.498242] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [11279.498246] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>> [11279.498250] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [11279.498254] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for more than 120 >>>> seconds. >>>> [11279.499789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [11279.501343] glusteriotwr1 D ffff97208b6daf70 0 14950 1 >>>> 0x00000080 >>>> [11279.501348] Call Trace: >>>> [11279.501353] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [11279.501390] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>> [xfs] >>>> [11279.501396] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [11279.501428] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>> [11279.501432] [<ffffffffb944ef3f>] generic_write_sync+0x4f/0x70 >>>> [11279.501461] [<ffffffffc05f0545>] xfs_file_aio_write+0x155/0x1b0 >>>> [xfs] >>>> [11279.501466] [<ffffffffb941a533>] do_sync_write+0x93/0xe0 >>>> [11279.501471] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0 >>>> [11279.501475] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0 >>>> [11279.501479] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [11279.501483] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [11279.501489] [<ffffffffb992077b>] ? 
system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for more than 120 >>>> seconds. >>>> [11279.503047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>> disables this message. >>>> [11279.504630] glusteriotwr4 D ffff972499f2bf40 0 14953 1 >>>> 0x00000080 >>>> [11279.504635] Call Trace: >>>> [11279.504640] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>> [11279.504676] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>> [xfs] >>>> [11279.504681] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>> [11279.504710] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>> [11279.504714] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>> [11279.504718] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>> 0x160 >>>> [11279.504722] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>> [11279.504725] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>> [11279.504730] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>> 0x160 >>>> [12127.466494] perf: interrupt took too long (8263 > 8150), lowering >>>> kernel.perf_event_max_sample_rate to 24000 >>>> >>>> -------------------- >>>> I think this is the cause of the massive ovirt performance issues >>>> irrespective of gluster volume. At the time this happened, I was also >>>> ssh'ed into the host, and was doing some rpm querry commands. I had just >>>> run rpm -qa |grep glusterfs (to verify what version was actually >>>> installed), and that command took almost 2 minutes to return! Normally it >>>> takes less than 2 seconds. That is all pure local SSD IO, too.... >>>> >>>> I'm no expert, but its my understanding that anytime a software causes >>>> these kinds of issues, its a serious bug in the software, even if its >>>> mis-handled exceptions. Is this correct? 
>>>> >>>> --Jim >>>> >>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir <[email protected]> >>>> wrote: >>>> >>>>> I think this is the profile information for one of the volumes that >>>>> lives on the SSDs and is fully operational with no down/problem disks: >>>>> >>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data info >>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data >>>>> ---------------------------------------------- >>>>> Cumulative Stats: >>>>> Block Size: 256b+ 512b+ >>>>> 1024b+ >>>>> No. of Reads: 983 2696 >>>>> 1059 >>>>> No. of Writes: 0 1113 >>>>> 302 >>>>> >>>>> Block Size: 2048b+ 4096b+ >>>>> 8192b+ >>>>> No. of Reads: 852 88608 >>>>> 53526 >>>>> No. of Writes: 522 812340 >>>>> 76257 >>>>> >>>>> Block Size: 16384b+ 32768b+ >>>>> 65536b+ >>>>> No. of Reads: 54351 241901 >>>>> 15024 >>>>> No. of Writes: 21636 8656 >>>>> 8976 >>>>> >>>>> Block Size: 131072b+ >>>>> No. of Reads: 524156 >>>>> No. of Writes: 296071 >>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls >>>>> Fop >>>>> --------- ----------- ----------- ----------- ------------ >>>>> ---- >>>>> 0.00 0.00 us 0.00 us 0.00 us 4189 >>>>> RELEASE >>>>> 0.00 0.00 us 0.00 us 0.00 us 1257 >>>>> RELEASEDIR >>>>> 0.00 46.19 us 12.00 us 187.00 us 69 >>>>> FLUSH >>>>> 0.00 147.00 us 78.00 us 367.00 us 86 >>>>> REMOVEXATTR >>>>> 0.00 223.46 us 24.00 us 1166.00 us 149 >>>>> READDIR >>>>> 0.00 565.34 us 76.00 us 3639.00 us 88 >>>>> FTRUNCATE >>>>> 0.00 263.28 us 20.00 us 28385.00 us 228 >>>>> LK >>>>> 0.00 98.84 us 2.00 us 880.00 us 1198 >>>>> OPENDIR >>>>> 0.00 91.59 us 26.00 us 10371.00 us 3853 >>>>> STATFS >>>>> 0.00 494.14 us 17.00 us 193439.00 us 1171 >>>>> GETXATTR >>>>> 0.00 299.42 us 35.00 us 9799.00 us 2044 >>>>> READDIRP >>>>> 0.00 1965.31 us 110.00 us 382258.00 us 321 >>>>> XATTROP >>>>> 0.01 113.40 us 24.00 us 61061.00 us 8134 >>>>> STAT >>>>> 0.01 755.38 us 57.00 us 607603.00 us 3196 >>>>> DISCARD >>>>> 0.05 2690.09 us 58.00 us 2704761.00 us 3206 >>>>> OPEN >>>>> 0.10 119978.25 us 97.00 us 9406684.00 us 154 >>>>> SETATTR >>>>> 0.18 101.73 us 28.00 us 700477.00 us 313379 >>>>> FSTAT >>>>> 0.23 1059.84 us 25.00 us 2716124.00 us 38255 >>>>> LOOKUP >>>>> 0.47 1024.11 us 54.00 us 6197164.00 us 81455 >>>>> FXATTROP >>>>> 1.72 2984.00 us 15.00 us 37098954.00 us 103020 >>>>> FINODELK >>>>> 5.92 44315.32 us 51.00 us 24731536.00 us 23957 >>>>> FSYNC >>>>> 13.27 2399.78 us 25.00 us 22089540.00 us 991005 >>>>> READ >>>>> 37.00 5980.43 us 52.00 us 22099889.00 us 1108976 >>>>> WRITE >>>>> 41.04 5452.75 us 13.00 us 22102452.00 us 1349053 >>>>> INODELK >>>>> >>>>> Duration: 10026 seconds >>>>> Data Read: 80046027759 bytes >>>>> Data Written: 44496632320 bytes >>>>> >>>>> Interval 1 Stats: >>>>> Block Size: 256b+ 512b+ >>>>> 1024b+ >>>>> No. of Reads: 983 2696 >>>>> 1059 >>>>> No. of Writes: 0 838 >>>>> 185 >>>>> >>>>> Block Size: 2048b+ 4096b+ >>>>> 8192b+ >>>>> No. of Reads: 852 85856 >>>>> 51575 >>>>> No. 
of Writes: 382 705802 >>>>> 57812 >>>>> >>>>> Block Size: 16384b+ 32768b+ >>>>> 65536b+ >>>>> No. of Reads: 52673 232093 >>>>> 14984 >>>>> No. of Writes: 13499 4908 >>>>> 4242 >>>>> >>>>> Block Size: 131072b+ >>>>> No. of Reads: 460040 >>>>> No. of Writes: 6411 >>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls >>>>> Fop >>>>> --------- ----------- ----------- ----------- ------------ >>>>> ---- >>>>> 0.00 0.00 us 0.00 us 0.00 us 2093 >>>>> RELEASE >>>>> 0.00 0.00 us 0.00 us 0.00 us 1093 >>>>> RELEASEDIR >>>>> 0.00 53.38 us 26.00 us 111.00 us 16 >>>>> FLUSH >>>>> 0.00 145.14 us 78.00 us 367.00 us 71 >>>>> REMOVEXATTR >>>>> 0.00 190.96 us 114.00 us 298.00 us 71 >>>>> SETATTR >>>>> 0.00 213.38 us 24.00 us 1145.00 us 90 >>>>> READDIR >>>>> 0.00 263.28 us 20.00 us 28385.00 us 228 >>>>> LK >>>>> 0.00 101.76 us 2.00 us 880.00 us 1093 >>>>> OPENDIR >>>>> 0.01 93.60 us 27.00 us 10371.00 us 3090 >>>>> STATFS >>>>> 0.02 537.47 us 17.00 us 193439.00 us 1038 >>>>> GETXATTR >>>>> 0.03 297.44 us 35.00 us 9799.00 us 1990 >>>>> READDIRP >>>>> 0.03 2357.28 us 110.00 us 382258.00 us 253 >>>>> XATTROP >>>>> 0.04 385.93 us 58.00 us 47593.00 us 2091 >>>>> OPEN >>>>> 0.04 114.86 us 24.00 us 61061.00 us 7715 >>>>> STAT >>>>> 0.06 444.59 us 57.00 us 333240.00 us 3053 >>>>> DISCARD >>>>> 0.42 316.24 us 25.00 us 290728.00 us 29823 >>>>> LOOKUP >>>>> 0.73 257.92 us 54.00 us 344812.00 us 63296 >>>>> FXATTROP >>>>> 1.37 98.30 us 28.00 us 67621.00 us 313172 >>>>> FSTAT >>>>> 1.58 2124.69 us 51.00 us 849200.00 us 16717 >>>>> FSYNC >>>>> 5.73 162.46 us 52.00 us 748492.00 us 794079 >>>>> WRITE >>>>> 7.19 2065.17 us 16.00 us 37098954.00 us 78381 >>>>> FINODELK >>>>> 36.44 886.32 us 25.00 us 2216436.00 us 925421 >>>>> READ >>>>> 46.30 1178.04 us 13.00 us 1700704.00 us 884635 >>>>> INODELK >>>>> >>>>> Duration: 7485 seconds >>>>> Data Read: 71250527215 bytes >>>>> Data Written: 5119903744 bytes >>>>> >>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data >>>>> 
---------------------------------------------- >>>>> Cumulative Stats: >>>>> Block Size: 1b+ >>>>> No. of Reads: 0 >>>>> No. of Writes: 3264419 >>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls >>>>> Fop >>>>> --------- ----------- ----------- ----------- ------------ >>>>> ---- >>>>> 0.00 0.00 us 0.00 us 0.00 us 90 >>>>> FORGET >>>>> 0.00 0.00 us 0.00 us 0.00 us 9462 >>>>> RELEASE >>>>> 0.00 0.00 us 0.00 us 0.00 us 4254 >>>>> RELEASEDIR >>>>> 0.00 50.52 us 13.00 us 190.00 us 71 >>>>> FLUSH >>>>> 0.00 186.97 us 87.00 us 713.00 us 86 >>>>> REMOVEXATTR >>>>> 0.00 79.32 us 33.00 us 189.00 us 228 >>>>> LK >>>>> 0.00 220.98 us 129.00 us 513.00 us 86 >>>>> SETATTR >>>>> 0.01 259.30 us 26.00 us 2632.00 us 137 >>>>> READDIR >>>>> 0.02 322.76 us 145.00 us 2125.00 us 321 >>>>> XATTROP >>>>> 0.03 109.55 us 2.00 us 1258.00 us 1193 >>>>> OPENDIR >>>>> 0.05 70.21 us 21.00 us 431.00 us 3196 >>>>> DISCARD >>>>> 0.05 169.26 us 21.00 us 2315.00 us 1545 >>>>> GETXATTR >>>>> 0.12 176.85 us 63.00 us 2844.00 us 3206 >>>>> OPEN >>>>> 0.61 303.49 us 90.00 us 3085.00 us 9633 >>>>> FSTAT >>>>> 2.44 305.66 us 28.00 us 3716.00 us 38230 >>>>> LOOKUP >>>>> 4.52 266.22 us 55.00 us 53424.00 us 81455 >>>>> FXATTROP >>>>> 6.96 1397.99 us 51.00 us 64822.00 us 23889 >>>>> FSYNC >>>>> 16.48 84.74 us 25.00 us 6917.00 us 932592 >>>>> WRITE >>>>> 30.16 106.90 us 13.00 us 3920189.00 us 1353046 >>>>> INODELK >>>>> 38.55 1794.52 us 14.00 us 16210553.00 us 103039 >>>>> FINODELK >>>>> >>>>> Duration: 66562 seconds >>>>> Data Read: 0 bytes >>>>> Data Written: 3264419 bytes >>>>> >>>>> Interval 1 Stats: >>>>> Block Size: 1b+ >>>>> No. of Reads: 0 >>>>> No. of Writes: 794080 >>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls >>>>> Fop >>>>> --------- ----------- ----------- ----------- ------------ >>>>> ---- >>>>> 0.00 0.00 us 0.00 us 0.00 us 2093 >>>>> RELEASE >>>>> 0.00 0.00 us 0.00 us 0.00 us 1093 >>>>> RELEASEDIR >>>>> 0.00 70.31 us 26.00 us 125.00 us 16 >>>>> FLUSH >>>>> 0.00 193.10 us 103.00 us 713.00 us 71 >>>>> REMOVEXATTR >>>>> 0.01 227.32 us 133.00 us 513.00 us 71 >>>>> SETATTR >>>>> 0.01 79.32 us 33.00 us 189.00 us 228 >>>>> LK >>>>> 0.01 259.83 us 35.00 us 1138.00 us 89 >>>>> READDIR >>>>> 0.03 318.26 us 145.00 us 2047.00 us 253 >>>>> XATTROP >>>>> 0.04 112.67 us 3.00 us 1258.00 us 1093 >>>>> OPENDIR >>>>> 0.06 167.98 us 23.00 us 1951.00 us 1014 >>>>> GETXATTR >>>>> 0.08 70.97 us 22.00 us 431.00 us 3053 >>>>> DISCARD >>>>> 0.13 183.78 us 66.00 us 2844.00 us 2091 >>>>> OPEN >>>>> 1.01 303.82 us 90.00 us 3085.00 us 9610 >>>>> FSTAT >>>>> 3.27 316.59 us 30.00 us 3716.00 us 29820 >>>>> LOOKUP >>>>> 5.83 265.79 us 59.00 us 53424.00 us 63296 >>>>> FXATTROP >>>>> 7.95 1373.89 us 51.00 us 64822.00 us 16717 >>>>> FSYNC >>>>> 23.17 851.99 us 14.00 us 16210553.00 us 78555 >>>>> FINODELK >>>>> 24.04 87.44 us 27.00 us 6917.00 us 794081 >>>>> WRITE >>>>> 34.36 111.91 us 14.00 us 984871.00 us 886790 >>>>> INODELK >>>>> >>>>> Duration: 7485 seconds >>>>> Data Read: 0 bytes >>>>> Data Written: 794080 bytes >>>>> >>>>> >>>>> ----------------------- >>>>> Here is the data from the volume that is backed by the SHDDs and has >>>>> one failed disk: >>>>> [root@ovirt2 yum.repos.d]# gluster volume profile data-hdd info >>>>> Brick: 172.172.1.12:/gluster/brick3/data-hdd >>>>> -------------------------------------------- >>>>> Cumulative Stats: >>>>> Block Size: 256b+ 512b+ >>>>> 1024b+ >>>>> No. of Reads: 1702 86 >>>>> 16 >>>>> No. of Writes: 0 767 >>>>> 71 >>>>> >>>>> Block Size: 2048b+ 4096b+ >>>>> 8192b+ >>>>> No. of Reads: 19 51841 >>>>> 2049 >>>>> No. of Writes: 76 60668 >>>>> 35727 >>>>> >>>>> Block Size: 16384b+ 32768b+ >>>>> 65536b+ >>>>> No. 
of Reads: 1744 639 >>>>> 1088 >>>>> No. of Writes: 8524 2410 >>>>> 1285 >>>>> >>>>> Block Size: 131072b+ >>>>> No. of Reads: 771999 >>>>> No. of Writes: 29584 >>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls >>>>> Fop >>>>> --------- ----------- ----------- ----------- ------------ >>>>> ---- >>>>> 0.00 0.00 us 0.00 us 0.00 us 2902 >>>>> RELEASE >>>>> 0.00 0.00 us 0.00 us 0.00 us 1517 >>>>> RELEASEDIR >>>>> 0.00 197.00 us 197.00 us 197.00 us 1 >>>>> FTRUNCATE >>>>> 0.00 70.24 us 16.00 us 758.00 us 51 >>>>> FLUSH >>>>> 0.00 143.93 us 82.00 us 305.00 us 57 >>>>> REMOVEXATTR >>>>> 0.00 178.63 us 105.00 us 712.00 us 60 >>>>> SETATTR >>>>> 0.00 67.30 us 19.00 us 572.00 us 555 >>>>> LK >>>>> 0.00 322.80 us 23.00 us 4673.00 us 138 >>>>> READDIR >>>>> 0.00 336.56 us 106.00 us 11994.00 us 237 >>>>> XATTROP >>>>> 0.00 84.70 us 28.00 us 1071.00 us 3469 >>>>> STATFS >>>>> 0.01 387.75 us 2.00 us 146017.00 us 1467 >>>>> OPENDIR >>>>> 0.01 148.59 us 21.00 us 64374.00 us 4454 >>>>> STAT >>>>> 0.02 783.02 us 16.00 us 93502.00 us 1902 >>>>> GETXATTR >>>>> 0.03 1516.10 us 17.00 us 210690.00 us 1364 >>>>> ENTRYLK >>>>> 0.03 2555.47 us 300.00 us 674454.00 us 1064 >>>>> READDIRP >>>>> 0.07 85.74 us 19.00 us 68340.00 us 62849 >>>>> FSTAT >>>>> 0.07 1978.12 us 59.00 us 202596.00 us 2729 >>>>> OPEN >>>>> 0.22 708.57 us 15.00 us 394799.00 us 25447 >>>>> LOOKUP >>>>> 5.94 2331.74 us 15.00 us 1099530.00 us 207534 >>>>> FINODELK >>>>> 7.31 8311.75 us 58.00 us 1800216.00 us 71668 >>>>> FXATTROP >>>>> 12.49 7735.19 us 51.00 us 3595513.00 us 131642 >>>>> WRITE >>>>> 17.70 957.08 us 16.00 us 13700466.00 us 1508160 >>>>> INODELK >>>>> 24.55 2546.43 us 26.00 us 5077347.00 us 786060 >>>>> READ >>>>> 31.56 49699.15 us 47.00 us 3746331.00 us 51777 >>>>> FSYNC >>>>> >>>>> Duration: 10101 seconds >>>>> Data Read: 101562897361 bytes >>>>> Data Written: 4834450432 bytes >>>>> >>>>> Interval 0 Stats: >>>>> Block Size: 256b+ 512b+ >>>>> 1024b+ >>>>> No. of Reads: 1702 86 >>>>> 16 >>>>> No. 
of Writes: 0 767 >>>>> 71 >>>>> >>>>> Block Size: 2048b+ 4096b+ >>>>> 8192b+ >>>>> No. of Reads: 19 51841 >>>>> 2049 >>>>> No. of Writes: 76 60668 >>>>> 35727 >>>>> >>>>> Block Size: 16384b+ 32768b+ >>>>> 65536b+ >>>>> No. of Reads: 1744 639 >>>>> 1088 >>>>> No. of Writes: 8524 2410 >>>>> 1285 >>>>> >>>>> Block Size: 131072b+ >>>>> No. of Reads: 771999 >>>>> No. of Writes: 29584 >>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls >>>>> Fop >>>>> --------- ----------- ----------- ----------- ------------ >>>>> ---- >>>>> 0.00 0.00 us 0.00 us 0.00 us 2902 >>>>> RELEASE >>>>> 0.00 0.00 us 0.00 us 0.00 us 1517 >>>>> RELEASEDIR >>>>> 0.00 197.00 us 197.00 us 197.00 us 1 >>>>> FTRUNCATE >>>>> 0.00 70.24 us 16.00 us 758.00 us 51 >>>>> FLUSH >>>>> 0.00 143.93 us 82.00 us 305.00 us 57 >>>>> REMOVEXATTR >>>>> 0.00 178.63 us 105.00 us 712.00 us 60 >>>>> SETATTR >>>>> 0.00 67.30 us 19.00 us 572.00 us 555 >>>>> LK >>>>> 0.00 322.80 us 23.00 us 4673.00 us 138 >>>>> READDIR >>>>> 0.00 336.56 us 106.00 us 11994.00 us 237 >>>>> XATTROP >>>>> 0.00 84.70 us 28.00 us 1071.00 us 3469 >>>>> STATFS >>>>> 0.01 387.75 us 2.00 us 146017.00 us 1467 >>>>> OPENDIR >>>>> 0.01 148.59 us 21.00 us 64374.00 us 4454 >>>>> STAT >>>>> 0.02 783.02 us 16.00 us 93502.00 us 1902 >>>>> GETXATTR >>>>> 0.03 1516.10 us 17.00 us 210690.00 us 1364 >>>>> ENTRYLK >>>>> 0.03 2555.47 us 300.00 us 674454.00 us 1064 >>>>> READDIRP >>>>> 0.07 85.73 us 19.00 us 68340.00 us 62849 >>>>> FSTAT >>>>> 0.07 1978.12 us 59.00 us 202596.00 us 2729 >>>>> OPEN >>>>> 0.22 708.57 us 15.00 us 394799.00 us 25447 >>>>> LOOKUP >>>>> 5.94 2334.57 us 15.00 us 1099530.00 us 207534 >>>>> FINODELK >>>>> 7.31 8311.49 us 58.00 us 1800216.00 us 71668 >>>>> FXATTROP >>>>> 12.49 7735.32 us 51.00 us 3595513.00 us 131642 >>>>> WRITE >>>>> 17.71 957.08 us 16.00 us 13700466.00 us 1508160 >>>>> INODELK >>>>> 24.56 2546.42 us 26.00 us 5077347.00 us 786060 >>>>> READ >>>>> 31.54 49651.63 us 47.00 us 3746331.00 us 51777 >>>>> FSYNC >>>>> 
>>>>> Duration: 10101 seconds >>>>> Data Read: 101562897361 bytes >>>>> Data Written: 4834450432 bytes >>>>> >>>>> On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir <[email protected]> >>>>> wrote: >>>>> >>>>>> Thank you for your response. >>>>>> >>>>>> I have 4 gluster volumes. 3 are replica 2 + arbitrator. The replica >>>>>> bricks are on ovirt1 and ovirt2, with the arbitrator on ovirt3. The 4th volume is >>>>>> replica 3, with a brick on all three ovirt machines. >>>>>> >>>>>> The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD >>>>>> (same in all three machines). On ovirt3, the SSHD has reported hard IO >>>>>> failures, and that brick is offline. However, the other two replicas are >>>>>> fully operational (although they still show contents in the heal info >>>>>> command that won't go away, but that may be the case until I replace the >>>>>> failed disk). >>>>>> >>>>>> What is bothering me is that ALL 4 gluster volumes are showing >>>>>> horrible performance issues. At this point, as the bad disk has been >>>>>> completely offlined, I would expect gluster to perform at normal speed, >>>>>> but >>>>>> that is definitely not the case. >>>>>> >>>>>> I've also noticed that the performance hits seem to come in waves: >>>>>> things seem to work acceptably (but slowly) for a while; then suddenly, it's >>>>>> as if all disk IO on all volumes (including non-gluster local OS disk >>>>>> volumes for the hosts) pauses for about 30 seconds, then IO resumes again. >>>>>> During those times, I start getting "VM not responding" and "host not >>>>>> responding" notices, as well as the applications having major issues. >>>>>> >>>>>> I've shut down most of my VMs and am down to just my essential core >>>>>> VMs (I shed about 75% of my VMs). I am still experiencing the same >>>>>> issues. >>>>>> >>>>>> Am I correct in believing that once the failed disk was brought >>>>>> offline, performance should return to normal?
>>>>>>
>>>>>> On Tue, May 29, 2018 at 1:27 PM, Alex K <[email protected]> wrote:
>>>>>>
>>>>>>> I would check the disks' status and the accessibility of the mount points where your gluster volumes reside.
>>>>>>>
>>>>>>> On Tue, May 29, 2018, 22:28 Jim Kusznir <[email protected]> wrote:
>>>>>>>
>>>>>>>> On one ovirt server, I'm now seeing these messages:
>>>>>>>> [56474.239725] blk_update_request: 63 callbacks suppressed
>>>>>>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
>>>>>>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
>>>>>>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
>>>>>>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
>>>>>>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
>>>>>>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>>>>>>>>
>>>>>>>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>>>>>>>>
>>>>>>>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database connection failed (No such file or directory)
>>>>>>>>> (appears a lot).
>>>>>>>>>
>>>>>>>>> I also found, in the ssh session on that machine, some sysv warnings about the backing disk for one of the gluster volumes (straight replica 3). The glusterfs process for that disk on that machine went offline. It's my understanding that it should continue to work with the other two machines while I attempt to replace that disk, right? Attempted writes (touching an empty file) can take 15 seconds; repeating it later is much faster.
>>>>>>>>>
>>>>>>>>> Gluster generates a bunch of different log files; I don't know which ones you want, or from which machine(s).
>>>>>>>>>
>>>>>>>>> How do I do "volume profiling"?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Do you see errors reported in the mount logs for the volume? If so, could you attach the logs?
>>>>>>>>>> Any issues with your underlying disks? Can you also attach the output of volume profiling?
>>>>>>>>>>
>>>>>>>>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, things have gotten MUCH worse this morning. I'm getting random errors from VMs; right now, about a third of my VMs have been paused due to storage issues, and most of the remaining VMs are not performing well.
>>>>>>>>>>>
>>>>>>>>>>> At this point, I am in full EMERGENCY mode, as my production services are now impacted, and I'm getting calls coming in with problems...
>>>>>>>>>>>
>>>>>>>>>>> I'd greatly appreciate help... VMs are running VERY slowly (when they run), and they are steadily getting worse. I don't know why.
>>>>>>>>>>> I was seeing CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a time (while the VM became unresponsive, and any Linux VMs I was logged into were giving me the CPU-stuck messages from my original post). Is all this storage related?
>>>>>>>>>>>
>>>>>>>>>>> I also have two different gluster volumes for VM storage, and only one had the issues, but now VMs in both are being affected at the same time and in the same way.
>>>>>>>>>>>
>>>>>>>>>>> --Jim
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> [Adding gluster-users to look at the heal issue]
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've been having some cluster and gluster performance issues lately. I also found that my cluster was out of date and was trying to apply updates (hoping to fix some of these), and discovered the ovirt 4.1 repos were taken completely offline. So, I was forced to begin an upgrade to 4.2. According to the docs I found/read, I needed only to add the new repo, do a yum update, and reboot to be good on my hosts (did the yum update on the hosts, engine-setup on my hosted engine). Things seemed to work relatively well, except for a gluster sync issue that showed up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the hosted engine first, then engine 3. When engine 3 came back up, for some reason one of my gluster volumes would not sync.
>>>>>>>>>>>>> Here's sample output:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [root@ovirt3 ~]# gluster volume heal data-hdd info
>>>>>>>>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>>>>>>>>> Status: Connected
>>>>>>>>>>>>> Number of entries: 8
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------
>>>>>>>>>>>>> It's been in this state for a couple of days now, and bandwidth monitoring shows no appreciable data moving. I've repeatedly commanded a full heal from all three nodes in the cluster. It's always the same files that need healing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When running gluster volume heal data-hdd statistics, I sometimes see different information, but always some number of "heal failed" entries. It shows 0 for split-brain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1 and 2 still being on the older ovirt/gluster release, but I'm afraid to upgrade and reboot them until I have a good gluster sync (don't need to create a split-brain issue). How do I proceed with this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Second issue: I've been experiencing VERY POOR performance on most of my VMs, to the point that logging into a Windows 10 VM via remote desktop can take 5 minutes, and launching QuickBooks inside said VM can easily take 10 minutes. On some Linux VMs, I get random messages like this:
>>>>>>>>>>>>> Message from syslogd@unifi at May 28 20:39:23 ...
>>>>>>>>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mongod:14766]
>>>>>>>>>>>>>
>>>>>>>>>>>>> (the process and PID are often different)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not quite sure what to do about this either. My initial thought was to upgrade everything to current and see if it's still there, but I cannot move forward with that until my gluster is healed...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>> --Jim
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> Users mailing list -- [email protected]
>>>>>>>>>>>>> To unsubscribe send an email to [email protected]
>>>>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
>>>>>>>>>>>>> List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/[email protected]/message/OSKM4MFRYYLRRN2QOIF4FVXPX2UK3JU7/
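On Jim's question in the thread above ("How do I do volume profiling?"): a hedged sketch of the standard gluster CLI sequence (profile start / info / stop). The GLUSTER variable defaulting to "echo" is an addition of this sketch so it can be dry-run anywhere; set GLUSTER=gluster on an actual server in the trusted pool.

```shell
#!/bin/sh
# Sketch of gluster volume profiling: start collection, let the volume run
# under real load for a while, dump per-fop stats, then stop. GLUSTER
# defaults to "echo" so this prints the commands instead of running them.
GLUSTER=${GLUSTER:-echo}

profile_volume() {
  vol=$1
  $GLUSTER volume profile "$vol" start   # begin collecting per-fop latency stats
  # ...wait a few minutes of real workload here before dumping stats...
  $GLUSTER volume profile "$vol" info    # cumulative + interval stats, like the dump earlier in this thread
  $GLUSTER volume profile "$vol" stop    # stop collecting when done
}

profile_volume data-hdd
```

The "info" output is what produced the %-latency table quoted earlier in the thread; the fops with the highest %-latency (FSYNC, READ, INODELK there) are the ones to investigate first.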

