Any errors in vdsm.log or gluster mount log for this volume?
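A minimal sketch of what I mean, assuming the default oVirt/Gluster log locations; the sample log line below is fabricated for the demo, and the grep patterns are a guess at what to look for, not known matches:

```shell
# Typical log locations on an oVirt host (adjust to your setup):
#   /var/log/vdsm/vdsm.log
#   /var/log/glusterfs/rhev-data-center-mnt-glusterSD-<server>:_<volume>.log
#   /var/log/sanlock.log
# Self-contained demo: grep an inline fabricated sample instead of the real files.
cat > /tmp/vdsm.sample.log <<'EOF'
2019-02-26 17:04:36 ERROR fabricated sample: Failed to acquire lock: No space left on device
2019-02-26 17:04:37 INFO  fabricated sample: operation succeeded
EOF
grep -iE 'sanlock|lease|lock|enospc|no space' /tmp/vdsm.sample.log
```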

On Wed, Feb 27, 2019 at 1:07 PM Mike Lykov <[email protected]> wrote:
>
>
> Hi all. I have an HCI setup: glusterfs 3.12, oVirt 4.2.7, 4 nodes.
>
> Yesterday the engine detected 3 VMs as "not responding" (they are marked as 
> HA VMs; all of them were located on the ovirtnode1 server).
> Two of them were restarted by the engine on other nodes successfully, but one 
> was not. All of them got a LOCALIZED message like 'cannot set lock: no free 
> space on device'. What is this? Why does the engine get these errors, and why 
> do some VMs restart automatically while others do not (though the failed one 
> was successfully restarted by a user via the web UI after some pause)?
> What do you think? Full engine logs can be uploaded if needed.
>
> from engine.log: start event
> ---------------
> 2019-02-26 17:04:05,308+04 INFO  
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] VM 
> 'd546add1-126a-4490-bc83-469bab659854'(openfire.miac) moved from 'Up' --> 
> 'NotResponding'
> 2019-02-26 17:04:05,865+04 WARN  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] EVENT_ID: 
> VM_NOT_RESPONDING(126), VM openfire.miac is not responding.
> 2019-02-26 17:04:05,865+04 INFO  
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] VM 
> '7a3af2e7-8296-4fe0-ac55-c52a4b1de93f'(e-l-k.miac) moved from 'Up' --> 
> 'NotResponding'
> 2019-02-26 17:04:05,894+04 WARN  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] EVENT_ID: 
> VM_NOT_RESPONDING(126), VM e-l-k.miac is not responding.
> 2019-02-26 17:04:05,895+04 INFO  
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] VM 
> 'de76aa6c-a211-41de-8d85-7d2821c3980d'(tsgr-mon) moved from 'Up' --> 
> 'NotResponding'
> 2019-02-26 17:04:05,926+04 WARN  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-59) [] EVENT_ID: 
> VM_NOT_RESPONDING(126), VM tsgr-mon is not responding.
> ---------------------------
> 2019-02-26 17:04:22,237+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_DOWN_ERROR(119), VM openfire.miac 
> is down with error. Exit message: VM has been terminated on the host.
> ....
> 2019-02-26 17:04:22,374+04 INFO  [org.ovirt.engine.core.bll.VdsEventListener] 
> (ForkJoinPool-1-worker-9) [] Highly Available VM went down. Attempting to 
> restart. VM Name 'openfire.miac', VM Id 'd546add1-126a-4490-bc83-469bab659854'
> ...
> 2019-02-26 17:04:27,737+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-5) [] EVENT_ID: 
> VM_DOWN_ERROR(119), VM openfire.miac is down with error. Exit message: 
> resource busy: Failed to acquire lock: Lease is held by another host.
> ...
> 2019-02-26 17:04:28,350+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engine-Thread-2886073) [] EVENT_ID: 
> VDS_INITIATED_RUN_VM(506), Trying to restart VM openfire.miac on Host 
> ovirtnode6.miac
> 2019-02-26 17:04:31,841+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (ForkJoinPool-1-worker-2) [] EVENT_ID: VM_DOWN_ERROR(119), VM openfire.miac 
> is down with error. Exit message: resource busy: Failed to acquire lock: 
> Lease is held by another host.
> 2019-02-26 17:04:31,877+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engine-Thread-2886082) [] EVENT_ID: 
> VDS_INITIATED_RUN_VM_FAILED(507), Failed to restart VM openfire.miac on Host 
> ovirtnode6.miac
> ...
> 2019-02-26 17:04:31,994+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engine-Thread-2886082) [] EVENT_ID: 
> VDS_INITIATED_RUN_VM(506), Trying to restart VM openfire.miac on Host 
> ovirtnode1.miac
> .....
> 2019-02-26 17:04:36,054+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_DOWN_ERROR(119), VM openfire.miac 
> is down with error. Exit message: Не удалось установить блокировку: На 
> устройстве не осталось свободного места. [Russian: "Failed to acquire the 
> lock: No space left on device."]
> 2019-02-26 17:04:36,054+04 INFO  
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (ForkJoinPool-1-worker-9) [] add VM 
> 'd546add1-126a-4490-bc83-469bab659854'(openfire.miac) to rerun treatment
> 2019-02-26 17:04:36,091+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engine-Thread-2886083) [] EVENT_ID: 
> VDS_INITIATED_RUN_VM_FAILED(507), Failed to restart VM openfire.miac on Host 
> ovirtnode1.miac
> -------------------
> No more attempts were made for this VM (its state is now 'Down').
>
> The engine tried to restart this VM on some other nodes, got 'Lease is held 
> by another host' (is that normal because the lock timeout had not yet 
> expired?), and then got (a LOCALIZED MESSAGE?? Why is this one localized 
> while all the others are not?) 'cannot set lock: no free space on device'.
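A likely explanation for the mixed languages, offered as an assumption rather than something verified against these hosts: the English messages come from the engine's own message catalog, while 'cannot set lock: no free space on device' looks like the operating system's strerror(ENOSPC) text, which sanlock/libvirt pass through in the host process's locale, here Russian. A quick illustration:

```shell
# ENOSPC (errno 28) is the error behind "No space left on device"; the C
# library supplies the text, and programs that set their locale (as libvirt
# does) get it translated, hence the Russian wording from the host.
python3 -c 'import errno, os; print(errno.ENOSPC, os.strerror(errno.ENOSPC))'
# prints: 28 No space left on device
```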
>
> Which 'device' does it mean? How can I see how much 'free space' it has? The 
> gluster volumes are almost empty:
> ovirtstor1.miac:/engine       150G  7,5G  143G   5% /rhev/data-center/mnt/glusterSD/ovirtstor1.miac:_engine
> ovirtnode1.miac:/data         5,9T   42G  5,9T   1% /rhev/data-center/mnt/glusterSD/ovirtnode1.miac:_data
> ovirtnode1.miac:/vmstore      4,4T  194G  4,2T   5% /rhev/data-center/mnt/glusterSD/ovirtnode1.miac:_vmstore
> ovirtnode1.miac:/iso_storage  600G  1,3G  599G   1% /rhev/data-center/mnt/glusterSD/ovirtnode1.miac:_iso__storage
> ovirtnode1.miac:/export       5,1T   14G  5,1T   1% /rhev/data-center/mnt/glusterSD/ovirtnode1.miac:_export
>
> Other VMs also got this error about free space:
> -------------
> 2019-02-26 17:05:05,142+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (ForkJoinPool-1-worker-11) [] EVENT_ID: VM_DOWN_ERROR(119), VM tsgr-mon is 
> down with error. Exit message: Не удалось установить блокировку: На 
> устройстве не осталось свободного места. [Russian: "Failed to acquire the 
> lock: No space left on device."]
>
> 2019-02-26 17:05:43,526+04 INFO  
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (ForkJoinPool-1-worker-11) [] VM 
> 'de76aa6c-a211-41de-8d85-7d2821c3980d'(tsgr-mon) moved from 'PoweringUp' --> 
> 'Up'
> 2019-02-26 17:05:43,537+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (ForkJoinPool-1-worker-11) [] EVENT_ID: VDS_INITIATED_RUN_VM(506), Trying to 
> restart VM tsgr-mon on Host ovirtnode6.miac
> ----------------
> That is the last message about this VM; its state is now 'Up'.
>
> After that, all VMs were migrated off this node, ovirtnode1, and later the 
> log has records about its unresponsiveness:
>
> 2019-02-26 17:07:43,584+04 ERROR 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (DefaultQuartzScheduler9) [35e05907] EVENT_ID: 
> VDS_BROKER_COMMAND_FAILURE(10,802), VDSM ovirtnode1.miac command 
> GetGlusterVolumeHealInfoVDS failed: Message timeout which can be caused by 
> communication issues
> 2019-02-26 17:07:43,592+04 WARN  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engine-Thread-2886194) [35e05907] EVENT_ID: 
> VDS_HOST_NOT_RESPONDING_CONNECTING(9,008), Host ovirtnode1.miac is not 
> responding. It will stay in Connecting state for a grace period of 60 
> seconds and after that an attempt to fence the host will be issued.
> 2019-02-26 17:07:49,684+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-99) [4511cab1] EVENT_ID: 
> FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power 
> management status on Host ovirtnode1.miac using Proxy Host ovirtnode5.miac 
> and Fence Agent ipmilan:172.16.16.1.
> 2019-02-26 17:07:49,685+04 INFO  
> [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-99) [4511cab1] START, 
> FenceVdsVDSCommand(HostName = ovirtnode5.miac,  ... action='STATUS', 
> agent='FenceAgent:{id='338cbdd9-aab6-484c-a906-21e747a96a84', ... 
> type='ipmilan', ip='172.16.16.1', port='null', user='admin', password='***', 
> encryptOptions='false', options=''}', policy='null'}), log id: 
> 281e6c
> 2019-02-26 17:07:49,942+04 INFO  
> [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-99) [4511cab1] FINISH, 
> FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', 
> powerStatus='ON', message=''}, log id: 281e6c
>
> 2019-02-26 17:07:56,131+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-99) [317316e4] EVENT_ID: 
> SYNC_STORAGE_DEVICES_IN_HOST(4,126), Manually synced the storage devices from 
> host ovirtnode1.miac
> 2019-02-26 17:07:56,550+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-99) [3a88be4f] EVENT_ID: 
> VDS_DETECTED(13), Status of host ovirtnode1.miac was set to Up.
> 2019-02-26 17:07:56,554+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engineScheduled-Thread-99) [3a88be4f] EVENT_ID: 
> VDS_FENCE_STATUS(496), Host ovirtnode1.miac power management was verified 
> successfully.
> -------------
>
> Finally, I have now tried to start the VM that was not restarted automatically:
> -------------
> 2019-02-27 11:23:30,139+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (EE-ManagedThreadFactory-engine-Thread-2921212) 
> [8e405cb9-3a3a-42ab-ae05-5aaf40194b30] EVENT_ID: USER_STARTED_VM(153), VM 
> openfire.miac was started by admin@internal-authz (Host: ovirtnode1.miac).
> .....
> 2019-02-27 11:24:12,489+04 INFO  
> [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] 
> (ForkJoinPool-1-worker-1) [] VM 
> 'd546add1-126a-4490-bc83-469bab659854'(openfire.miac) moved from 'PoweringUp' 
> --> 'Up'
> 2019-02-27 11:24:12,502+04 INFO  
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
> (ForkJoinPool-1-worker-1) [] EVENT_ID: USER_RUN_VM(32), VM openfire.miac 
> started on Host ovirtnode1.miac
> ---------------
>
> It started without problems on the same host, ovirtnode1. Strange.
>
>
> --
> Mike
> _______________________________________________
> Users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct: 
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives: 
> https://lists.ovirt.org/archives/list/[email protected]/message/XHC5MDVYLIHKZR4YBCGULV6OLGTB25LF/
_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: 
https://www.ovirt.org/community/about/community-guidelines/
List Archives: 
https://lists.ovirt.org/archives/list/[email protected]/message/PMLDHBXSEVKUSHFHYYGSQVFMJSN5W3HK/
