[ovirt-users] Re: System unable to recover after a crash

2019-07-13 Thread Strahil
Hi Carl,

I'd recommend you avoid DNS & DHCP unless your oVirt infra consists of
hundreds of servers.
It is far more reliable to use static IPs + /etc/hosts.
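For example (a sketch - the 10.16.x.x addresses and the engine FQDN below are
just placeholders for your environment), every node and the engine VM would
carry the same entries in /etc/hosts:

# /etc/hosts - identical on all oVirt nodes and on the engine VM
# (placeholder addresses - substitute your real ones)
10.16.0.10   engine.example.com   engine
10.16.0.11   ovhost1
10.16.0.12   ovhost2
10.16.0.13   ovhost3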
As you can 'ssh' to the engine, check the logs - there should be a clue why
it failed.
Most probably it's related to the DNS/IP used.
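For example (assuming the standard oVirt log locations), I'd start with:

# on the engine VM
less /var/log/ovirt-engine/engine.log
# on each host - the HA agent/broker logs show why the engine is seen as down
less /var/log/ovirt-hosted-engine-ha/agent.log
less /var/log/ovirt-hosted-engine-ha/broker.log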
I think the devs can give their opinion on Monday.


Best Regards,
Strahil Nikolov

On Jul 13, 2019 15:08, carl langlois wrote:
>
> Hi,
> Thanks for the info. There has been some progress with the situation. So to
> make the story as short as possible: we are in the process of changing the IP
> address range of all of the oVirt infra from 10.8.X.X to 10.16.X.X. This
> implies a new DHCP server, new switches, etc. For now we went back to our old
> IP address range because we were not able to stabilize the system.
>
> So the last status using our new range of addresses was that gluster was all
> fine and the hosted engine domain was mounting okay. I suspect the DNS table
> was not properly updated, but I am not 100% sure. But when we tried to use
> the new range of addresses everything seemed to be fine, except that the
> hosted-engine always fails the "liveliness check" after coming up. I was not
> able to solve this situation, so I went back to our previous DHCP server.
>
> So I am not sure what is missing for the hosted-engine to use the DHCP
> server. Is there any hardcoded config in the hosted-engine that needs to be
> updated when changing DHCP servers (i.e. new address with the same hostname,
> new gateway)?
>
> More info on the test I did with the new DHCP server: all nodes have name
> resolution working, and I am able to ssh to the hosted-engine.
>
> Any suggestions will be appreciated, as I am out of ideas for now. Do I need
> to redo some sort of setup in the engine to take into account the new address
> range/gateway? There is also LDAP server access configured in the engine for
> username mapping.
> Carl
>
>
>
>
On Sat, Jul 13, 2019 at 6:31 AM Strahil Nikolov wrote:
>>
>> Can you mount the volume manually at another location?
>> Also, have you made any changes to Gluster?
>>
>> Please provide "gluster volume info engine". I have noticed the following
>> in your logs: option 'parallel-readdir' is not recognized
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Friday, July 12, 2019, 22:30:41 GMT+3, carl langlois wrote:
>>
>>
>> Hi ,
>>
>> I am in a state where my system does not recover from a major failure. I
>> have pinpointed the problem: the hosted engine storage domain is not able
>> to mount.
>>
>> I have a glusterfs volume containing the storage domain, but when it
>> attempts to mount glusterfs to /rhev/data-center/mnt/glusterSD/ovhost1:_engine
>> I get:
>>
>> +--+
>> [2019-07-12 19:19:44.063608] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 
>> 0-engine-client-2: changing port to 49153 (from 0)
>> [2019-07-12 19:19:55.033725] I [fuse-bridge.c:4205:fuse_init] 
>> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 
>> 7.22
>> [2019-07-12 19:19:55.033748] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: 
>> switched to graph 0
>> [2019-07-12 19:19:55.033895] I [MSGID: 108006]
>> [afr-common.c:5372:afr_local_init] 0-engine-replicate-0: no subvolumes up


[ovirt-users] Re: System unable to recover after a crash

2019-07-13 Thread carl langlois
Hi,
Thanks for the info. There has been some progress with the situation. So to
make the story as short as possible: we are in the process of changing the IP
address range of all of the oVirt infra from 10.8.X.X to 10.16.X.X. This
implies a new DHCP server, new switches, etc. For now we went back to our old
IP address range because we were not able to stabilize the system.

So the last status using our new range of addresses was that gluster was all
fine and the hosted engine domain was mounting okay. I suspect the DNS table
was not properly updated, but I am not 100% sure. But when we tried to use
the new range of addresses everything seemed to be fine, except that the
hosted-engine always fails the "liveliness check" after coming up. I was not
able to solve this situation, so I went back to our previous DHCP server.

So I am not sure what is missing for the hosted-engine to use the DHCP
server. Is there any hardcoded config in the hosted-engine that needs to be
updated when changing DHCP servers (i.e. new address with the same hostname,
new gateway)?
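(For reference, the one host-side file I suspect - assuming a standard
hosted-engine deployment, and I am not sure this is the right knob - is the
per-host hosted-engine config, which records the gateway the HA agent pings:

# on each host (path assumes a standard deployment)
grep -E '^(gateway|fqdn)=' /etc/ovirt-hosted-engine/hosted-engine.conf

If the old gateway is still in there, that might explain the agent's
behaviour.)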

More info on the test I did with the new DHCP server: all nodes have name
resolution working, and I am able to ssh to the hosted-engine.
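(My understanding is that the "liveliness check" just polls the engine's
health page, so it can be reproduced by hand - URL assuming the standard
health servlet, with a placeholder FQDN:

# from any host that can resolve the engine's name
curl http://engine.example.com/ovirt-engine/services/health

A healthy engine answers with a short status string.)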

Any suggestions will be appreciated, as I am out of ideas for now. Do I need
to redo some sort of setup in the engine to take into account the new address
range/gateway? There is also LDAP server access configured in the engine for
username mapping.
Carl




On Sat, Jul 13, 2019 at 6:31 AM Strahil Nikolov wrote:

> Can you mount the volume manually at another location?
> Also, have you made any changes to Gluster?
>
> Please provide "gluster volume info engine". I have noticed the following
> in your logs: option 'parallel-readdir' is not recognized
>
> Best Regards,
> Strahil Nikolov
>
> On Friday, July 12, 2019, 22:30:41 GMT+3, carl langlois <
> crl.langl...@gmail.com> wrote:
>
>
> Hi ,
>
> I am in a state where my system does not recover from a major failure. I
> have pinpointed the problem: the hosted engine storage domain is not able to
> mount.
>
> I have a glusterfs volume containing the storage domain, but when it attempts
> to mount glusterfs to /rhev/data-center/mnt/glusterSD/ovhost1:_engine I get:
>
>
> +--+
> [2019-07-12 19:19:44.063608] I [rpc-clnt.c:1986:rpc_clnt_reconfig]
> 0-engine-client-2: changing port to 49153 (from 0)
> [2019-07-12 19:19:55.033725] I [fuse-bridge.c:4205:fuse_init]
> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
> 7.22
> [2019-07-12 19:19:55.033748] I [fuse-bridge.c:4835:fuse_graph_sync]
> 0-fuse: switched to graph 0
> [2019-07-12 19:19:55.033895] I [MSGID: 108006]
> [afr-common.c:5372:afr_local_init] 0-engine-replicate-0: no subvolumes up
> [2019-07-12 19:19:55.033938] E [fuse-bridge.c:4271:fuse_first_lookup]
> 0-fuse: first lookup on root failed (Transport endpoint is not connected)
> [2019-07-12 19:19:55.034041] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> 0-fuse: ----0001: failed to resolve (Transport
> endpoint is not connected)
> [2019-07-12 19:19:55.034060] E [fuse-bridge.c:900:fuse_getattr_resume]
> 0-glusterfs-fuse: 2: GETATTR 1 (----0001)
> resolution failed
> [2019-07-12 19:19:55.034095] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> 0-fuse: ----0001: failed to resolve (Transport
> endpoint is not connected)
> [2019-07-12 19:19:55.034102] E [fuse-bridge.c:900:fuse_getattr_resume]
> 0-glusterfs-fuse: 3: GETATTR 1 (----0001)
> resolution failed
> [2019-07-12 19:19:55.035596] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> 0-fuse: ----0001: failed to resolve (Transport
> endpoint is not connected)
> [2019-07-12 19:19:55.035611] E [fuse-bridge.c:900:fuse_getattr_resume]
> 0-glusterfs-fuse: 4: GETATTR 1 (----0001)
> resolution failed
> [2019-07-12 19:19:55.047957] I [fuse-bridge.c:5093:fuse_thread_proc]
> 0-fuse: initating unmount of /rhev/data-center/mnt/glusterSD/ovhost1:_engine
> The message "I [MSGID: 108006] [afr-common.c:5372:afr_local_init]
> 0-engine-replicate-0: no subvolumes up" repeated 3 times between
> [2019-07-12 19:19:55.033895] and [2019-07-12 19:19:55.035588]
> [2019-07-12 19:19:55.048138] W [glusterfsd.c:1375:cleanup_and_exit]
> (-->/lib64/libpthread.so.0(+0x7e25) [0x7f51cecb3e25]
> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5632143bd4b5]
> -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5632143bd32b] ) 0-:
> received signum (15), shutting down
> [2019-07-12 19:19:55.048150] I [fuse-bridge.c:5852:fini] 0-fuse:
> Unmounting '/rhev/data-center/mnt/glusterSD/ovhost1:_engine'.
> [2019-07-12 19:19:55.048155] I [fuse-bridge.c:5857:fini] 0-fuse: Closing
> fuse connection to '/rhev/data-center/mnt/glusterSD/ovhost1:_engine'.
> [2019-07-12 19:19:56.029923] I [MSGID: 100030] [glusterfsd.c:2511:main]

[ovirt-users] Re: System unable to recover after a crash

2019-07-13 Thread Strahil Nikolov
Can you mount the volume manually at another location? Also, have you made
any changes to Gluster?
Please provide "gluster volume info engine". I have noticed the following in
your logs: option 'parallel-readdir' is not recognized
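For example (the mount point is arbitrary - any empty directory will do):

mkdir -p /mnt/enginetest
mount -t glusterfs ovhost1:/engine /mnt/enginetest
gluster volume info engine
gluster volume status engine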
Best Regards,
Strahil Nikolov
On Friday, July 12, 2019, 22:30:41 GMT+3, carl langlois wrote:

Hi,
I am in a state where my system does not recover from a major failure. I have
pinpointed the problem: the hosted engine storage domain is not able to mount.
I have a glusterfs volume containing the storage domain, but when it attempts
to mount glusterfs to /rhev/data-center/mnt/glusterSD/ovhost1:_engine I get:

+--+
[2019-07-12 19:19:44.063608] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 
0-engine-client-2: changing port to 49153 (from 0)
[2019-07-12 19:19:55.033725] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: 
FUSE inited with protocol versions: glusterfs 7.24 kernel 7.22
[2019-07-12 19:19:55.033748] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: 
switched to graph 0
[2019-07-12 19:19:55.033895] I [MSGID: 108006] 
[afr-common.c:5372:afr_local_init] 0-engine-replicate-0: no subvolumes up
[2019-07-12 19:19:55.033938] E [fuse-bridge.c:4271:fuse_first_lookup] 0-fuse: 
first lookup on root failed (Transport endpoint is not connected)
[2019-07-12 19:19:55.034041] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 
0-fuse: ----0001: failed to resolve (Transport 
endpoint is not connected)
[2019-07-12 19:19:55.034060] E [fuse-bridge.c:900:fuse_getattr_resume] 
0-glusterfs-fuse: 2: GETATTR 1 (----0001) 
resolution failed
[2019-07-12 19:19:55.034095] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 
0-fuse: ----0001: failed to resolve (Transport 
endpoint is not connected)
[2019-07-12 19:19:55.034102] E [fuse-bridge.c:900:fuse_getattr_resume] 
0-glusterfs-fuse: 3: GETATTR 1 (----0001) 
resolution failed
[2019-07-12 19:19:55.035596] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk] 
0-fuse: ----0001: failed to resolve (Transport 
endpoint is not connected)
[2019-07-12 19:19:55.035611] E [fuse-bridge.c:900:fuse_getattr_resume] 
0-glusterfs-fuse: 4: GETATTR 1 (----0001) 
resolution failed
[2019-07-12 19:19:55.047957] I [fuse-bridge.c:5093:fuse_thread_proc] 0-fuse: 
initating unmount of /rhev/data-center/mnt/glusterSD/ovhost1:_engine
The message "I [MSGID: 108006] [afr-common.c:5372:afr_local_init] 
0-engine-replicate-0: no subvolumes up" repeated 3 times between [2019-07-12 
19:19:55.033895] and [2019-07-12 19:19:55.035588]
[2019-07-12 19:19:55.048138] W [glusterfsd.c:1375:cleanup_and_exit] 
(-->/lib64/libpthread.so.0(+0x7e25) [0x7f51cecb3e25] 
-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5632143bd4b5] 
-->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5632143bd32b] ) 0-: received 
signum (15), shutting down
[2019-07-12 19:19:55.048150] I [fuse-bridge.c:5852:fini] 0-fuse: Unmounting 
'/rhev/data-center/mnt/glusterSD/ovhost1:_engine'.
[2019-07-12 19:19:55.048155] I [fuse-bridge.c:5857:fini] 0-fuse: Closing fuse 
connection to '/rhev/data-center/mnt/glusterSD/ovhost1:_engine'.
[2019-07-12 19:19:56.029923] I [MSGID: 100030] [glusterfsd.c:2511:main] 
0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.12.11 
(args: /usr/sbin/glusterfs --volfile-server=ovhost1 --volfile-server=ovhost2 
--volfile-server=ovhost3 --volfile-id=/engine 
/rhev/data-center/mnt/glusterSD/ovhost1:_engine)
[2019-07-12 19:19:56.032209] W [MSGID: 101002] [options.c:995:xl_opt_validate] 
0-glusterfs: option 'address-family' is deprecated, preferred is 
'transport.address-family', continuing with correction
[2019-07-12 19:19:56.037510] I [MSGID: 101190] 
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with 
index 1
[2019-07-12 19:19:56.039618] I [MSGID: 101190] 
[event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread with 
index 2
[2019-07-12 19:19:56.039691] W [MSGID: 101174] 
[graph.c:363:_log_if_unknown_option] 0-engine-readdir-ahead: option 
'parallel-readdir' is not recognized
[2019-07-12 19:19:56.039739] I [MSGID: 114020] [client.c:2360:notify] 
0-engine-client-0: parent translators are ready, attempting connect on transport
[2019-07-12 19:19:56.043324] I [MSGID: 114020] [client.c:2360:notify] 
0-engine-client-1: parent translators are ready, attempting connect on transport
[2019-07-12 19:19:56.043481] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 
0-engine-client-0: changing port to 49153 (from 0)
[2019-07-12 19:19:56.048539] I [MSGID: 114020] [client.c:2360:notify] 
0-engine-client-2: parent translators are ready, attempting connect on transport
[2019-07-12 19:19:56.048952] I [rpc-clnt.c:1986:rpc_clnt_reconfig] 
0-engine-client-1: changing port to 49153 (from 0)
Final graph:
without this mount point the ha-agent is not starting.

[ovirt-users] Re: System unable to recover after a crash

2019-07-12 Thread Alex K
On Fri, Jul 12, 2019, 22:30 carl langlois wrote:

> Hi ,
>
> I am in a state where my system does not recover from a major failure. I
> have pinpointed the problem: the hosted engine storage domain is not able to
> mount.
>
> I have a glusterfs volume containing the storage domain, but when it attempts
> to mount glusterfs to /rhev/data-center/mnt/glusterSD/ovhost1:_engine I get:
>
>
> +--+
> [2019-07-12 19:19:44.063608] I [rpc-clnt.c:1986:rpc_clnt_reconfig]
> 0-engine-client-2: changing port to 49153 (from 0)
> [2019-07-12 19:19:55.033725] I [fuse-bridge.c:4205:fuse_init]
> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
> 7.22
> [2019-07-12 19:19:55.033748] I [fuse-bridge.c:4835:fuse_graph_sync]
> 0-fuse: switched to graph 0
> [2019-07-12 19:19:55.033895] I [MSGID: 108006]
> [afr-common.c:5372:afr_local_init] 0-engine-replicate-0: no subvolumes up
> [2019-07-12 19:19:55.033938] E [fuse-bridge.c:4271:fuse_first_lookup]
> 0-fuse: first lookup on root failed (Transport endpoint is not connected)
> [2019-07-12 19:19:55.034041] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> 0-fuse: ----0001: failed to resolve (Transport
> endpoint is not connected)
> [2019-07-12 19:19:55.034060] E [fuse-bridge.c:900:fuse_getattr_resume]
> 0-glusterfs-fuse: 2: GETATTR 1 (----0001)
> resolution failed
> [2019-07-12 19:19:55.034095] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> 0-fuse: ----0001: failed to resolve (Transport
> endpoint is not connected)
> [2019-07-12 19:19:55.034102] E [fuse-bridge.c:900:fuse_getattr_resume]
> 0-glusterfs-fuse: 3: GETATTR 1 (----0001)
> resolution failed
> [2019-07-12 19:19:55.035596] W [fuse-resolve.c:132:fuse_resolve_gfid_cbk]
> 0-fuse: ----0001: failed to resolve (Transport
> endpoint is not connected)
> [2019-07-12 19:19:55.035611] E [fuse-bridge.c:900:fuse_getattr_resume]
> 0-glusterfs-fuse: 4: GETATTR 1 (----0001)
> resolution failed
> [2019-07-12 19:19:55.047957] I [fuse-bridge.c:5093:fuse_thread_proc]
> 0-fuse: initating unmount of /rhev/data-center/mnt/glusterSD/ovhost1:_engine
> The message "I [MSGID: 108006] [afr-common.c:5372:afr_local_init]
> 0-engine-replicate-0: no subvolumes up" repeated 3 times between
> [2019-07-12 19:19:55.033895] and [2019-07-12 19:19:55.035588]
> [2019-07-12 19:19:55.048138] W [glusterfsd.c:1375:cleanup_and_exit]
> (-->/lib64/libpthread.so.0(+0x7e25) [0x7f51cecb3e25]
> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x5632143bd4b5]
> -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x5632143bd32b] ) 0-:
> received signum (15), shutting down
> [2019-07-12 19:19:55.048150] I [fuse-bridge.c:5852:fini] 0-fuse:
> Unmounting '/rhev/data-center/mnt/glusterSD/ovhost1:_engine'.
> [2019-07-12 19:19:55.048155] I [fuse-bridge.c:5857:fini] 0-fuse: Closing
> fuse connection to '/rhev/data-center/mnt/glusterSD/ovhost1:_engine'.
> [2019-07-12 19:19:56.029923] I [MSGID: 100030] [glusterfsd.c:2511:main]
> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.12.11
> (args: /usr/sbin/glusterfs --volfile-server=ovhost1
> --volfile-server=ovhost2 --volfile-server=ovhost3 --volfile-id=/engine
> /rhev/data-center/mnt/glusterSD/ovhost1:_engine)
> [2019-07-12 19:19:56.032209] W [MSGID: 101002]
> [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is
> deprecated, preferred is 'transport.address-family', continuing with
> correction
> [2019-07-12 19:19:56.037510] I [MSGID: 101190]
> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index 1
> [2019-07-12 19:19:56.039618] I [MSGID: 101190]
> [event-epoll.c:613:event_dispatch_epoll_worker] 0-epoll: Started thread
> with index 2
> [2019-07-12 19:19:56.039691] W [MSGID: 101174]
> [graph.c:363:_log_if_unknown_option] 0-engine-readdir-ahead: option
> 'parallel-readdir' is not recognized
> [2019-07-12 19:19:56.039739] I [MSGID: 114020] [client.c:2360:notify]
> 0-engine-client-0: parent translators are ready, attempting connect on
> transport
> [2019-07-12 19:19:56.043324] I [MSGID: 114020] [client.c:2360:notify]
> 0-engine-client-1: parent translators are ready, attempting connect on
> transport
> [2019-07-12 19:19:56.043481] I [rpc-clnt.c:1986:rpc_clnt_reconfig]
> 0-engine-client-0: changing port to 49153 (from 0)
> [2019-07-12 19:19:56.048539] I [MSGID: 114020] [client.c:2360:notify]
> 0-engine-client-2: parent translators are ready, attempting connect on
> transport
> [2019-07-12 19:19:56.048952] I [rpc-clnt.c:1986:rpc_clnt_reconfig]
> 0-engine-client-1: changing port to 49153 (from 0)
> Final graph:
>
> without this mount point the ha-agent is not starting.
>
> the volume seems to be okay:
>
>
> Gluster process                             TCP Port  RDMA Port  Online  Pid
> -----------------------------------------------------------------------------