[ovirt-users] Re: OVirt Gluster Fail

2019-03-26 Thread Andrea Milan
  

Hi Sahina, Strahil

thank you for the information, I managed to start the heal and restore both
the hosted engine and the VMs.

These are the logs on all nodes:

[2019-03-26 08:30:58.462329] I [MSGID: 104045] [glfs-master.c:91:notify] 0-gfapi: New graph 676c6e6f-6465-3032-2e61-736370642e6c (0) coming up
[2019-03-26 08:30:58.462364] I [MSGID: 114020] [client.c:2356:notify] 0-asc-client-0: parent translators are ready, attempting connect on transport
[2019-03-26 08:30:58.464374] I [MSGID: 114020] [client.c:2356:notify] 0-asc-client-1: parent translators are ready, attempting connect on transport
[2019-03-26 08:30:58.464898] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-asc-client-0: changing port to 49438 (from 0)
[2019-03-26 08:30:58.466148] I [MSGID: 114020] [client.c:2356:notify] 0-asc-client-3: parent translators are ready, attempting connect on transport
[2019-03-26 08:30:58.468028] E [socket.c:2309:socket_connect_finish] 0-asc-client-0: connection to 192.170.254.3:49438 failed (Nessun instradamento per l'host) [Italian: No route to host]
[2019-03-26 08:30:58.468054] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-asc-client-1: changing port to 49441 (from 0)
[2019-03-26 08:30:58.470040] I [rpc-clnt.c:1965:rpc_clnt_reconfig] 0-asc-client-3: changing port to 49421 (from 0)
[2019-03-26 08:30:58.471345] I [MSGID: 114057] [client-handshake.c:1440:select_server_supported_programs] 0-asc-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2019-03-26 08:30:58.472642] I [MSGID: 114046] [client-handshake.c:1216:client_setvolume_cbk] 0-asc-client-1: Connected to asc-client-1, attached to remote volume '/bricks/asc/brick'.
[2019-03-26 08:30:58.472659] I [MSGID: 114047] [client-handshake.c:1227:client_setvolume_cbk] 0-asc-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2019-03-26 08:30:58.472714] I [MSGID: 108005] [afr-common.c:4387:afr_notify] 0-asc-replicate-0: Subvolume 'asc-client-1' came back up; going online.
[2019-03-26 08:30:58.472731] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-asc-client-1: Server lk version = 1
[2019-03-26 08:30:58.473112] E [socket.c:2309:socket_connect_finish] 0-asc-client-3: connection to 192.170.254.6:49421 failed (Nessun instradamento per l'host) [Italian: No route to host]
[2019-03-26 08:30:58.473152] W [MSGID: 108001] [afr-common.c:4467:afr_notify] 0-asc-replicate-0: Client-quorum is not met
[2019-03-26 08:30:58.477699] I [MSGID: 108031] [afr-common.c:2157:afr_local_discovery_cbk] 0-asc-replicate-0: selecting local read_child asc-client-1
[2019-03-26 08:30:58.477804] I [MSGID: 104041] [glfs-resolve.c:885:__glfs_active_subvol] 0-asc: switched to graph 676c6e6f-6465-3032-2e61-736370642e6c (0)

I analyzed the individual nodes and realized that the firewalld service had
been stopped on all of them.

Once firewalld was re-enabled, the heal started automatically, and "gluster
volume heal VOLNAME info" immediately showed the connections as correct.

The recovery of the individual bricks started immediately. When that
finished, the hosted engine was detected correctly and started.
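
A minimal sketch of that recovery step, run on each node (VOLNAME is a
placeholder; assumes firewalld is managed by systemd, as on these nodes):

systemctl enable --now firewalld   # start it and keep it enabled across reboots
gluster volume heal VOLNAME info   # every brick should now show Status: Connected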

I wanted to tell you about the sequence that led me to the block:

1) Node03 put into maintenance via hosted-engine.

2) Maintenance performed and the node restarted.

3) Node03 set back to active.

4) Automatic heal monitored with the oVirt Manager.

5) Heal completed correctly.

6) Node02 put into maintenance.

7) During the shutdown of Node02 some VMs went into Pause, the oVirt Manager
signaled that Node01 was blocked, and immediately the hosted engine stopped.

8) After restarting Node02, I saw that gluster had the peers, but there was
no healing between the nodes.

I had to shut everything down, and the situation that presented itself was
the one described in my previous emails.

Questions:

- Why did putting Node02 into maintenance block Node01?

- Why did restarting the system not restart the firewalld service? Is it
also managed by vdsm?

- What is the correct way to back up virtual machines to an external
machine? We use oVirt 4.1.

- Can a backup be used outside of oVirt, e.g. with standard qemu-kvm?

Thanks for all.
Best regards
Andrea Milan

On 25.03.2019 11:53 Sahina Bose wrote:

> You will first need to restore connectivity between the gluster peers
> for heal to work. So restart glusterd on all hosts as Strahil
> mentioned, and check if "gluster peer status" returns the other nodes
> as connected. If not, please check the glusterd log to see what's
> causing the issue. Share the logs, along with the version info, if you
> need us to look at them.
>
> On Sun, Mar 24, 2019 at 1:08 AM Strahil wrote:
>
>> Hi Andrea, [...]

[ovirt-users] Re: OVirt Gluster Fail

2019-03-26 Thread Strahil
Hi Andrea,

My guess is that while node2 was in maintenance, node3's brick(s) died, or
there were some pending heals.
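
A quick way to rule out pending heals before moving the next node to
maintenance (a sketch, assuming the volume names above):

for i in $(gluster volume list); do gluster volume heal $i info summary; done
# safe to proceed only when every brick is Connected with 0 pending entries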

For backup, you can use anything that works for KVM, but the hard part is to
get the configuration of each VM. If the VM is running, you can use 'virsh
dumpxml domain' to get the configuration of the running VM, but this won't
work for VMs that are off.
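
A minimal sketch of that approach (the /backup path is hypothetical; 'virsh
-r' uses libvirt's read-only connection, which avoids the SASL credentials
oVirt configures for libvirt):

mkdir -p /backup/vm-configs
for vm in $(virsh -r list --name); do
    virsh -r dumpxml "$vm" > /backup/vm-configs/"$vm".xml
done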

As for why firewalld was not running - my guess is a rare bug that is hard
to reproduce.

Best Regards,
Strahil Nikolov

On Mar 26, 2019 17:10, Andrea Milan wrote:
>
> Hi Sahina, Strahil
>
> thank you for the information, I managed to start the heal and restore
> both the hosted engine and the VMs.
> [...]

[ovirt-users] Re: OVirt Gluster Fail

2019-03-25 Thread Sahina Bose
You will first need to restore connectivity between the gluster peers
for heal to work. So restart glusterd on all hosts as Strahil
mentioned, and check if "gluster peer status" returns the other nodes
as connected. If not, please check the glusterd log to see what's
causing the issue. Share the logs, along with the version info, if you
need us to look at them.
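
A minimal sketch of that check, to run on each host (log path as on a
standard RPM-based gluster install):

systemctl restart glusterd
gluster peer status                     # others should show: Peer in Cluster (Connected)
less /var/log/glusterfs/glusterd.log    # check here if a peer stays disconnected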


On Sun, Mar 24, 2019 at 1:08 AM Strahil wrote:
>
> Hi Andrea,
>
> The cluster volumes might have sharding enabled, and thus files larger
> than the shard size can be recovered only via the gluster volume.
> [...]


[ovirt-users] Re: OVirt Gluster Fail

2019-03-23 Thread Strahil
Hi Andrea,

The cluster volumes might have sharding enabled, and thus files larger than
the shard size can be recovered only via the gluster volume (not directly
from the bricks).
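
A quick way to confirm whether sharding is on and what the shard size is
(VOLNAME is a placeholder; 'gluster volume get' needs GlusterFS 3.8 or
later):

gluster volume get VOLNAME features.shard
gluster volume get VOLNAME features.shard-block-size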

You can try to restart gluster on all nodes and force heal:

1. Kill gluster processes:
systemctl stop glusterd
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh

2. Start gluster:
systemctl start glusterd

3. Force heal:
for i in $(gluster volume list); do gluster volume heal $i full; done
sleep 300
for i in $(gluster volume list); do gluster volume heal $i info summary; done
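
If the heals run long, a sketch for watching progress (heal-count should
drop to 0 on every brick as they catch up):

watch -n 60 'for i in $(gluster volume list); do gluster volume heal $i statistics heal-count; done'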

Best Regards,
Strahil Nikolov

On Mar 23, 2019 13:51, commram...@tiscali.it wrote:
>
> During maintenance of a machine the hosted engine crashed.
> At that point there was no longer any way to manage anything.
> [...]


[ovirt-users] Re: OVirt Gluster Fail

2019-03-23 Thread commramius
During maintenance of a machine the hosted engine crashed.
At that point there was no longer any way to manage anything.

The VMs went into pause and were no longer manageable.
I restarted the machine, but at one point none of the bricks were reachable
anymore.

Now I am in a situation where the hosted engine no longer loads.

Gluster sees the peers as connected and the services running for the various
bricks, but it fails to heal. The messages that I find for each machine are
the following:

# gluster volume heal engine info
Brick 192.170.254.3:/bricks/engine/brick
.
.
.
Status: Connected
Number of entries: 190

Brick 192.170.254.4:/bricks/engine/brick
Status: Il socket di destinazione non è connesso [Italian: the destination socket is not connected]
Number of entries: -

Brick 192.170.254.6:/bricks/engine/brick
Status: Il socket di destinazione non è connesso [Italian: the destination socket is not connected]
Number of entries: -

This is the case for all the bricks (some have no heal to do because the
machines inside were turned off).

In practice, every brick sees only localhost as connected.

How can I restore the machines?
Is there a way to read the data from the physical machines and export it so
that it can be reused?
Unfortunately we need to access that data.

Can someone help me?

Thanks, Andrea