Re: [ovirt-users] ovirt with glusterfs - big test - unwanted results

2016-04-11 Thread Sahina Bose
So I looked at the vdsm logs, and since there were multiple tests done it 
was difficult to isolate which error to track down. You mentioned a test 
between 14:00-14:30 CET - but the gluster logs that were attached ended 
at 11:29 UTC.


Tracking down the errors for the period when the master domain (gluster 
volume 1HP12-R3A1P1) went inactive and the corresponding gluster volume 
log was available - they all seem to point to the gluster volume's client 
quorum not being met.


Can you confirm whether this was for the test performed - or provide logs 
from the correct time period (both vdsm and gluster mount logs are required, 
from the hypervisors where the master domain is mounted)?
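
A note on timestamps: the gluster logs are in UTC while the vdsm logs use the 
hosts' local time (CEST, i.e. UTC+2, at the end of March 2016), which is why 
13:21 in vdsm.log lines up with 11:21 in the gluster mount log below. A minimal 
sketch for shifting a vdsm timestamp before comparing the two logs - the fixed 
two-hour offset is an assumption, adjust it to the hosts' actual timezone:

    from datetime import datetime, timedelta

    # vdsm logs use host local time (assumed CEST = UTC+2 here),
    # gluster logs use UTC; shift vdsm timestamps before comparing.
    LOCAL_OFFSET = timedelta(hours=2)

    def vdsm_to_utc(ts):
        """Convert a vdsm timestamp like '2016-03-31 13:21:27,225' to UTC."""
        local = datetime.strptime(ts.split(',')[0], '%Y-%m-%d %H:%M:%S')
        return local - LOCAL_OFFSET

    print(vdsm_to_utc('2016-03-31 13:21:27,225'))   # -> 2016-03-31 11:21:27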


For master domain:
On 1hp1:
vdsm.log
Thread-35::ERROR::2016-03-31 13:21:27,225::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
...
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 454, in statvfs
    resdict = self._sendCommand("statvfs", {"path": path}, self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 427, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 107] Transport endpoint is not connected
Thread-35::INFO::2016-03-31 13:21:27,267::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID


-- And I see a corresponding entry in the gluster mount log:
[2016-03-31 11:21:16.027090] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met
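
The OSError above comes from the domain monitor's statvfs() call on the 
domain's mount point failing once the FUSE mount loses its bricks/quorum. 
A quick way to reproduce the same check by hand on a hypervisor - the mount 
path below is a guess based on the usual glusterSD layout, adjust it to the 
real mount of the master domain:

    import errno
    import os

    # Assumed mount point of the master domain's gluster volume; the real path
    # is under /rhev/data-center/mnt/glusterSD/ on the hypervisor.
    MOUNT = '/rhev/data-center/mnt/glusterSD/1hp1:1HP12-R3A1P1'

    try:
        st = os.statvfs(MOUNT)                  # same syscall ioprocess issues
        print('domain reachable, %d free blocks' % st.f_bavail)
    except OSError as e:
        if e.errno == errno.ENOTCONN:           # Errno 107, as in the vdsm log
            print('transport endpoint not connected - mount lost its bricks/quorum')
        else:
            raise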

jsonrpc.Executor/0::DEBUG::2016-03-31 
13:23:34,110::__init__::533::jsonrpc.JsonRpcServer::(_serveRequest) 
Return 'GlusterVolume.status' in bridge with {'volumeStatus': {'bricks': 
[{'status': 'OFFLINE', 'hostuuid': 
'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'pid': '-1', 'rdma_port': 'N/A', 
'brick': '1hp1:/STORAGES/P1/GFS', 'port': 'N/A'}, {'status': 'OFFLINE', 
'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f', 'pid': '-1', 
'rdma_port': 'N/A', 'brick': '1hp2:/STORAGES/P1/GFS', 'port': 'N/A'}], 
'nfs': [{'status': 'OFFLINE', 'hostuuid': 
'f6568a3b-3d65-4f4f-be9f-14a5935e37a4', 'hostname': '172.16.5.151/24', 
'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'}, {'status': 'OFFLINE', 
'hostuuid': '8e87cf18-8958-41b7-8d24-7ee420a1ef9f', 'hostname': '1hp2', 
'pid': '-1', 'rdma_port': 'N/A', 'port': 'N/A'}], 'shd': [{'status': 
'ONLINE', 'hostname': '172.16.5.151/24', 'pid': '2148', 'hostuuid': 
'f6568a3b-3d65-4f4f-be9f-14a5935e37a4'}, {'status': 'ONLINE', 
'hostname': '1hp2', 'pid': '2146', 'hostuuid': 
'8e87cf18-8958-41b7-8d24-7ee420a1ef9f'}], 'name': '1HP12-R3A1P1'}}


-- 2 bricks were offline. I think the arbiter brick is not reported in 
the xml output - this is a bug.
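
For a replica 2+1 (arbiter) set with cluster.quorum-type: auto, client quorum 
needs at least 2 of the 3 bricks up - so with both data bricks reported OFFLINE 
the mount loses quorum even though the self-heal daemons show ONLINE. A rough 
sketch of that check against the brick list above (the quorum rule here is a 
reading of the 'auto' setting, not lifted from the gluster sources):

    # Abridged volumeStatus as returned by the GlusterVolume.status verb above;
    # only the fields used below are kept.
    volume_status = {
        'name': '1HP12-R3A1P1',
        'bricks': [
            {'brick': '1hp1:/STORAGES/P1/GFS', 'status': 'OFFLINE'},
            {'brick': '1hp2:/STORAGES/P1/GFS', 'status': 'OFFLINE'},
            # arbiter brick missing from the XML output - the suspected bug above
        ],
    }

    def client_quorum_met(bricks, replica_count=3):
        """cluster.quorum-type=auto: more than half of the replica set must be up."""
        online = sum(1 for b in bricks if b['status'] == 'ONLINE')
        return online > replica_count // 2

    print(client_quorum_met(volume_status['bricks']))   # False -> domain goes INVALID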


Similarly on 1hp2:
Thread-35::ERROR::2016-03-31 13:21:14,284::monitor::276::Storage.Monitor::(_monitorDomain) Error monitoring domain 14995860-1127-4dc4-b8c8-b540b89f9313
Traceback (most recent call last):
  ...
    raise OSError(errcode, errstr)
OSError: [Errno 2] No such file or directory
Thread-35::INFO::2016-03-31 13:21:14,285::monitor::299::Storage.Monitor::(_notifyStatusChanges) Domain 14995860-1127-4dc4-b8c8-b540b89f9313 became INVALID


Corresponding gluster mount log -
[2016-03-31 11:21:16.027640] W [MSGID: 108001] [afr-common.c:4093:afr_notify] 0-1HP12-R3A1P1-replicate-0: Client-quorum is not met

On 04/05/2016 07:02 PM, p...@email.cz wrote:

Hello Sahina,
please find attached the logs which you requested.

regs.
Pavel

On 5.4.2016 14:07, Sahina Bose wrote:



On 03/31/2016 06:41 PM, p...@email.cz wrote:

Hi,
the rest of the logs:
www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W 



The TEST is the last big event in the logs.
TEST TIME: about 14:00-14:30 CET


Thank you Pavel for the interesting test report and sharing the logs.

You are right - the master domain should not go down if 2 of 3 bricks 
are available from volume A (1HP12-R3A1P1).


I notice that host kvmarbiter was not responsive at 2016-03-31 13:27:19, 
but the ConnectStorageServerVDSCommand executed on the kvmarbiter node 
returned success at 2016-03-31 13:27:26.


Could you also share the vdsm logs from the 1hp1, 1hp2 and kvmarbiter 
nodes for this time period?


Ravi, Krutika - could you take a look at the gluster logs?




Re: [ovirt-users] ovirt with glusterfs - big test - unwanted results

2016-04-05 Thread Sahina Bose



On 03/31/2016 06:41 PM, p...@email.cz wrote:

Hi,
the rest of the logs:
www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W 



The TEST is the last big event in the logs.
TEST TIME: about 14:00-14:30 CET


Thank you Pavel for the interesting test report and sharing the logs.

You are right - the master domain should not go down if 2 of 3 bricks 
are available from volume A (1HP12-R3A1P1).


I notice that host kvmarbiter was not responsive at 2016-03-31 13:27:19, 
but the ConnectStorageServerVDSCommand executed on the kvmarbiter node 
returned success at 2016-03-31 13:27:26.


Could you also share the vdsm logs from the 1hp1, 1hp2 and kvmarbiter nodes 
for this time period?


Ravi, Krutika - could you take a look at the gluster logs?
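
The 7-second gap between the host going non-responsive and the 
ConnectStorageServerVDSCommand succeeding should also be visible directly in 
the engine log. A small sketch for pulling out the relevant lines - the 
/var/log/ovirt-engine/engine.log path is the usual default and the substring 
matching below is only a rough filter:

    import re

    ENGINE_LOG = '/var/log/ovirt-engine/engine.log'   # assumed default location

    # Print the engine events around the window in question so the
    # host-unresponsive timestamp and the storage connect result line up.
    interesting = re.compile(r'ConnectStorageServerVDSCommand|kvmarbiter')

    with open(ENGINE_LOG) as log:
        for line in log:
            if '2016-03-31 13:2' in line and interesting.search(line):
                print(line.rstrip())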





Re: [ovirt-users] ovirt with glusterfs - big test - unwanted results

2016-03-31 Thread p...@email.cz

Hello Yaniv,

we tried another small test - rebooting two nodes from the replica 3 + 
arbiter 1 volume ( 1HP12-R3A1P1 ) which holds the master domain.
All domains went down = master down, but the master domain didn't move to 
another available domain ( e.g. 2HP12-R3A1P1 ).


It looks like the "master domain" handling isn't correct ( has a bug ?? )

regs.
Pavel




Re: [ovirt-users] ovirt with glusterfs - big test - unwanted results

2016-03-31 Thread p...@email.cz

Hi,
the rest of the logs:
www.uschovna.cz/en/zasilka/HYGXR57CNHM3TP39-L3W 



The TEST is the last big event in the logs.
TEST TIME: about 14:00-14:30 CET

regs.
Pavel



Re: [ovirt-users] ovirt with glusterfs - big test - unwanted results

2016-03-31 Thread p...@email.cz

Hello,
some environment answers:
*
OS = RHEL - 7 - 2.151
kernel = 3.10.0 - 327.10.1.el7.x86_64
KVM = 2.3.0 - 31.el7_2.7.1
libvirt = libvirt-1.2.17-13.el7_2.3
vdsm = vdsm-4.17.23.2-0.el7
glusterfs = glusterfs-3.7.9-1.el7
ovirt = 3.5.6.2-1
*
# gluster peer status
Number of Peers: 4

Hostname: 1hp2
Uuid: 8e87cf18-8958-41b7-8d24-7ee420a1ef9f
State: Peer in Cluster (Connected)

Hostname: 2hp2
Uuid: b1d987d8-0b42-4ce4-b85f-83b4072e0990
State: Peer in Cluster (Connected)

Hostname: 2hp1
Uuid: a1cbe1a8-88ad-4e89-8a0e-d2bb2b6786d8
State: Peer in Cluster (Connected)

Hostname: kvmarbiter
Uuid: bb1d63f1-7757-4c07-b70d-aa2f68449e21
State: Peer in Cluster (Connected)
*
== "C" ==
Volume Name: 12HP12-D2R3A1P2
Type: Distributed-Replicate
Volume ID: 3c22d3dc-7c6e-4e37-9e0b-78410873ed6d
Status: Started
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: 1hp1:/STORAGES/P2/GFS
Brick2: 1hp2:/STORAGES/P2/GFS
Brick3: kvmarbiter:/STORAGES/P2-1/GFS (arbiter)
Brick4: 2hp1:/STORAGES/P2/GFS
Brick5: 2hp2:/STORAGES/P2/GFS
Brick6: kvmarbiter:/STORAGES/P2-2/GFS (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
*
== "A" ==
Volume Name: 1HP12-R3A1P1
Type: Replicate
Volume ID: e4121610-6128-4ecc-86d3-1429ab3b8356
Status: Started
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 1hp1:/STORAGES/P1/GFS
Brick2: 1hp2:/STORAGES/P1/GFS
Brick3: kvmarbiter:/STORAGES/P1-1/GFS (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
cluster.data-self-heal-algorithm: full
performance.write-behind: on
performance.low-prio-threads: 32
performance.write-behind-window-size: 128MB
network.ping-timeout: 10
*
== "B" ==
Volume Name: 2HP12-R3A1P1
Type: Replicate
Volume ID: d3d260cd-455f-42d6-9580-d88ae6df0519
Status: Started
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 2hp1:/STORAGES/P1/GFS
Brick2: 2hp2:/STORAGES/P1/GFS
Brick3: kvmarbiter:/STORAGES/P1-2/GFS (arbiter)
Options Reconfigured:
performance.readdir-ahead: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
storage.owner-uid: 36
storage.owner-gid: 36
features.shard: on
features.shard-block-size: 512MB
cluster.data-self-heal-algorithm: full
performance.write-behind: on
performance.low-prio-threads: 32
performance.write-behind-window-size: 128MB
network.ping-timeout: 10


The oVirt storage domains have the same names as the gluster volumes ( e.g. 
"B" = 2HP12-R3A1P1 ( oVirt storage domain ) = 2HP12-R3A1P1 ( gluster volume name ) )

In the test the master volume was "A" = 1HP12-R3A1P1

regs. Pavel
PS: the logs will follow as a webstore link ... this takes some time
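
With cluster.quorum-type: auto and cluster.server-quorum-type: server set on 
these 2+1 volumes, losing both data bricks of a set is enough to cut clients 
off, so it is worth checking how many bricks of the master volume are actually 
up when the domain goes INVALID. A sketch that parses the same --xml status 
output vdsm uses - the element names are from memory of the 3.7-era CLI and may 
differ, and note Sahina's comment earlier in this thread that the arbiter brick 
may be missing from this output:

    import subprocess
    import xml.etree.ElementTree as ET

    VOLUME = '1HP12-R3A1P1'   # master volume "A"

    # Same data source as vdsm's GlusterVolume.status bridge; run on a peer node.
    xml_out = subprocess.check_output(['gluster', 'volume', 'status', VOLUME, '--xml'])
    root = ET.fromstring(xml_out)

    online = 0
    for node in root.iter('node'):
        path = node.findtext('path') or ''
        status = node.findtext('status')        # '1' = online, '0' = offline
        if path.startswith('/'):                # heuristic: skip NFS/self-heal rows
            state = 'ONLINE' if status == '1' else 'OFFLINE'
            print('%s:%s -> %s' % (node.findtext('hostname'), path, state))
            online += status == '1'

    print('bricks online: %d (need at least 2 of 3 for client quorum)' % online)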



Re: [ovirt-users] ovirt with glusterfs - big test - unwanted results

2016-03-31 Thread Yaniv Kaul
Hi Pavel,

Thanks for the report. Can you begin with a more accurate description of
your environment?
Begin with host, oVirt and Gluster versions. Then continue with the exact
setup (what are 'A', 'B', 'C' - domains? Volumes? What is the mapping
between domains and volumes?).

Are there any logs you can share with us?

I'm sure with more information, we'd be happy to look at the issue.
Y.




[ovirt-users] ovirt with glusterfs - big test - unwanted results

2016-03-31 Thread p...@email.cz

Hello,
we tried the following test - with unwanted results

input:
5-node gluster cluster
A = replica 3 with arbiter 1 ( node1+node2+arbiter on node 5 )
B = replica 3 with arbiter 1 ( node3+node4+arbiter on node 5 )
C = distributed replica 3 with arbiter 1 ( node1+node2, node3+node4, each 
arbiter on node 5 )

node 5 hosts only arbiter bricks ( 4x )

TEST:
1)  directly reboot one node - OK ( it does not matter which one ( data node 
or arbiter node ))
2)  directly reboot two nodes - OK ( if the nodes are not from the same 
replica )
3)  directly reboot three nodes - yes, this is the main problem and raises 
the questions below
    - rebooted all three nodes from replica "B" ( not very likely, but 
who knows ... )

    - all VMs with data on this replica were paused ( no data access ) - OK
    - all VMs running on the replica "B" nodes were lost ( started manually 
later )( data on other replicas ) - acceptable

BUT
    - !!! all oVirt domains went down !! - the master domain is on replica 
"A", which lost only one member of three !!!
    so we were not expecting that all domains would go down, especially 
the master with 2 live members.


Results:
    - the whole cluster was unreachable until all domains were up again - 
i.e. dependent on all nodes being up !!!

    - all paused VMs started back - OK
    - the rest of the VMs rebooted and are running - OK

Questions:
1) why did all domains go down if the master domain ( on replica "A" ) has 
two running members ( 2 of 3 ) ??
2) how can we fix that collapse without waiting for all nodes to come up ? 
( e.g. in the worst case, if a node has a HW error ) ??
3) which oVirt cluster policy can prevent that situation ?? ( if 
any )


regs.
Pavel
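
On question 2: until the root cause is fixed, the quickest way to see when 
oVirt considers a domain usable again is to watch the domain monitor messages 
in vdsm.log (the 'Domain ... became INVALID' lines quoted earlier in the 
thread). A minimal sketch, assuming the default log location and that the 
recovery message uses the same 'became VALID' wording:

    import re
    import time

    VDSM_LOG = '/var/log/vdsm/vdsm.log'     # default location, adjust if needed
    pattern = re.compile(r'Domain ([0-9a-f-]{36}) became (VALID|INVALID)')

    # Follow the log like `tail -f` and report storage domain state changes.
    with open(VDSM_LOG) as log:
        log.seek(0, 2)                      # start at the end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            m = pattern.search(line)
            if m:
                print('domain %s is now %s' % m.groups())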


___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users