[ovirt-users] Re: HE + Gluster : Engine corrupted?

2018-07-18 Thread Ravishankar N

Hi,

"[2018-06-16 04:00:10.264690] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0."


Are the gfids actually different on the bricks as this message says? If 
yes, then the commands shared earlier should have fixed it.
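A quick way to check that (a sketch; the brick path below combines the
brick root from the volume info posted elsewhere in this thread with the
path from the LOOKUP errors) is to compare the trusted.gfid xattr on
each brick:

# Run on each of the three brick hosts:
getfattr -d -m . -e hex \
  /gluster_bricks/engine/engine/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
# The file is in gfid split-brain if the trusted.gfid values differ
# between bricks.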


-Ravi


On 07/02/2018 02:15 PM, Krutika Dhananjay wrote:

Hi,

So it seems some of the files in the volume have mismatching gfids. I 
see the following logs from 15th June, ~8pm EDT:



...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4411: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008] 
[afr-self-heal-common.c:212:afr_gfid_split_brain_source] 
0-engine-replicate-0: All the bricks should be up to resolve the gfid 
split barin
[2018-06-16 04:00:11.522632] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4493: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008] 
[afr-self-heal-common.c:212:afr_gfid_split_brain_source] 
0-engine-replicate-0: All the bricks should be up to resolve the gfid 
split barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4575: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4657: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4739: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4821: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4906: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4987: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)

...
...


Adding Ravi, who works on the replicate component, to help resolve the
mismatches.


-Krutika


On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay
<kdhan...@redhat.com> wrote:


Hi,

Sorry, I was out sick on Friday. I am looking into the logs. Will
get back to you in some time.

-Krutika

On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner
<han...@andrewswireless.net> wrote:

Hi Krutika,

Did you need any other logs?


Thanks,

Hanson


On 06/27/2018 02:04 PM, Hanson Turner wrote:


Hi Krutika,

Looking at the email spam, it looks like it started at
8:04 PM EDT on Jun 15, 2018.

From my memory, I think the cluster was working fine until
sometime that night. Somewhere between midnight and the next
(Saturday) morning, the engine crashed and all VMs stopped.

I do have nightly backups, taken with the engine-backup
command. It looks like my last valid backup is from 2018-06-15.
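
For reference, a typical nightly invocation of engine-backup looks
something like the following (paths here are illustrative):

engine-backup --mode=backup \
  --file=/backup/engine-backup-$(date +%F).tar.gz \
  --log=/backup/engine-backup-$(date +%F).log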

I've included all logs I think might be of use. Please
forgive the use of 7zip; the raw logs come to 50 MB, which is
more than my attachment limit.

I think the gist of what happened is that we had a downed node
for a period of time. Earlier that day, the node was brought
back into service. Later that night or early the next
morning, the engine was gone and hopping from node to node.

[ovirt-users] Re: HE + Gluster : Engine corrupted?

2018-07-03 Thread Hanson Turner

Hi Ravishankar,

This doesn't look like split-brain...

[root@ovirtnode1 ~]# gluster volume heal engine info
Brick ovirtnode1:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0

Brick ovirtnode3:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0

Brick ovirtnode4:/gluster_bricks/engine/engine
Status: Connected
Number of entries: 0

[root@ovirtnode1 ~]# gluster volume heal engine info split-brain
Brick ovirtnode1:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0

Brick ovirtnode3:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0

Brick ovirtnode4:/gluster_bricks/engine/engine
Status: Connected
Number of entries in split-brain: 0

[root@ovirtnode1 ~]# gluster volume info engine
Volume Name: engine
Type: Replicate
Volume ID: c8dc1b04-bc25-4e97-81bb-4d94929918b1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: ovirtnode1:/gluster_bricks/engine/engine
Brick2: ovirtnode3:/gluster_bricks/engine/engine
Brick3: ovirtnode4:/gluster_bricks/engine/engine

Thanks,

Hanson


On 07/02/2018 07:09 AM, Ravishankar N wrote:




On 07/02/2018 02:15 PM, Krutika Dhananjay wrote:

Hi,

So it seems some of the files in the volume have mismatching gfids. I 
see the following logs from 15th June, ~8pm EDT:



...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.


You can use
https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/
(see section 3, "Resolution of split-brain using gluster CLI").
Nit: the doc says at the beginning that gfid split-brain cannot be
fixed automatically, but newer releases do support it, so the methods
in section 3 should work to resolve gfid split-brains.
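
For example (a sketch only; the file path is taken from the LOOKUP
errors below, and which copy to keep is the admin's call):

# See what heal reports first:
gluster volume heal engine info split-brain
# Keep the copy with the latest modification time:
gluster volume heal engine split-brain latest-mtime \
  /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace
# ...or explicitly keep one brick's copy as the source:
gluster volume heal engine split-brain source-brick \
  ovirtnode1:/gluster_bricks/engine/engine \
  /c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace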


[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4411: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008] 
[afr-self-heal-common.c:212:afr_gfid_split_brain_source] 
0-engine-replicate-0: All the bricks should be up to resolve the gfid 
split barin

This is a concern. For the commands to work, all 3 bricks must be online.
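A quick sanity check before retrying, as a sketch:

gluster volume status engine
# All three brick processes should show "Online: Y" before the
# split-brain resolution commands are run again.
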
Thanks,
Ravi
[2018-06-16 04:00:11.522632] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4493: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008] 
[afr-self-heal-common.c:212:afr_gfid_split_brain_source] 
0-engine-replicate-0: All the bricks should be up to resolve the gfid 
split barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4575: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4657: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4739: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4821: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4906: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4987: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)

...
...


Adding Ravi, who works on the replicate component, to help resolve the
mismatches.


-Krutika


On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay
<kdhan...@redhat.com> wrote:


Hi,

Sorry, I was out sick on Friday. I am looking into the logs. Will
get back to you in some time.

-Krutika

On 

[ovirt-users] Re: HE + Gluster : Engine corrupted?

2018-07-02 Thread Ravishankar N



On 07/02/2018 02:15 PM, Krutika Dhananjay wrote:

Hi,

So it seems some of the files in the volume have mismatching gfids. I 
see the following logs from 15th June, ~8pm EDT:



...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.


You can use
https://docs.gluster.org/en/latest/Troubleshooting/resolving-splitbrain/
(see section 3, "Resolution of split-brain using gluster CLI").
Nit: the doc says at the beginning that gfid split-brain cannot be
fixed automatically, but newer releases do support it, so the methods
in section 3 should work to resolve gfid split-brains.


[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4411: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008] 
[afr-self-heal-common.c:212:afr_gfid_split_brain_source] 
0-engine-replicate-0: All the bricks should be up to resolve the gfid 
split barin

This is a concern. For the commands to work, all 3 bricks must be online.
Thanks,
Ravi
[2018-06-16 04:00:11.522632] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4493: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008] 
[afr-self-heal-common.c:212:afr_gfid_split_brain_source] 
0-engine-replicate-0: All the bricks should be up to resolve the gfid 
split barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008] 
[afr-self-heal-common.c:335:afr_gfid_split_brain_source] 
0-engine-replicate-0: Gfid mismatch detected for 
/hosted-engine.lockspace>, 
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and 
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4575: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4657: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4739: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4821: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4906: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk] 
0-glusterfs-fuse: 4987: LOOKUP() 
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace 
=> -1 (Input/output error)

...
...


Adding Ravi, who works on the replicate component, to help resolve the
mismatches.


-Krutika


On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay
<kdhan...@redhat.com> wrote:


Hi,

Sorry, I was out sick on Friday. I am looking into the logs. Will
get back to you in some time.

-Krutika

On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner
<han...@andrewswireless.net> wrote:

Hi Krutika,

Did you need any other logs?


Thanks,

Hanson


On 06/27/2018 02:04 PM, Hanson Turner wrote:


Hi Krutika,

Looking at the email spam, it looks like it started at
8:04 PM EDT on Jun 15, 2018.

From my memory, I think the cluster was working fine until
sometime that night. Somewhere between midnight and the next
(Saturday) morning, the engine crashed and all VMs stopped.

I do have nightly backups, taken with the engine-backup
command. It looks like my last valid backup is from 2018-06-15.

I've included all logs I think might be of use. Please
forgive the use of 7zip; the raw logs come to 50 MB, which is
more than my attachment limit.

I think the gist of what happened is that we had a downed node
for a period of time. Earlier that day, the node was brought
back into service. Later that night or early the next
morning, the engine was gone and hopping from node to node.


[ovirt-users] Re: HE + Gluster : Engine corrupted?

2018-07-02 Thread Krutika Dhananjay
Hi,

So it seems some of the files in the volume have mismatching gfids. I see
the following logs from 15th June, ~8pm EDT:


...
...
[2018-06-16 04:00:10.264690] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:10.265861] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4411: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:11.522600] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid split
barin
[2018-06-16 04:00:11.522632] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:11.523750] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4493: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:12.864393] E [MSGID: 108008]
[afr-self-heal-common.c:212:afr_gfid_split_brain_source]
0-engine-replicate-0: All the bricks should be up to resolve the gfid split
barin
[2018-06-16 04:00:12.864426] E [MSGID: 108008]
[afr-self-heal-common.c:335:afr_gfid_split_brain_source]
0-engine-replicate-0: Gfid mismatch detected for
/hosted-engine.lockspace>,
6bbe6097-8520-4a61-971e-6e30c2ee0abe on engine-client-2 and
ef21a706-41cf-4519-8659-87ecde4bbfbf on engine-client-0.
[2018-06-16 04:00:12.865392] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4575: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:18.716007] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4657: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:20.553365] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4739: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:21.771698] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4821: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:23.871647] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4906: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
[2018-06-16 04:00:25.034780] W [fuse-bridge.c:540:fuse_entry_cbk]
0-glusterfs-fuse: 4987: LOOKUP()
/c65e03f0-d553-4d5d-ba4f-9d378c153b9b/ha_agent/hosted-engine.lockspace =>
-1 (Input/output error)
...
...


Adding Ravi, who works on the replicate component, to help resolve the mismatches.

-Krutika


On Mon, Jul 2, 2018 at 12:27 PM, Krutika Dhananjay <kdhan...@redhat.com>
wrote:

> Hi,
>
> Sorry, I was out sick on Friday. I am looking into the logs. Will get back
> to you in some time.
>
> -Krutika
>
> On Fri, Jun 29, 2018 at 7:47 PM, Hanson Turner <han...@andrewswireless.net>
> wrote:
>
>> Hi Krutika,
>>
>> Did you need any other logs?
>>
>>
>> Thanks,
>>
>> Hanson
>>
>> On 06/27/2018 02:04 PM, Hanson Turner wrote:
>>
>> Hi Krutika,
>>
>> Looking at the email spam, it looks like it started at 8:04 PM EDT on Jun
>> 15, 2018.
>>
>> From my memory, I think the cluster was working fine until sometime that
>> night. Somewhere between midnight and the next (Saturday) morning, the
>> engine crashed and all VMs stopped.
>>
>> I do have nightly backups, taken with the engine-backup command. It looks
>> like my last valid backup is from 2018-06-15.
>>
>> I've included all logs I think might be of use. Please forgive the use of
>> 7zip; the raw logs come to 50 MB, which is more than my attachment limit.
>>
>> I think the gist of what happened is that we had a downed node for a period
>> of time. Earlier that day, the node was brought back into service. Later
>> that night or early the next morning, the engine was gone and hopping from
>> node to node.
>>
>> I have tried to mount the engine's hdd file to see if I could fix it.
>> There are a few corrupted partitions, and those are xfs-formatted. Trying
>> to mount gives me errors about needing repair; trying to repair gives me
>> errors about needing something cleaned first. I cannot remember exactly
>> what it was, but it wanted me to run a command ending in -L to clear out
>> the logs. I said no way, and have left the engine VM in a powered-down
>> state and the cluster in global maintenance.
>>
>> I can see no sign of the vm booting, (ie no networking) except for what
>> I've described earlier in the VNC session.
>>
>>
>> Thanks,
>>
>> Hanson
>>
>>
>>
>> On 06/27/2018 12:04 PM, 

[ovirt-users] Re: HE + Gluster : Engine corrupted?

2018-06-25 Thread Krutika Dhananjay
Could you share the gluster mount and brick logs? You'll find them under
/var/log/glusterfs.
Also, what's the version of gluster you're using?
Also, the output of `gluster volume info <volname>`?
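
Something along these lines would cover all three (a sketch; the archive
name is illustrative):

gluster --version
gluster volume info engine
# Mount and brick logs:
tar czf /tmp/glusterfs-logs.tar.gz /var/log/glusterfs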

-Krutika

On Thu, Jun 21, 2018 at 9:50 AM, Sahina Bose wrote:

>
>
> On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <han...@andrewswireless.net>
> wrote:
>
>> Hi Benny,
>>
>> Who should I be reaching out to for help with a gluster based hosted
>> engine corruption?
>>
>
>
> Krutika, could you help?
>
>
>>
>> --== Host 1 status ==--
>>
>> conf_on_shared_storage : True
>> Status up-to-date  : True
>> Hostname   : ovirtnode1.abcxyzdomains.net
>> Host ID: 1
>> Engine status  : {"reason": "failed liveliness
>> check", "health": "bad", "vm": "up", "detail": "Up"}
>> Score  : 3400
>> stopped: False
>> Local maintenance  : False
>> crc32  : 92254a68
>> local_conf_timestamp   : 115910
>> Host timestamp : 115910
>> Extra metadata (valid at timestamp):
>> metadata_parse_version=1
>> metadata_feature_version=1
>> timestamp=115910 (Mon Jun 18 09:43:20 2018)
>> host-id=1
>> score=3400
>> vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
>> conf_on_shared_storage=True
>> maintenance=False
>> state=GlobalMaintenance
>> stopped=False
>>
>>
>> When I VNC into my HE, all I get is:
>> Probing EDD (edd=off to disable)... ok
>>
>>
>> So, that's why it's failing the liveliness check... I cannot get the
>> screen on the HE to change short of Ctrl-Alt-Del, which will reboot the HE.
>> I do have backups for the HE that are/were run on a nightly basis.
>>
>> If the cluster was left alone, the HE VM would bounce from machine to
>> machine trying to boot. This is why the cluster is in maintenance mode.
>> One of the nodes was down for a period of time and was brought back;
>> sometime through the night, around when the automated backup kicks in,
>> the HE started bouncing around. I got nearly 1000 emails.
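
For reference, a sketch of the hosted-engine CLI calls behind the status
block and maintenance state described above:

# Prints the "--== Host 1 status ==--" block shown earlier:
hosted-engine --vm-status
# Keeps the HA agents from bouncing the engine VM while debugging:
hosted-engine --set-maintenance --mode=global
# Resume HA management once the engine disk is fixed:
hosted-engine --set-maintenance --mode=none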
>>
>> This seems to be the same error (but may not be the same cause) as listed
>> here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>>
>> Thanks,
>>
>> Hanson
>>
>>


[ovirt-users] Re: HE + Gluster : Engine corrupted?

2018-06-20 Thread Sahina Bose
On Wed, Jun 20, 2018 at 11:33 PM, Hanson Turner <han...@andrewswireless.net>
wrote:

> Hi Benny,
>
> Who should I be reaching out to for help with a gluster based hosted
> engine corruption?
>


Krutika, could you help?


>
> --== Host 1 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date  : True
> Hostname   : ovirtnode1.abcxyzdomains.net
> Host ID: 1
> Engine status  : {"reason": "failed liveliness check",
> "health": "bad", "vm": "up", "detail": "Up"}
> Score  : 3400
> stopped: False
> Local maintenance  : False
> crc32  : 92254a68
> local_conf_timestamp   : 115910
> Host timestamp : 115910
> Extra metadata (valid at timestamp):
> metadata_parse_version=1
> metadata_feature_version=1
> timestamp=115910 (Mon Jun 18 09:43:20 2018)
> host-id=1
> score=3400
> vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
> conf_on_shared_storage=True
> maintenance=False
> state=GlobalMaintenance
> stopped=False
>
>
> When I VNC into my HE, all I get is:
> Probing EDD (edd=off to disable)... ok
>
>
> So, that's why it's failing the liveliness check... I cannot get the
> screen on the HE to change short of Ctrl-Alt-Del, which will reboot the HE.
> I do have backups for the HE that are/were run on a nightly basis.
>
> If the cluster was left alone, the HE VM would bounce from machine to
> machine trying to boot. This is why the cluster is in maintenance mode.
> One of the nodes was down for a period of time and was brought back;
> sometime through the night, around when the automated backup kicks in,
> the HE started bouncing around. I got nearly 1000 emails.
>
> This seems to be the same error (but may not be the same cause) as listed
> here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>
> Thanks,
>
> Hanson
>
>


[ovirt-users] Re: HE + Gluster : Engine corrupted?

2018-06-20 Thread Greg Sheremeta
+Benny

On Wed, Jun 20, 2018 at 2:07 PM Hanson Turner <han...@andrewswireless.net>
wrote:

> Hi Benny,
>
> Who should I be reaching out to for help with a gluster based hosted
> engine corruption?
>
>
> --== Host 1 status ==--
>
> conf_on_shared_storage : True
> Status up-to-date  : True
> Hostname   : ovirtnode1.abcxyzdomains.net
> Host ID: 1
> Engine status  : {"reason": "failed liveliness check",
> "health": "bad", "vm": "up", "detail": "Up"}
> Score  : 3400
> stopped: False
> Local maintenance  : False
> crc32  : 92254a68
> local_conf_timestamp   : 115910
> Host timestamp : 115910
> Extra metadata (valid at timestamp):
> metadata_parse_version=1
> metadata_feature_version=1
> timestamp=115910 (Mon Jun 18 09:43:20 2018)
> host-id=1
> score=3400
> vm_conf_refresh_time=115910 (Mon Jun 18 09:43:20 2018)
> conf_on_shared_storage=True
> maintenance=False
> state=GlobalMaintenance
> stopped=False
>
>
> When I VNC into my HE, all I get is:
> Probing EDD (edd=off to disable)... ok
>
>
> So, that's why it's failing the liveliness check... I cannot get the
> screen on the HE to change short of Ctrl-Alt-Del, which will reboot the HE.
> I do have backups for the HE that are/were run on a nightly basis.
>
> If the cluster was left alone, the HE VM would bounce from machine to
> machine trying to boot. This is why the cluster is in maintenance mode.
> One of the nodes was down for a period of time and was brought back;
> sometime through the night, around when the automated backup kicks in,
> the HE started bouncing around. I got nearly 1000 emails.
>
> This seems to be the same error (but may not be the same cause) as listed
> here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1569827
>
> Thanks,
>
> Hanson
>