On 03/03/2014 03:25 PM, Yedidyah Bar David wrote:
----- Original Message -----
From: "René Koch" <[email protected]>
To: "Yedidyah Bar David" <[email protected]>, "Martin Sivak" <[email protected]>
Cc: [email protected]
Sent: Monday, March 3, 2014 4:10:51 PM
Subject: Re: [Users] hosted engine issues

On 03/03/2014 02:13 PM, Yedidyah Bar David wrote:
Me neither. Is everything read-write? A read-only filesystem might report
"no space left on device" as well in some cases. Other than that, I do not know.
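A quick way to check the read-write question is to look at the mount options. This is a minimal sketch that assumes the standard /proc/mounts layout (device, mount point, fstype, options, dump, pass). The function name ro_mounts is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list mounts whose option field contains "ro",
# since a read-only filesystem can also surface as "No space left on device".
# /proc/mounts fields: device mountpoint fstype options dump pass

# ro_mounts [MOUNTS_FILE]   (defaults to /proc/mounts)
ro_mounts() {
  # Match "ro" as a whole comma-separated option, not e.g. "relatime".
  awk '$4 ~ /(^|,)ro(,|$)/ {print $2, $3}' "${1:-/proc/mounts}"
}
```

Running `ro_mounts` on each host prints one "mountpoint fstype" line per read-only mount, and nothing if everything is mounted read-write.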

Perhaps some IPC resource? Semaphores?

Please check:

ipcs

cat /proc/sys/kernel/sem

I know nothing about libvirt, that's just a wild guess.

# ipcs

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        80         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 0          root       600        1
0x00000000 65537      root       600        1
0x000000a7 163842     root       600        1

This means you have 3 semaphore sets, of one semaphore each.


------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages


The rest also shows moderate usage.

# cat /proc/sys/kernel/sem
250     32000   32      128

So you are far from the maxima (250 per set, 32000 total, 128 sets).
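The same comparison can be scripted. This is a sketch, assuming the field order documented in proc(5) for /proc/sys/kernel/sem (SEMMSL, SEMMNS, SEMOPM, SEMMNI) and the `ipcs -s` layout shown above, where column 5 is nsems. The function name sem_report is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare System V semaphore usage with kernel limits.
# /proc/sys/kernel/sem fields (proc(5)): SEMMSL SEMMNS SEMOPM SEMMNI

# sem_report SEM_LIMITS IPCS_S_OUTPUT
#   SEM_LIMITS    - contents of /proc/sys/kernel/sem, e.g. "250 32000 32 128"
#   IPCS_S_OUTPUT - output of `ipcs -s`
sem_report() {
  local semmsl semmns semopm semmni
  read -r semmsl semmns semopm semmni <<< "$1"

  # Data rows of `ipcs -s` start with a hex key (0x...); column 5 is nsems.
  local sets sems maxset
  sets=$(awk '$1 ~ /^0x/ {n++} END {print n+0}' <<< "$2")
  sems=$(awk '$1 ~ /^0x/ {t += $5} END {print t+0}' <<< "$2")
  maxset=$(awk '$1 ~ /^0x/ && $5 > m {m = $5} END {print m+0}' <<< "$2")

  echo "sets: $sets of $semmni (SEMMNI)"
  echo "semaphores: $sems of $semmns (SEMMNS)"
  echo "largest set: $maxset of $semmsl (SEMMSL)"
  echo "ops per semop call limit: $semopm (SEMOPM)"
}
```

Usage: `sem_report "$(cat /proc/sys/kernel/sem)" "$(ipcs -s)"`. With the output pasted above it reports 3 of 128 sets, 3 of 32000 semaphores and a largest set of 1 of 250, i.e. nowhere near the limits.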



Do you see anything in this output?
I have no clue how to interpret this...

See e.g. http://man7.org/linux/man-pages/man5/proc.5.html

Is the above from a node or from the engine? Are both nodes similar? If so,
that is not the reason for the "no space left on device".

Same on both hosts.
These are CentOS 6.5 hosts which are the base for hosted engine.


If this error is reproducible, you can try to find the process that this
happens to (perhaps libvirtd, vdsmd, or the hosted-engine ha daemon) and do:
strace -f -o /tmp/trace1 -tt -s 512 -p PID
where PID is the pid of that process, then search /tmp/trace1 for 'no space
left on device' and see the exact call that failed.
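The search step can be done with grep. This is a sketch, assuming the trace was written with `-f`, so the first column of each line is the PID. A failed syscall ends in "= -1 ENOSPC (No space left on device)", but the same text can also appear inside a string a process writes to its own log. The function names find_enospc and enospc_pids are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for searching a strace log such as /tmp/trace1.

# find_enospc TRACE_FILE
#   Print every line mentioning ENOSPC or the error text, with line numbers.
find_enospc() {
  grep -n -i -e 'ENOSPC' -e 'no space left on device' -- "$1"
}

# enospc_pids TRACE_FILE
#   List the distinct PIDs (first column with strace -f) on those lines.
enospc_pids() {
  grep -i -e 'ENOSPC' -e 'no space left on device' -- "$1" \
    | awk '{print $1}' | sort -u
}
```

Usage: `find_enospc /tmp/trace1`. The surrounding lines for the same PID usually show which call (and which file descriptor) preceded the error.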

Thanks a lot for the troubleshooting tips.
I found out the following:

strace of libvirtd:

3296 17:10:05.396192 write(4, "2014-03-03 16:10:05.396+0000: 3296: error : virLockManagerSanlockAcquire:974 : Failed to acquire lock: No space left on device\n", 127 <unfinished ...>

Then I checked sanlock.log, where I found the following error message (which could be the reason for "No space left on device"):

2014-03-03 17:10:05+0100 25094 [3105]: r6 cmd_acquire 2,9,11852 invalid lockspace found -1 failed 0 name 2851af27-8744-445d-9fb1-a0d083c8dc82
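To see whether other lockspaces are affected, the names can be pulled out of every such line. This is a sketch that assumes all "invalid lockspace found" lines end with "name <lockspace>" as in the single line quoted above; other sanlock log formats are not covered. The function name invalid_lockspaces is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the distinct lockspace names that sanlock
# reported as invalid. Assumes the "... name <lockspace>" line ending
# seen in the quoted sanlock.log line.

# invalid_lockspaces SANLOCK_LOG
invalid_lockspaces() {
  grep 'invalid lockspace found' -- "$1" \
    | sed -n 's/.* name \([^ ]*\).*/\1/p' \
    | sort -u
}
```

Usage: `invalid_lockspaces /path/to/sanlock.log`. For the line above it prints 2851af27-8744-445d-9fb1-a0d083c8dc82, the same UUID that appears as the directory name in the engine mount path.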

So my question now is whether I can remove the lockspace file (it should be hosted-engine.lockspace, located in /rhev/data-center/mnt/ovirt-host01\:_engine/2851af27-8744-445d-9fb1-a0d083c8dc82/ha_agent/, right?) and have it created again. I fear that the GlusterFS split-brain destroyed it, as this file was affected.


Thanks,
René



_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users
