[ovirt-users] Re: ovirt 4.4 and CentOS 8 and multipath with Equallogic

2021-02-01 Thread David Teigland
On Mon, Feb 01, 2021 at 07:18:24PM +0200, Nir Soffer wrote:
> Assuming we could use:
> 
> io_timeout = 10
> renewal_retries = 8
> 
> The worst case would be:
> 
>  00 sanlock renewal succeeds
>  19 storage fails
>  20 sanlock tries to renew lease 1/7 (timeout=10)
>  30 sanlock renewal timeout
>  40 sanlock tries to renew lease 2/7 (timeout=10)
>  50 sanlock renewal timeout
>  60 sanlock tries to renew lease 3/7 (timeout=10)
>  70 sanlock renewal timeout
>  80 sanlock tries to renew lease 4/7 (timeout=10)
>  90 sanlock renewal timeout
> 100 sanlock tries to renew lease 5/7 (timeout=10)
> 110 sanlock renewal timeout
> 120 sanlock tries to renew lease 6/7 (timeout=10)
> 130 sanlock renewal timeout
> 139 storage is back
> 140 sanlock tries to renew lease 7/7 (timeout=10)
> 140 sanlock renewal succeeds
> 
> David, what do you think?

I wish I could say; it would require some careful study to know how
feasible it is.  The timings are intricate and fundamental to the correctness
of the algorithm.
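
For playing with the numbers, here is a rough sketch of the arithmetic in the
timeline above (it only mirrors the example's assumption of one renewal
attempt every 2*io_timeout seconds; it is not how sanlock itself schedules
i/o):

  io_timeout=10
  retries=7        # attempt numbering taken from the example above
  for i in $(seq 1 "$retries"); do
      t=$((20 + (i - 1) * 2 * io_timeout))
      printf '%3d sanlock tries to renew lease %d/%d (timeout=%d)\n' \
             "$t" "$i" "$retries" "$io_timeout"
  done
  # storage has to be back before the last attempt, here at t=140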
Dave


[ovirt-users] Re: [EXTERNAL] Re: Storage Domain won't activate

2020-09-04 Thread David Teigland
On Sat, Sep 05, 2020 at 12:25:45AM +0300, Nir Soffer wrote:
> > > /var/log/sanlock.log contains a repeating:
> > > add_lockspace
> > > e1270474-108c-4cae-83d6-51698cffebbf:1:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> > > conflicts with name of list1 s1
> > > e1270474-108c-4cae-83d6-51698cffebbf:3:/dev/e1270474-108c-4cae-83d6-51698cffebbf/ids:0
> 
> David, what does this message mean?
> 
> It is clear that there is a conflict, but it is not clear what the
> conflicting item is. The host id in the request is 1, and in the
> conflicting item it is 3. No conflicting data is displayed in the error
> message.

The lockspace being added is already being managed by sanlock, but using
host_id 3.  sanlock.log should show when lockspace e1270474 with host_id 3
was added.
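
To see which lockspaces a host is already managing, and with which host_id,
something like this should help (a sketch; the exact output format differs
between sanlock versions):

  sanlock client gets      # one line per joined lockspace, including host_id
  sanlock client status    # more verbose: lockspaces plus held resources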

Dave


[ovirt-users] Re: Failed to activate Storage Domain --- ovirt 4.2

2019-06-10 Thread David Teigland
On Mon, Jun 10, 2019 at 10:59:43PM +0300, Nir Soffer wrote:
> > [root@uk1-ion-ovm-18  pvscan
> >   /dev/mapper/36000d31005697814: Checksum error at offset 4397954425856
> >   Couldn't read volume group metadata from /dev/mapper/36000d31005697814.
> >   Metadata location on /dev/mapper/36000d31005697814 at 4397954425856 has invalid summary for VG.
> >   Failed to read metadata summary from /dev/mapper/36000d31005697814
> >   Failed to scan VG from /dev/mapper/36000d31005697814
> 
> This looks like corrupted vg metadata.

Yes, the second metadata area, at the end of the device, is corrupted; the
first metadata area is probably ok.  That version of lvm is not able to
continue using just the one good copy.

Last week I pushed out major changes to LVM upstream to be able to handle
and repair most of these cases.  So, one option is to build lvm from the
upstream master branch, and check if that can read and repair this
metadata.

> David, we keep 2 metadata copies on the first PV. Can we use one of the
> copies on the PV to restore the metadata to the last good state?

pvcreate with --restorefile and --uuid, and with the right backup metadata
could probably correct things, but experiment with some temporary PVs
first.
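
A rough sketch of that kind of repair, where the placeholders in <> are
hypothetical names to fill in from your own metadata backup (experiment on
scratch PVs first, and take a copy of the device before touching it):

  pvcreate --uuid <pv-uuid-from-backup> \
           --restorefile /etc/lvm/backup/<vg-name> \
           /dev/mapper/36000d31005697814    # may also need -ff if an old label is still detected
  vgcfgrestore -f /etc/lvm/backup/<vg-name> <vg-name>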


[ovirt-users] Re: iSCSI constantly reading from storage (about 8 Mbps) after connecting hosts to storage

2018-10-17 Thread David Teigland
On Wed, Oct 17, 2018 at 11:37:33PM +0300, Nir Soffer wrote:
> - sanlock reads 1MiB from the logical volume "domain-uuid/ids" every 20
> seconds

Every 20 seconds sanlock reads 1MB and writes 512 bytes to monitor and
renew its leases in the lockspace.
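
As a back-of-the-envelope figure (arithmetic only, not something sanlock
reports), per lockspace and per host that works out to roughly

  echo "scale=2; 1048576 * 8 / 20 / 1000000" | bc   # ~0.42 Mbit/s of reads

so a handful of storage domains and hosts can add up to a constant few Mbit/s
of background reads.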


Re: [ovirt-users] iSCSI domain on 4kn drives

2017-11-13 Thread David Teigland
On Sat, Nov 11, 2017 at 12:24:25AM +, Nir Soffer wrote:
> David, do you know if 4k disks over NFS works for sanlock?

When using files, sanlock always does 512-byte i/o.  This can be a problem
when 4k-sector disks are used under NFS.  On block devices, sanlock detects
the sector size (with libblkid) and uses 512 or 4k i/o accordingly.
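
To check what sector size a block device reports (a sketch with a
hypothetical device name; sanlock uses libblkid internally, but these read
the same kernel-reported values):

  blockdev --getss /dev/sdX                    # logical sector size: 512 or 4096
  blockdev --getpbsz /dev/sdX                  # physical sector size
  cat /sys/block/sdX/queue/logical_block_size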

If vdsm knows when to use 4k i/o over files, I can add a sanlock flag that
vdsm can use to create 4k sanlock leases.

Dave



Re: [ovirt-users] moving disk failed.. remained locked

2017-02-23 Thread David Teigland
On Thu, Feb 23, 2017 at 08:11:50PM +0200, Nir Soffer wrote:
> > [g.cecchi@ovmsrv05 ~]$ sudo sanlock client renewal -s
> > 922b5269-ab56-4c4d-838f-49d33427e2ab
> > timestamp=1207533 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > timestamp=1207554 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > ...
> > timestamp=1211163 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > timestamp=1211183 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> > timestamp=1211204 read_ms=2 write_ms=0 next_timeouts=0 next_errors=0
> >
> > How do I translate this output above? What would be the difference in case
> > of problems?
> 
> David, can you explain this output?
> 
> read_ms and write_ms looks obvious, but next_timeout and next_errors are
> a mystery to me.

Sorry for copying, but I think I explained it better back then than I could now!
(I need to include this somewhere in the man page.)

commit 6313c709722b3ba63234a75d1651a160bf1728ee
Author: David Teigland <teigl...@redhat.com>
Date:   Wed Mar 9 11:58:21 2016 -0600

sanlock: renewal history

Keep a history of read and write latencies for a lockspace.
The times are measured for io in delta lease renewal
(each delta lease renewal includes one read and one write).

For each successful renewal, a record is saved that includes:
- the timestamp written in the delta lease by the renewal
- the time in milliseconds taken by the delta lease read
- the time in milliseconds taken by the delta lease write

Also counted and recorded are the number of io timeouts and
other io errors that occur between successful renewals.

Two consecutive successful renewals would be recorded as:

timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0
timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0

timestamp is the value written into the delta lease during
that renewal.

read_ms/write_ms are the milliseconds taken for the renewal
read/write ios.

next_timeouts are the number of io timeouts that occurred
after the renewal recorded on that line and before the next
successful renewal on the following line.

next_errors are the number of io errors (not timeouts) that
occurred after the renewal recorded on that line and before the
next successful renewal on the following line.

The command 'sanlock client renewal -s lockspace_name' reports
the full history of renewals saved by sanlock, which by default
is 180 records, about 1 hour of history when using a 20 second
renewal interval for a 10 second io timeout.

(A --summary option could be added to calculate and report
averages over a selected period of the history.)
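
Until then, a rough stand-in can be put together in the shell (a sketch that
assumes the single-space key=value format shown in the output above):

  sanlock client renewal -s 922b5269-ab56-4c4d-838f-49d33427e2ab | awk -F'[= ]' '
      { n++; reads += $4; writes += $6; timeouts += $8; errors += $10 }
      END { if (n) printf "renewals=%d avg_read_ms=%.1f avg_write_ms=%.1f timeouts=%d errors=%d\n",
                          n, reads/n, writes/n, timeouts, errors }'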




Re: [ovirt-users] packages that can be updated without maintening hosts

2017-01-24 Thread David Teigland
On Mon, Jan 23, 2017 at 09:50:38PM +0200, Nir Soffer wrote:

> >> The major issue is sanlock: if it is maintaining a lease on storage,
> >> updating sanlock will cause the host to reboot. If sanlock stops
> >> petting the host watchdog because it was killed during the upgrade,
> >> the watchdog will reboot the host.
> >
> > Is the sanlock RPM preventing an upgrade (in the pre-upgrade script) if it
> > has a lock?

I think that the rpm upgrade does not interact with the running daemon,
although there was a problem with that at one time.



Re: [ovirt-users] Sanlock add Lockspace Errors

2016-06-02 Thread David Teigland
On Thu, Jun 02, 2016 at 06:47:37PM +0300, Nir Soffer wrote:
> > This is a mess that's been caused by improper use of storage, and various
> > sanity checks in sanlock have all reported errors for "impossible"
> > conditions indicating that something catastrophic has been done to the
> > storage it's using.  Some fundamental rules are not being followed.
> 
> Thanks David.
> 
> Do you need more output from sanlock to understand this issue?

I can think of nothing more to learn from sanlock.  I'd suggest tighter,
higher level checking or control of storage.  Low level sanity checks
detecting lease corruption are not a convenient place to work from.



Re: [ovirt-users] Sanlock add Lockspace Errors

2016-06-02 Thread David Teigland
> verify_leader 2 wrong space name
> 4643f652-8014-4951-8a1a-02af41e67d08
> f757b127-a951-4fa9-bf90-81180c0702e6
> /dev/f757b127-a951-4fa9-bf90-81180c0702e6/ids

> leader1 delta_acquire_begin error -226 lockspace
> f757b127-a951-4fa9-bf90-81180c0702e6 host_id 2

VDSM has tried to join VG/lockspace/storage-domain "f757b127" on LV
/dev/f757b127-a951-4fa9-bf90-81180c0702e6/ids.  But sanlock finds that
lockspace "4643f652" is initialized on that storage, i.e. inconsistency
between the leases formatted on disk and what the leases are being used
for.  That should never happen unless sanlock and/or storage are
used/moved/copied wrongly.  The error is a sanlock sanity check to catch
misuse.
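
One way to see what is actually initialized on that LV is to dump the
on-disk leases (a sketch; the exact output format varies by sanlock version,
but the lockspace and resource names written on disk should be visible):

  sanlock direct dump /dev/f757b127-a951-4fa9-bf90-81180c0702e6/ids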


> s1527 check_other_lease invalid for host 0 0 ts 7566376 name  in
> 4643f652-8014-4951-8a1a-02af41e67d08

> s1527 check_other_lease leader 12212010 owner 1 11 ts 7566376
> sn f757b127-a951-4fa9-bf90-81180c0702e6 rn 
> f888524b-27aa-4724-8bae-051f9e950a21.vm1.intern

Apparently sanlock is already managing a lockspace called "4643f652" when
it finds another lease in that lockspace has the inconsistent/corrupt name
"f757b127".  I can't say what steps might have been done to lead to this.

This is a mess that's been caused by improper use of storage, and various
sanity checks in sanlock have all reported errors for "impossible"
conditions indicating that something catastrophic has been done to the
storage it's using.  Some fundamental rules are not being followed.



Re: [ovirt-users] open error -13 = sanlock

2016-03-01 Thread David Teigland
On Wed, Mar 02, 2016 at 12:15:17AM +0200, Nir Soffer wrote:
> 1. Stop engine,  so it will not try to start vdsm
> 2. Stop vdsm on all hosts, so they do not try to acquire a host id with
> sanlock
> This does not affect running vms
> 3. Fix the permissions on the ids file, via glusterfs mount
> 4. Initialize the ids files from one of the hosts, via the glusterfs mount
> This should fix the ids files on all replicas
> 5. Start vdsm on all hosts
> 6. Start engine
> 
> Engine will connect to all hosts, hosts will connect to storage and try to
> acquire a host id.
> Then Engine will start the SPM on one of the hosts, and your DC should
> become up.
> 
> David, Sahina, can you confirm that this procedure is safe?

Looks right.
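
For step 3, the usual oVirt ownership and mode look roughly like this (an
assumption, with hypothetical placeholders in <>; compare against a healthy
domain's ids file before changing anything):

  chown 36:36 <mountpoint>/<sd-uuid>/dom_md/ids    # vdsm:kvm
  chmod 0660  <mountpoint>/<sd-uuid>/dom_md/ids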



Re: [ovirt-users] Missing dom_md/ids file

2016-02-19 Thread David Teigland
On Fri, Feb 19, 2016 at 11:34:28PM +0200, Nir Soffer wrote:
> On Fri, Feb 19, 2016 at 10:58 PM, Cameron Christensen
>  wrote:
> > Hello,
> >
> > I am using glusterfs storage and ran into a split-brain issue. One of the
> > files affected by the split-brain was dom_md/ids. In an attempt to fix the
> > split-brain issue I deleted the dom_md/ids file. Is there a method to
> > recreate or reconstruct this file?
> 
> You can do this:
> 
> 1. Put the gluster domain to maintenance (via engine)
> 
> No host should access it while you reconstruct the ids file
> 
> 2. Mount the gluster volume manually
> 
> mkdir repair
> mount -t glusterfs :/ repair/
> 
> 3. Create the file:
> 
> touch repair//dom_md/ids
> 
> 4. Initialize the lockspace
> 
> sanlock direct init -s :0:repair//dom_md/ids:0
> 
> 5. Unmount the gluster volume
> 
> umount repair
> 
> 6. Activate the gluster domain (via engine)
> 
> The domain should become active after a while.
> 
> 
> David: can you confirm this is the best way to reconstruct the ids file?

Yes, that looks right.
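
Spelled out with hypothetical values for the parts elided above (the real
gluster server, volume and storage domain UUID will differ):

  SD=11111111-2222-3333-4444-555555555555            # hypothetical domain UUID
  mkdir repair
  mount -t glusterfs gluster1.example.com:/vmstore repair/
  touch repair/$SD/dom_md/ids
  sanlock direct init -s $SD:0:$PWD/repair/$SD/dom_md/ids:0
  umount repair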



Re: [Users] SELinux denials on Sanlock

2012-06-19 Thread David Teigland
On Tue, Jun 19, 2012 at 01:29:30PM -0400, Daniel J Walsh wrote:
> On 06/19/2012 12:13 PM, David Teigland wrote:
> > type=AVC msg=audit(1340053766.745:7): avc:  denied  { open } for
> > pid=1908 comm=wdmd name=wdmd.pid dev=dm-0 ino=1574530
> > scontext=system_u:system_r:wdmd_t:s0
> > tcontext=system_u:object_r:initrc_var_run_t:s0 tclass=file
> >
> > type=AVC msg=audit(1340053766.746:8): avc:  denied  { lock } for
> > pid=1908 comm=wdmd path=/var/run/wdmd/wdmd.pid dev=dm-0 ino=1574530
> > scontext=system_u:system_r:wdmd_t:s0
> > tcontext=system_u:object_r:initrc_var_run_t:s0 tclass=file
> >
> > type=SYSCALL msg=audit(1340053766.746:8): arch=x86_64 syscall=fcntl
> > success=yes exit=0 a0=4 a1=6 a2=7fffae656290 a3=fff3 items=0 ppid=1
> > pid=1908 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0
> > sgid=0 fsgid=0 tty=(none) ses=4294967295 comm=wdmd exe=/usr/sbin/wdmd
> > subj=system_u:system_r:wdmd_t:s0 key=(null)
>
> This is caused by a bug in sanlock; the init script for wdmd was creating
> the /var/run/wdmd file but not running restorecon on it.

Thanks, that was fixed back in Feb, so some old/wrong packages must be in
use.
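
On a host still running an old package, relabeling the pid file by hand
should clear the denials until the fixed init script is installed (a sketch):

  restorecon -v /var/run/wdmd/wdmd.pid   # resets the label from initrc_var_run_t to the policy default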
