On Thu, Jul 25, 2019 at 6:24 PM Gianluca Cecchi <[email protected]> wrote:
> On Thu, Jul 25, 2019 at 2:21 PM Eyal Shenitzky <[email protected]> wrote:
>
>> On Thu, Jul 25, 2019 at 3:02 PM Gianluca Cecchi <[email protected]> wrote:
>>
>>> On Thu, Jul 25, 2019 at 1:54 PM Eyal Shenitzky <[email protected]> wrote:
>>>
>>>> Please notice that automation Python scripts were created in order to
>>>> facilitate the DR process.
>>>> You can find them under - path/to/your/dr/folder/files.
>>>>
>>>> You can use those scripts to generate the mapping, test the generated
>>>> mapping, and start the failover/failback.
>>>>
>>>> I strongly recommend using them.
>>>>
>>> Yes, I have used them to create the disaster_recovery_vars.yml mapping
>>> file and then populated it with the secondary site information, thanks.
>>> My doubt was about any difference in playbook actions between "failover"
>>> (3.3) and "discreet failover test" (B.1), as the executed playbook and
>>> tags are the same.
>>>
>> No, the only difference is that you disable the storage replication
>> yourself; this way you can test the failover while the other "primary"
>> site is still active.
>>
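For reference, a minimal sketch of how both are invoked, assuming a wrapper
playbook (here called dr_play.yml, a placeholder name) that includes the
oVirt.disaster-recovery role together with disaster_recovery_vars.yml; the
real failover and the discreet failover test run the same play, selected
by tag:

  # generate/refresh the mapping file (the role also ships helper scripts for this)
  ansible-playbook dr_play.yml -t generate_mapping
  # used both for a real failover (3.3) and for the discreet failover test (B.1)
  ansible-playbook dr_play.yml -t fail_over
  # and eventually, back to the primary site
  ansible-playbook dr_play.yml -t fail_back
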
> First "discreet failover test" was a success!!! Great.
> Storage domain attached, templates imported, and the only VM defined at
> the source correctly started (at the source I configured link down for
> the VM, inherited at the target, so no collisions).
> The elapsed time between the beginning of the oVirt connection and the
> first template import was about 6 minutes.
> ...
> Template TOL76 has been successfully imported from the given
> configuration. 7/25/19 3:26:58 PM
> Storage Domain ovsd3910 was attached to Data Center SVIZ3-DR by
> admin@internal-authz 7/25/19 3:26:46 PM
> Storage Domains were attached to Data Center SVIZ3-DR by
> admin@internal-authz 7/25/19 3:26:46 PM
> Storage Domain ovsd3910 (Data Center SVIZ3-DR) was activated by
> admin@internal-authz 7/25/19 3:26:46 PM
> ...
> Storage Pool Manager runs on Host ovh201. (Address: ovh201.), Data Center
> SVIZ3-DR. 7/25/19 3:26:36 PM
> Data Center is being initialized, please wait for initialization to
> complete. 7/25/19 3:23:53 PM
> Storage Domain ovsd3910 was added by admin@internal-authz 7/25/19 3:20:43 PM
> Disk Profile ovsd3910 was successfully added (User: admin@internal-authz).
> 7/25/19 3:20:42 PM
> User admin@internal-authz connecting from '10.4.192.43' using session
> 'xxx' logged in. 7/25/19 3:20:35 PM
>
> Some notes:
>
> 1) iSCSI multipath
> My storage domains are iSCSI based and my hosts have two network cards to
> reach the storage.
> I'm using EQL, which doesn't support bonding and has one portal that all
> initiators use.
> So in my primary env I configured the "iSCSI Multipathing" tab in the
> Compute --> Datacenter --> Datacenter_Name window.
> But this tab appears only when you activate the storage.
> So during the Ansible playbook run, the iSCSI connection was activated
> through the "default" iSCSI interface.
> I can then:
> - configure "iSCSI Multipathing"
> - shut down the VM
> - put the host into maintenance
> - remove the default iSCSI session that has not been removed on the host:
>   iscsiadm -m session -r 6 -u
> - activate the host
> Now I have:
> [root@ov201 ~]# iscsiadm -m session
> tcp: [10] 10.10.100.8:3260,1 iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910 (non-flash)
> tcp: [9] 10.10.100.8:3260,1 iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910 (non-flash)
> [root@ov201 ~]#
> with
> # multipath -l
> 364817197c52fd899acbe051e0377dd5b dm-29 EQLOGIC ,100E-00
> size=1.0T features='1 queue_if_no_path' hwhandler='0' wp=rw
> `-+- policy='round-robin 0' prio=0 status=active
>   |- 23:0:0:0 sdb 8:16 active undef running
>   `- 24:0:0:0 sdc 8:32 active undef running
> - start the VM
>
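As a side note, the manual part of that cleanup can be done without
hard-coding the session id (the sid you pass to -r differs per host, so
list the sessions first):

  iscsiadm -m session -P 1          # show each session together with the iface it uses
  iscsiadm -m session -r <sid> -u   # log out only the leftover session on the "default" iface
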
> Then I do a cleanup:
> 1. Detach the storage domains from the secondary site.
> 2. Enable storage replication between the primary and secondary storage
> domains.
>
> The storage domain remains as "unattached" in the DR environment.
>
> Then I executed the test again, and during the connection I got this
> error about 40 seconds after the run of the playbook:
>
> TASK [oVirt.disaster-recovery : Import iSCSI storage domain] ***************************************
> An exception occurred during task execution. To see the full traceback,
> use -vvv. The error was: Error: Fault reason is "Operation Failed". Fault
> detail is "[]". HTTP response code is 400.
> failed: [localhost]
> (item=iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910)
> => {"ansible_loop_var": "dr_target", "changed": false, "dr_target":
> "iqn.2001-05.com.equallogic:4-771816-99d82fc59-5bdd77031e05beac-ovsd3910",
> "msg": "Fault reason is \"Operation Failed\". Fault detail is \"[]\". HTTP
> response code is 400."}
>
> In the webadmin GUI of the DR env I see:
>
> VDSM ov201 command CleanStorageDomainMetaDataVDS failed: Cannot obtain
> lock: "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire
> Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases',
> offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held
> by another host')" 7/25/19 4:50:43 PM
>
> What could be the cause of this?
>
> In vdsm.log:
>
> 2019-07-25 16:50:43,196+0200 INFO (jsonrpc/1) [vdsm.api] FINISH
> forcedDetachStorageDomain error=Cannot obtain lock:
> "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire
> Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases',
> offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held
> by another host')" from=::ffff:10.4.192.79,49038, flow_id=4bd330d1,
> task_id=c0dfac81-5c58-427d-a7d0-e8c695448d27 (api:52)
> 2019-07-25 16:50:43,196+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task]
> (Task='c0dfac81-5c58-427d-a7d0-e8c695448d27') Unexpected error (task:875)
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
>     return fn(*args, **kargs)
>   File "<string>", line 2, in forcedDetachStorageDomain
>   File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
>     ret = func(*args, **kwargs)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 856, in forcedDetachStorageDomain
>     self._detachStorageDomainFromOldPools(sdUUID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 834, in _detachStorageDomainFromOldPools
>     dom.acquireClusterLock(host_id)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 910, in acquireClusterLock
>     self._manifest.acquireDomainLock(hostID)
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/sd.py", line 527, in acquireDomainLock
>     self._domainLock.acquire(hostID, self.getDomainLease())
>   File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 419, in acquire
>     "Cannot acquire %s" % (lease,), str(e))
> AcquireLockFailure: Cannot obtain lock:
> "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire
> Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases',
> offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held
> by another host')"
> 2019-07-25 16:50:43,196+0200 INFO (jsonrpc/1) [storage.TaskManager.Task]
> (Task='c0dfac81-5c58-427d-a7d0-e8c695448d27') aborting: Task is aborted:
> 'Cannot obtain lock: "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243,
> out=Cannot acquire Lease(name=\'SDM\',
> path=\'/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases\', offset=1048576),
> err=(-243, \'Sanlock resource not acquired\', \'Lease is held by another
> host\')"' - code 651 (task:1181)
> 2019-07-25 16:50:43,197+0200 ERROR (jsonrpc/1) [storage.Dispatcher] FINISH
> forcedDetachStorageDomain error=Cannot obtain lock:
> "id=56eadc97-5731-40cf-8409-aff58d8ffd11, rc=-243, out=Cannot acquire
> Lease(name='SDM', path='/dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases',
> offset=1048576), err=(-243, 'Sanlock resource not acquired', 'Lease is held
> by another host')" (dispatcher:83)
> 2019-07-25 16:50:43,197+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC
> call StorageDomain.detach failed (error 651) in 24.12 seconds (__init__:312)
> 2019-07-25 16:50:44,180+0200 INFO (jsonrpc/6) [api.host] START getStats()
> from=::ffff:10.4.192.79,49038 (api:48)
> 2019-07-25 16:50:44,222+0200 INFO (jsonrpc/6) [vdsm.api] START
> repoStats(domains=()) from=::ffff:10.4.192.79,49038,
> task_id=8a7a0302-4ee3-49a8-a3f7-f9636a123765 (api:48)
> 2019-07-25 16:50:44,222+0200 INFO (jsonrpc/6) [vdsm.api] FINISH repoStats
> return={} from=::ffff:10.4.192.79,49038,
> task_id=8a7a0302-4ee3-49a8-a3f7-f9636a123765 (api:54)
> 2019-07-25 16:50:44,223+0200 INFO (jsonrpc/6) [vdsm.api] START
> multipath_health() from=::ffff:10.4.192.79,49038,
> task_id=fb09923c-0888-4c3f-9b8a-a7750592da22 (api:48)
> 2019-07-25 16:50:44,223+0200 INFO (jsonrpc/6) [vdsm.api] FINISH
> multipath_health return={} from=::ffff:10.4.192.79,49038,
> task_id=fb09923c-0888-4c3f-9b8a-a7750592da22 (api:54)
>
> After putting the host into maintenance + a reboot of the host and a
> re-run of the playbook, all went well again.
>
This error is related to Sanlock; it looks like you needed to wait longer
for the leases to expire.
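If it reproduces, you can check from the host which host still holds the
SDM lease before forcing anything (a sketch; it assumes the leases LV of
the domain is active on that host, and the path is the one from the log
above):

  sanlock client status   # lockspaces and resources currently held by this host
  # dump the lease table of the domain; the owner host_id of the SDM lease shows who holds it
  sanlock direct dump /dev/56eadc97-5731-40cf-8409-aff58d8ffd11/leases
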
> 2) MAC address pools
> I noticed that the imported VM has preserved the link state (down in my
> case) but not the MAC address.
> The MAC address is the one defined in my target engine, which is
> different from the source engine's to avoid overlap of MAC addresses.
> Is this an option I can customize?
>
You can customize the MAC of the VM in: Compute -> Virtual Machines ->
Network Interfaces -> create a new one with a custom MAC.

In the DR role, in defaults/main.yaml, you can set the following variable
to false:

# Indicate whether to reset a mac pool of a VM on register.
dr_reset_mac_pool: "True"
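If you'd rather not edit the role's defaults, the same variable can also
be overridden per run, e.g. (dr_play.yml again being a placeholder for
your wrapper playbook):

  ansible-playbook dr_play.yml -t fail_over -e dr_reset_mac_pool=False
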
> In general a VM could have problems when changing its MAC address...
>
> 3) Clean up dest DC after "discreet failover test"
> The guide says:
> 1. Detach the storage domains from the secondary site.
> 2. Enable storage replication between the primary and secondary storage
> domains.
>
> Is it better to also restart the DR hosts?
>
No need for that.

> 4) VM consistency
> Can we say that all the imported VMs will be "crash consistent"?
>
I am not sure what you mean by "crash consistent"; does it mean "highly
available" in oVirt language?

> Thanks,
> Gianluca

--
Regards,
Eyal Shenitzky

_______________________________________________
Users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct:
https://www.ovirt.org/community/about/community-guidelines/
List Archives:
https://lists.ovirt.org/archives/list/[email protected]/message/5SDDECNRDDYKCE7HYMGPC4H37JVE4MZY/

