I encountered the same problem a few months ago. With help from this list, I fixed it without any data loss and posted my solution back to the list. If you search for the subject line “corrupt DB after VM live migration with storage migration”, you should find my posts.
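
For reference, the shape of the DB correction being discussed (ilya's note about the volumes table and point D in the quoted thread below) is roughly the following. This is only a sketch, not a tested procedure: the vdi-uuid is the post-migration value from xe vbd-list, the destination pool_id of 19 is a made-up placeholder that you must look up in your own storage_pool table, and you should back up the cloud database and confirm the size column still matches the VDI on the new SR before touching anything.

-- 1) Find the id of the destination primary storage (the LUN14 pool):
SELECT id, name, uuid, pool_type FROM storage_pool WHERE removed IS NULL;

-- 2) Point the volume row at the VDI now on the new SR and at the new pool
--    (path = new vdi-uuid from xe vbd-list; 19 is a placeholder pool id):
UPDATE volumes
SET path = 'cc1f8e83-f224-44b7-9359-282a1c1e3db1',
    pool_id = 19,
    last_pool_id = 18
WHERE id = 1004;

-- 3) Sanity check before rebooting the VM:
SELECT id, name, path, pool_id, last_pool_id, size, state FROM volumes WHERE id = 1004;

The parts that matter are that path matches the vdi-uuid on the new SR and that pool_id matches the destination pool's entry in storage_pool; last_pool_id is just bookkeeping.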
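On the timeout side, the parameters Makrand lists below, plus the two ilya mentions (copy.volume.wait and vm.job.timeout), are ordinary global settings, so you can at least inspect them straight from the configuration table if that is easier than the Global Settings page. Again, just a sketch: check the description column for each setting's unit before changing anything, and note that values edited directly in the DB generally only take effect after a management-server restart.

SELECT name, value, description
FROM configuration
WHERE name IN ('wait', 'migratewait', 'storage.pool.max.waitseconds',
               'vm.op.cancel.interval', 'vm.op.cleanup.wait',
               'copy.volume.wait', 'vm.job.timeout');

-- Example: raise copy.volume.wait (in seconds); 36000 mirrors the values
-- Makrand used. vm.job.timeout is deliberately left alone here because its
-- unit may differ (milliseconds rather than seconds, if memory serves), so
-- check its description first.
UPDATE configuration SET value = '36000' WHERE name = 'copy.volume.wait';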
Good luck,
Yiping

On 8/9/16, 3:30 AM, "Makrand" <[email protected]> wrote:

Ilya,

The point to note is that my job didn't fail because of the timeout, but rather because of a VDI error on the XenServer side, with the exception below:

[SR_BACKEND_FAILURE_80, , Failed to mark VDI hidden [opterr=SR 96e879bf-93aa-47ca-e2d5-e595afbab294: error aborting existing process]]

I am still digging into this error in SMlog etc. on the XenServer. But in reality the volume was migrated, and I think that's important.

I did, of course, hit the timeout error during initial testing, and after some trial and error I realised there is a not-so-aptly-named parameter called *wait* (default value 1800) that also needs to be modified to make the timeout error go away. So, all in all, I modified the parameters as below:

migratewait: 36000
storage.pool.max.waitseconds: 36000
vm.op.cancel.interval: 36000
vm.op.cleanup.wait: 36000
wait: 18000

--
Best,
Makrand

On Tue, Aug 9, 2016 at 6:07 AM, ilya <[email protected]> wrote:
> This happened to us on a non-XEN hypervisor as well.
>
> CloudStack has a timeout for long-running jobs, which I assume has been
> exceeded in your case.
>
> Changing the volumes table should be enough, referencing the proper pool_id.
> Just make sure the data size matches on both ends.
>
> Consider changing "copy.volume.wait" and, if that does not help, also
> "vm.job.timeout".
>
> Regards
> ilya
>
> On 8/8/16 3:54 AM, Makrand wrote:
> > Guys,
> >
> > My setup: ACS 4.4.2, hypervisor XenServer 6.2.
> >
> > I tried moving a volume of a running VM from primary storage A to
> > primary storage B (using the CloudStack GUI). Please note that the
> > primary storage A LUN (LUN7) comes from one storage box and the
> > primary storage B LUN (LUN14) comes from another.
> >
> > For VM1 with a 250 GB data volume (51 GB used space), I was able to
> > move the volume without any glitch in about 26 minutes.
> >
> > But for VM2 with a 250 GB data volume (182 GB used space), the
> > migration ran for about ~110 minutes and then failed at the very end
> > with the following exception:
> >
> > 2016-08-06 14:30:57,481 WARN [c.c.h.x.r.CitrixResourceBase]
> > (DirectAgent-192:ctx-5716ad6d) Task failed! Task record:
> >   uuid: 308a8326-2622-e4c5-2019-3beb87b0d183
> >   nameLabel: Async.VDI.pool_migrate
> >   nameDescription:
> >   allowedOperations: []
> >   currentOperations: {}
> >   created: Sat Aug 06 12:36:27 UTC 2016
> >   finished: Sat Aug 06 14:30:32 UTC 2016
> >   status: failure
> >   residentOn: com.xensource.xenapi.Host@f242d3ca
> >   progress: 1.0
> >   type: <none/>
> >   result:
> >   errorInfo: [SR_BACKEND_FAILURE_80, , Failed to mark VDI hidden
> >   [opterr=SR 96e879bf-93aa-47ca-e2d5-e595afbab294: error aborting
> >   existing process]]
> >   otherConfig: {}
> >   subtaskOf: com.xensource.xenapi.Task@aaf13f6f
> >   subtasks: []
> >
> > So CloudStack just removed the job and marked it failed, according to
> > the management server log.
> >
> > A) But when I check at the hypervisor level, the volume is on the new
> > SR, i.e. on LUN14. Strange, huh?
> > So now the new uuid for this volume from the xe CLI is:
> >
> > [root@gcx-bom-compute1 ~]# xe vbd-list vm-uuid=3fcb3070-e373-3cf9-d0aa-0a657142a38d
> > uuid ( RO)          : f15dc54a-3868-8de8-5427-314e341879c6
> >       vm-uuid ( RO): 3fcb3070-e373-3cf9-d0aa-0a657142a38d
> > vm-name-label ( RO): i-22-803-VM
> >      vdi-uuid ( RO): cc1f8e83-f224-44b7-9359-282a1c1e3db1
> >         empty ( RO): false
> >        device ( RO): hdb
> >
> > B) Luckily I had captured the same entry before the migration, and it
> > looked like this:
> >
> > uuid ( RO)          : f15dc54a-3868-8de8-5427-314e341879c6
> >       vm-uuid ( RO): 3fcb3070-e373-3cf9-d0aa-0a657142a38d
> > vm-name-label ( RO): i-22-803-VM
> >      vdi-uuid ( RO): 7c073522-a077-41a0-b9a7-7b61847d413b
> >         empty ( RO): false
> >        device ( RO): hdb
> >
> > C) Since the job failed on the CloudStack side, the DB is still holding
> > the old values. Here is the current volumes table entry in the DB:
> >
> > id: 1004
> > account_id: 22
> > domain_id: 15
> > pool_id: 18
> > last_pool_id: NULL
> > instance_id: 803
> > device_id: 1
> > name: cloudx_globalcloudxchange_com_W2797T2808S3112_V1462960751
> > uuid: a8f01042-d0de-4496-98fa-a0b13648bef7
> > size: 268435456000
> > folder: NULL
> > path: 7c073522-a077-41a0-b9a7-7b61847d413b
> > pod_id: NULL
> > data_center_id: 2
> > iscsi_name: NULL
> > host_ip: NULL
> > volume_type: DATADISK
> > pool_type: NULL
> > disk_offering_id: 6
> > template_id: NULL
> > first_snapshot_backup_uuid: NULL
> > recreatable: 0
> > created: 2016-05-11 09:59:12
> > attached: 2016-05-11 09:59:21
> > updated: 2016-08-06 14:30:57
> > removed: NULL
> > state: Ready
> > chain_info: NULL
> > update_count: 42
> > disk_type: NULL
> > vm_snapshot_chain_size: NULL
> > iso_id: NULL
> > display_volume: 1
> > format: VHD
> > min_iops: NULL
> > max_iops: NULL
> > hv_ss_reserve: 0
> > 1 row in set (0.00 sec)
> >
> > So the path column still shows 7c073522-a077-41a0-b9a7-7b61847d413b and
> > the pool_id is 18.
> >
> > The VM is running as of now, but I am sure that the moment I reboot it,
> > this volume will be gone, or worse, the VM won't boot. This is a
> > production VM, BTW.
> >
> > D) So I think I need to edit the volumes table's path and pool_id
> > columns, put the new values in place, and then reboot the VM. Do I need
> > to make any other changes in the DB, in other tables, for the same? Any
> > comment/help is much appreciated.
> >
> > --
> > Best,
> > Makrand
