Re: KVM HA is broken, let's fix it

2015-10-12 Thread Frank Louwers

> On 10 Oct 2015, at 12:35, Remi Bergsma  wrote:
> 
> Can you please explain what the issue is with KVM HA? In my tests, HA starts 
> all VMs just fine without the hypervisor coming back. At least that is on 
> current 4.6. Assuming a cluster of multiple nodes of course. It will then do 
> a neighbor check from another host in the same cluster. 
> 
> Also, malfunctioning NFS leads to corruption and therefore we fence a box 
> when the shared storage is unreliable. Combining primary and secondary NFS is 
> not a good idea for production in my opinion. 

Well, it depends how you look at it, and what your situation is.

If you use 1 NFS export as primary storage (and only NFS), then yes, the 
system works as one would expect, and doesn’t need to be fixed.

However, HA is “not functioning” in any of these scenarios:

- you don’t use NFS as your only primary storage
- you use more than one NFS primary storage

Even worse: imagine you only use local storage as primary storage, but have 1 
NFS configured (as the UI “wizard” forces you to configure one). You don’t have 
any active VM configured on that NFS primary storage. You then perform 
maintenance on the NFS storage, and take it offline…

All your hosts will then reboot, resulting in major downtime that’s completely 
unnecessary. There’s not even an option to disable this at this point… We’ve 
removed the reboot instructions from the HA script on all our instances…
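
A minimal sketch of the behaviour argued for above, in Python. This is not 
CloudStack's HA script: StoragePool, should_fence and the field names are 
hypothetical, and the only point is that a host would reboot only when an 
unreachable pool actually backs VMs running on it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class StoragePool:
    # Hypothetical record; not a CloudStack type.
    name: str
    reachable: bool   # did the host's last heartbeat to this pool succeed?
    active_vms: int   # VMs on this host with disks on this pool


def should_fence(pools: Iterable[StoragePool]) -> bool:
    """Fence only if an unreachable pool backs at least one running VM."""
    return any(not p.reachable and p.active_vms > 0 for p in pools)


# Local-storage host with an unused NFS pool taken offline for maintenance.
pools = [
    StoragePool("local", reachable=True, active_vms=12),
    StoragePool("nfs-unused", reachable=False, active_vms=0),
]
print(should_fence(pools))  # False
```

With a check like this, the maintenance scenario above would not reboot any 
host, while a host with running VMs on a failed NFS pool would still be fenced.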

Regards,

Frank

getting error when adding Ceph RBD storage to cloudstack 4.4.2

2015-10-12 Thread Shetty, Pradeep
Hello,

I am getting an error when adding Ceph RBD storage in CloudStack. (Please find 
the management server log pasted below.) I created an advanced zone with the 
local storage option enabled, using the KVM hypervisor.

System VMs are running on local storage. I am also able to add NFS primary 
storage.

Regards,
Pradeep




2015-10-12 12:08:31,792 DEBUG [c.c.a.m.AgentManagerImpl] 
(catalina-exec-22:ctx-89967058 ctx-bcb1cfce) Details from executing class 
com.cloud.agent.api.ModifyStoragePoolCommand: 
com.cloud.utils.exception.CloudRuntimeException: Failed to create storage pool: 
13f02ef1-3e6d-31e7-8858-ed1f5bfdd514
at 
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:524)
at 
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:277)
at 
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:271)
at 
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:2813)
at 
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1325)
at com.cloud.agent.Agent.processRequest(Agent.java:501)
at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:808)
at com.cloud.utils.nio.Task.run(Task.java:84)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

2015-10-12 12:08:31,795 WARN  [o.a.c.alerts] (catalina-exec-22:ctx-89967058 
ctx-bcb1cfce)  alertType:: 7 // dataCenterId:: 1 // podId:: 1 // clusterId:: 
null // message:: Unable to attach storage pool12 to the host1
2015-10-12 12:08:31,840 WARN  
[o.a.c.s.d.l.CloudStackPrimaryDataStoreLifeCycleImpl] 
(catalina-exec-22:ctx-89967058 ctx-bcb1cfce) Unable to establish a connection 
between Host[-1-Routing] and 
org.apache.cloudstack.storage.datastore.PrimaryDataStoreImpl@491f978a
com.cloud.utils.exception.CloudRuntimeException: Unable establish connection 
from storage head to storage pool 12 due to 
com.cloud.utils.exception.CloudRuntimeException: Failed to create storage pool: 
13f02ef1-3e6d-31e7-8858-ed1f5bfdd514
at 
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.createStoragePool(LibvirtStorageAdaptor.java:524)
at 
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:277)
at 
com.cloud.hypervisor.kvm.storage.KVMStoragePoolManager.createStoragePool(KVMStoragePoolManager.java:271)
at 
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:2813)
at 
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1325)
at com.cloud.agent.Agent.processRequest(Agent.java:501)
at com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:808)
at com.cloud.utils.nio.Task.run(Task.java:84)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
12
at 
org.apache.cloudstack.storage.datastore.provider.DefaultHostListener.hostConnect(DefaultHostListener.java:67)
at 
com.cloud.storage.StorageManagerImpl.connectHostToSharedPool(StorageManagerImpl.java:866)
at 
org.apache.cloudstack.storage.datastore.lifecycle.CloudStackPrimaryDataStoreLifeCycleImpl.attachCluster(CloudStackPrimaryDataStoreLifeCycleImpl.java:417)
at 
com.cloud.storage.StorageManagerImpl.createPool(StorageManagerImpl.java:669)
at 
com.cloud.storage.StorageManagerImpl.createPool(StorageManagerImpl.java:178)
at 
org.apache.cloudstack.api.command.admin.storage.CreateStoragePoolCmd.execute(CreateStoragePoolCmd.java:163)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:141)
at com.cloud.api.ApiServer.queueCommand(ApiServer.java:682)
at com.cloud.api.ApiServer.handleRequest(ApiServer.java:511)
at com.cloud.api.ApiServlet.processRequestInContext(ApiServlet.java:330)
at com.cloud.api.ApiServlet.access$000(ApiServlet.java:54)
at com.cloud.api.ApiServlet$1.run(ApiServlet.java:118)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at 

CCC EU

2015-10-12 Thread Sebastien Goasguen
Hi folks,

CCC Dublin 2015 is over and we had a blast.

Thanks to our sponsors who helped make this happen:
Citrix, Cloud Foundry foundation, Nuage Networks, Shapeblue, Cloudian, 
Solidfire, ikoula, LPI-Japan, PC Extreme and Cloud Ops.

A few takeaways:

1-We had 150 people at the event, which is clearly lower than previous events. 
Mesos Conference EU, which was just upstairs from us, also brought in 150 
people, and the entire Linuxcon 1500. Getting 10% of the whole of Linuxcon is 
quite good when you think of the breadth of that community.

2-For 2016, we announced CCC Latin America in Sao Paulo, Brazil. The community 
is booming over there and we are all excited to plan that event. Assuming we 
all find the budget, we should all meet up there.

3-LPI-Japan announced a CloudStack certification exam in English. We will 
distribute a code on the mailing lists. If you take the exam with that code, 
our project will get 33% of the proceeds. LPI-Japan and Shapeblue teamed up to 
make this happen.

4-Pierre-Luc Dion and Erik Weber won the Docker challenge. PL gave a 
presentation about his solution and Erik put it on GitHub: 
https://github.com/terbolous/cloudstack-docker-compose. Packaging CloudStack as 
containers is an interesting idea going forward.

5-We had several discussions around our new release process, testing and PR 
reviews. Expect a separate email summarizing these discussions and a proposal 
to move forward soon.

I expect all videos to be online in the coming weeks. If you gave a talk you 
can upload your slides through the linuxcon portal or put them on slideshare 
and tweet about it.

Congratulations to all, let’s keep on making CloudStack better,

Cheers,

-Sebastien

Re: Recurring snapshots are not auto cleaned

2015-10-12 Thread Vadim Kimlaychuk

 Andrei,

 Open bug at https://issues.apache.org/jira/browse/cloudstack.

 Regards,

 Vadim.

 On 2015-10-12 00:53, Andrei Mikhailovsky wrote:


Hi guys,

I was wondering if you've seen the same behaviour as I am currently 
experiencing? I've set a recurring volume snapshot to take place every 
night and set it to keep 7 snapshots. The snapshots are being taken, 
but ACS does not remove them. So far, I've got around 17 snapshots 
since the beginning of this schedule. I've tried to manually remove the 
old snaps to free up some space and that worked. So no issues with the 
actual delete function.
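
For reference, the retention behaviour expected here can be sketched in a few 
lines of Python. This only illustrates "keep the newest 7"; it is not how ACS 
implements cleanup, and snapshots_to_delete and the dates are made up for the 
example.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(created_at, keep):
    """Return the timestamps beyond the retention count, oldest first."""
    ordered = sorted(created_at)
    excess = len(ordered) - keep
    return ordered[:excess] if excess > 0 else []


# 17 nightly snapshots with a retention of 7 (start date is hypothetical).
start = datetime(2015, 9, 25)
nightly = [start + timedelta(days=i) for i in range(17)]
stale = snapshots_to_delete(nightly, keep=7)
print(len(stale))  # 10
```

So with 17 snapshots on a keep-7 schedule, the 10 oldest should already have 
been removed.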


I also have several volumes with a weekly recurring snapshot and the 
removal of the old snapshots works perfectly well. The problem seems to 
be with daily snapshots.


Any idea how to fix this? I am running ACS 4.5.1 with KVM hosts and rbd 
storage + nfs for secondary.


Cheers

Andrei


Timeout with live migration

2015-10-12 Thread Ryan Farrington
We are experiencing a failure in CloudStack while waiting for an async job 
performing a live migration of a volume to finish. I've copied the relevant log 
entries below. We acknowledge that the migration will take a few hours based on 
the volume of the data, and we are looking for a way to increase the timeout of 
7200 seconds to something we know we can work with.


2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint] 
(Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due to 
Agent:27, com.cloud.exception.OperationTimedoutException: Commands 835325398 to 
Host 27 timed out after 7200




Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
Here is the full log, including the stack for the exception, that we get at the 
2 hour mark. As for migratewait, it is set to 36000, which should be 10 
hours. 

2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache] 
(DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache] 
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting some more 
time because this is the current command
2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode] 
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception: 
com.cloud.exception.OperationTimedoutException in error code list for exceptions
2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache] 
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out on Seq 
38-996939857:  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7), Ver: v1, 
Flags: 100111, 
[{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
 }
2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] 
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling.
2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache] 
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more commands 
found
2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint] 
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due to 
Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to 
Host 38 timed out after 7200
2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy] 
(Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
com.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to 
Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to 
Host 38 timed out after 7200
at 
org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
at 
org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
at 
org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
at 
org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
at 
org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
at 
com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
at 
com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
at 
com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)
at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at 
org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
at 
org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:91)
at 
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at 
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
at com.sun.proxy.$Proxy196.migrateVolume(Unknown Source)
at 
org.apache.cloudstack.api.command.user.volume.MigrateVolumeCmd.execute(MigrateVolumeCmd.java:103)
at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:161)
at 
com.cloud.api.ApiAsyncJobDispatcher.runJobInContext(ApiAsyncJobDispatcher.java:109)
at 
com.cloud.api.ApiAsyncJobDispatcher$1.run(ApiAsyncJobDispatcher.java:66)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
at 
org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
at 
com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:63)
at 
org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:509)
at 
org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
at 
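
One detail in the log above is the "wait":0 field of the logged 
MigrateVolumeCommand. A small Python sketch for pulling that field out of the 
JSON payload follows. The payload is copied from the "Timed out on Seq 
38-996939857" entry; command_waits is a hypothetical helper name. Whether a 
wait of 0 is what leads to the 7200 second limit is not established in this 
thread, so treat that as an open question rather than a finding.

```python
import json

# Command payload copied from the "Timed out on Seq 38-996939857" log entry.
PAYLOAD = (
    '[{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,'
    '"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958",'
    '"pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost",'
    '"path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},'
    '"attachedVmName":"i-34-311-VM","wait":0}}]'
)


def command_waits(payload):
    """Map each command's short class name to its logged "wait" value."""
    waits = {}
    for entry in json.loads(payload):
        for class_name, body in entry.items():
            waits[class_name.rsplit(".", 1)[-1]] = body.get("wait")
    return waits


print(command_waits(PAYLOAD))  # {'MigrateVolumeCommand': 0}
```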

Re: Timeout with live migration

2015-10-12 Thread Rafael Weingärtner
I would first check your NICs' speed and load, the amount of RAM allocated
for the migrating VM and then check the hypervisor log files.

On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård 
wrote:

> What version are you running? Check if the copy.volume.wait setting is set
> to 7200 and increase it. If not you could also check
> job.cancel.threshold.minutes and job.expire.minutes.
>
> -Jan-Arve
>
> 2015-10-13 0:46 GMT+02:00 Ryan Farrington :
>
> > We are experiencing a failure in cloudstack waiting for an async job
> > performing a live migration of a volume to finish. I've copied the
> relevant
> > log entries below.We acknowledge that the migration will take a few hours
> > based on the volume of the data and we are looking for a way to increase
> > the timeout of 7200 seconds into something we know we can work with.
> >
> >
> > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due
> to
> > Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> > 835325398 to Host 27 timed out after 7200
> >
> >
> >
>



-- 
Rafael Weingärtner


Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
We are currently on version 4.3.0. The hypervisor is XenServer. None of the 
settings are set to 7200 seconds (or any variation that would yield 7200 
seconds), but I have provided them below as a reference. Is there any other 
place where 7200 might be hard-coded? We are planning an upgrade to 4.5.2 
next month, but this migration needs to happen. We have become pretty 
proficient at the post-volume-migration cleanup by manually mucking with the 
database, but it is annoying and I would much rather have CloudStack just wait 
like I told it to.

copy.volume.wait = 10800 (3 hours)
job.cancel.threshold.minutes = 60 (1 hour)
job.expire.minutes = 1440 (24 hours)
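
As a quick cross-check, the values reported in this thread can be normalised to 
seconds and compared with the 7200 second timeout from the log. This is only 
arithmetic on the numbers quoted here (including migratewait = 36000 from the 
earlier message); it does not query CloudStack, and the units are taken from 
the annotations above.

```python
TARGET_SECONDS = 7200

# Values as reported in this thread, converted to seconds.
settings_seconds = {
    "copy.volume.wait": 10800,                # 3 hours, already in seconds
    "job.cancel.threshold.minutes": 60 * 60,  # 60 minutes
    "job.expire.minutes": 1440 * 60,          # 1440 minutes
    "migratewait": 36000,                     # 10 hours, already in seconds
}

matches = [name for name, secs in settings_seconds.items()
           if secs == TARGET_SECONDS]
print(matches)  # []
```

None of the four resolves to 7200 seconds, which is consistent with the 
suspicion that the limit comes from somewhere else.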





From: Jan-Arve Nygård [jan.arve.nyg...@gmail.com]
Sent: Monday, October 12, 2015 6:19 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

What version are you running? Check if the copy.volume.wait setting is set
to 7200 and increase it. If not you could also check
job.cancel.threshold.minutes and job.expire.minutes.

-Jan-Arve

2015-10-13 0:46 GMT+02:00 Ryan Farrington :

> We are experiencing a failure in cloudstack waiting for an async job
> performing a live migration of a volume to finish. I've copied the relevant
> log entries below.We acknowledge that the migration will take a few hours
> based on the volume of the data and we are looking for a way to increase
> the timeout of 7200 seconds into something we know we can work with.
>
>
> 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due to
> Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> 835325398 to Host 27 timed out after 7200
>
>
>


Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
The slow transfer is related to the storage we are trying to migrate off of.  
We are capable of getting about 350mbps off the disks but when we are moving 
volumes that are greater than about 500GB we end up racing the clock and hoping 
that the migration finishes before the job times out.   It would be awesome to 
be able to manage that timeout and I know there are a ton of settings I just 
don't know about and am hoping someone might be able to point me in the right 
direction.  
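
A back-of-the-envelope check of the numbers above, in Python. Two assumptions 
that the message does not settle: "350mbps" is read as 350 megabits per second, 
and GB is decimal (10^9 bytes). If the figure was meant as megabytes per 
second, the times below shrink by a factor of 8.

```python
def transfer_seconds(size_bytes, rate_bits_per_second):
    """Time to move size_bytes at a sustained rate given in bits per second."""
    return size_bytes * 8 / rate_bits_per_second


RATE = 350e6     # assumption: 350 megabits per second
VOLUME = 500e9   # assumption: 500 GB, decimal
TIMEOUT = 7200   # seconds, from the OperationTimedoutException

needed = transfer_seconds(VOLUME, RATE)
max_bytes = RATE / 8 * TIMEOUT

print(round(needed))    # 11429 seconds, a little over 3 hours
print(max_bytes / 1e9)  # 315.0 GB that fit inside the 7200 s window
```

Under these assumptions a 500 GB volume needs roughly 3.2 hours, so it cannot 
finish inside a 2 hour limit at that rate.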



From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 6:40 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

I would first check your NICs' speed and load, the amount of RAM allocated
for the migrating VM and then check the hypervisor log files.

On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård 
wrote:

> What version are you running? Check if the copy.volume.wait setting is set
> to 7200 and increase it. If not you could also check
> job.cancel.threshold.minutes and job.expire.minutes.
>
> -Jan-Arve
>
> 2015-10-13 0:46 GMT+02:00 Ryan Farrington :
>
> > We are experiencing a failure in cloudstack waiting for an async job
> > performing a live migration of a volume to finish. I've copied the
> relevant
> > log entries below.We acknowledge that the migration will take a few hours
> > based on the volume of the data and we are looking for a way to increase
> > the timeout of 7200 seconds into something we know we can work with.
> >
> >
> > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due
> to
> > Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> > 835325398 to Host 27 timed out after 7200
> >
> >
> >
>



--
Rafael Weingärtner


Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
Yes, the parameter was set long ago and the management server has been restarted 
numerous times over the past few days as we played with other parameters to no 
effect.  

After looking at the log a little more does the "Failed to send command, due to 
Agent:38, com.cloud.exception.OperationTimedoutException: Commands 996939857 to 
Host 38 timed out after 7200" mean that the migration start command is being 
sent in some kind of synchronous mode and not returning control back to the job 
manager?  





From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 8:46 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

I thought you were using the “migrateVirtualMachineWithVolume” command, but it
seems that you are using the “migrateVolume” command from ACS's API.


In the code I debugged for “migrateVirtualMachineWithVolume”, a parameter
value of 3600 means a timeout of 1 hour.

“migrateVolume” is the same: they both end up in
“com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
and in that method the parameter is the same.


If your parameter is set to 36000 (10 hours) I do not see why you are
getting the exception after 2 hours.

Did you restart the management servers after you changed the parameter?

On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington  wrote:

> Here is the full log, including the stack for the exception, that we get
> at the 2 hour mark. as for the migratewait it is set to 36000 which should
> be 10 hours.
>
> 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
> (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
> 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting some
> more time because this is the current command
> 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
> com.cloud.exception.OperationTimedoutException in error code list for
> exceptions
> 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out on
> Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7),
> Ver: v1, Flags: 100111,
> [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
> }
> 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling.
> 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
> commands found
> 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due to
> Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> 996939857 to Host 38 timed out after 7200
> 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
> (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
> com.cloud.utils.exception.CloudRuntimeException: Failed to send command,
> due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> 996939857 to Host 38 timed out after 7200
> at
> org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
> at
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
> at
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
> at
> org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
> at
> org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
> at
> com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
> at
> com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
> at
> com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)
> at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:622)
> at
> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
> at
> 

Re: Timeout with live migration

2015-10-12 Thread Rafael Weingärtner
Are you live migrating a VM, or migrating a volume of a stopped VM to a
different primary storage?

If it is a running VM, is the VM allocated in a shared storage or local
storage?

On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington 
wrote:

> The slow transfer is related to the storage we are trying to migrate off
> of.  We are capable of getting about 350mbps off the disks but when we are
> moving volumes that are greater than about 500GB we end up racing the clock
> and hoping that the migration finishes before the job times out.   It would
> be awesome to be able to manage that timeout and I know there are a ton of
> settings I just don't know about and am hoping someone might be able to
> point me in the right direction.
>
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 6:40 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> I would first check your NICs' speed and load, the amount of RAM allocated
> for the migrating VM and than check the hypervisor log files.
>
> On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
> jan.arve.nyg...@gmail.com>
> wrote:
>
> > What version are you running? Check if the copy.volume.wait setting is
> set
> > to 7200 and increase it. If not you could also check
> > job.cancel.threshold.minutes and job.expire.minutes.
> >
> > -Jan-Arve
> >
> > 2015-10-13 0:46 GMT+02:00 Ryan Farrington :
> >
> > > We are experiencing a failure in cloudstack waiting for an async job
> > > performing a live migration of a volume to finish. I've copied the
> > relevant
> > > log entries below.We acknowledge that the migration will take a few
> hours
> > > based on the volume of the data and we are looking for a way to
> increase
> > > the timeout of 7200 seconds into something we know we can work with.
> > >
> > >
> > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due
> > to
> > > Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> > > 835325398 to Host 27 timed out after 7200
> > >
> > >
> > >
> >
>
>
>
> --
> Rafael Weingärtner
>



-- 
Rafael Weingärtner


Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
Live migrating a data volume. We are purely on shared storage so no local 
storage is involved.  


From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 7:37 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

Are you live migrating a VM, or migrating a volume of a stopped VM to a
different primary storage?

If it is a running VM, is the VM allocated in a shared storage or local
storage?

On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington 
wrote:

> The slow transfer is related to the storage we are trying to migrate off
> of.  We are capable of getting about 350mbps off the disks but when we are
> moving volumes that are greater than about 500GB we end up racing the clock
> and hoping that the migration finishes before the job times out.   It would
> be awesome to be able to manage that timeout and I know there are a ton of
> settings I just don't know about and am hoping someone might be able to
> point me in the right direction.
>
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 6:40 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> I would first check your NICs' speed and load, the amount of RAM allocated
> for the migrating VM and than check the hypervisor log files.
>
> On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
> jan.arve.nyg...@gmail.com>
> wrote:
>
> > What version are you running? Check if the copy.volume.wait setting is
> set
> > to 7200 and increase it. If not you could also check
> > job.cancel.threshold.minutes and job.expire.minutes.
> >
> > -Jan-Arve
> >
> > 2015-10-13 0:46 GMT+02:00 Ryan Farrington :
> >
> > > We are experiencing a failure in cloudstack waiting for an async job
> > > performing a live migration of a volume to finish. I've copied the
> > relevant
> > > log entries below.We acknowledge that the migration will take a few
> hours
> > > based on the volume of the data and we are looking for a way to
> increase
> > > the timeout of 7200 seconds into something we know we can work with.
> > >
> > >
> > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due
> > to
> > > Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> > > 835325398 to Host 27 timed out after 7200
> > >
> > >
> > >
> >
>
>
>
> --
> Rafael Weingärtner
>



--
Rafael Weingärtner


RE: [Questionable] Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
Hypervisor:  XenServer

We are moving a data volume from one storage onto another without shutting down 
the VM, because that would just be silly and a triplication of effort with the 
whole copying to secondary storage and then back off again. The volume is 
staying in the same cluster, just moving to a different primary storage (or SR 
in the XenServer vernacular).

If you are familiar with ESX, this is a "Storage vMotion", whereas in XenServer 
it is called "Storage XenMotion". 


From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 7:53 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

What do you mean by live migrating a data volume?!
I understand a live migration of a VM, but volumes...

Do you mean live migrating a VM that has a volume attached?
Are you migrating that volume to a different cluster, or just to a different
storage in the same cluster?
What hypervisor are you using?


On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington 
wrote:

> Live migrating a data volume. We are purely on shared storage so no local
> storage is involved.
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 7:37 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> Are you live migrating a VM, or migrating a volume of a stopped VM to a
> different primary storage?
>
> If it is a running VM, is the VM allocated in a shared storage or local
> storage?
>
> On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <
> rfarring...@remitdata.com>
> wrote:
>
> > The slow transfer is related to the storage we are trying to migrate off
> > of.  We are capable of getting about 350mbps off the disks but when we
> are
> > moving volumes that are greater than about 500GB we end up racing the
> clock
> > and hoping that the migration finishes before the job times out.   It
> would
> > be awesome to be able to manage that timeout and I know there are a ton
> of
> > settings I just don't know about and am hoping someone might be able to
> > point me in the right direction.
> >
> >
> > 
> > From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> > Sent: Monday, October 12, 2015 6:40 PM
> > To: users@cloudstack.apache.org
> > Subject: [Questionable]  Re: Timeout with live migration
> >
> > I would first check your NICs' speed and load, the amount of RAM
> allocated
> > for the migrating VM and than check the hypervisor log files.
> >
> > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
> > jan.arve.nyg...@gmail.com>
> > wrote:
> >
> > > What version are you running? Check if the copy.volume.wait setting is
> > set
> > > to 7200 and increase it. If not you could also check
> > > job.cancel.threshold.minutes and job.expire.minutes.
> > >
> > > -Jan-Arve
> > >
> > > 2015-10-13 0:46 GMT+02:00 Ryan Farrington :
> > >
> > > > We are experiencing a failure in cloudstack waiting for an async job
> > > > performing a live migration of a volume to finish. I've copied the
> > > relevant
> > > > log entries below.We acknowledge that the migration will take a few
> > hours
> > > > based on the volume of the data and we are looking for a way to
> > increase
> > > > the timeout of 7200 seconds into something we know we can work with.
> > > >
> > > >
> > > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > > > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command,
> due
> > > to
> > > > Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> > > > 835325398 to Host 27 timed out after 7200
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Rafael Weingärtner
> >
>
>
>
> --
> Rafael Weingärtner
>



--
Rafael Weingärtner


Re: Timeout with live migration

2015-10-12 Thread Jan-Arve Nygård
What version are you running? Check if the copy.volume.wait setting is set
to 7200 and increase it. If not you could also check
job.cancel.threshold.minutes and job.expire.minutes.

-Jan-Arve

2015-10-13 0:46 GMT+02:00 Ryan Farrington :

> We are experiencing a failure in cloudstack waiting for an async job
> performing a live migration of a volume to finish. I've copied the relevant
> log entries below.We acknowledge that the migration will take a few hours
> based on the volume of the data and we are looking for a way to increase
> the timeout of 7200 seconds into something we know we can work with.
>
>
> 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command, due to
> Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> 835325398 to Host 27 timed out after 7200
>
>
>


Re: Timeout with live migration

2015-10-12 Thread Rafael Weingärtner
What do you mean by live migrating a data volume?
I understand a live migration of a VM, but volumes...

Do you mean live migrating a VM that has a volume attached?
Are you migrating that volume to a different cluster, or just to a different
storage in the same cluster?
Which hypervisor are you using?


On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington 
wrote:

> Live migrating a data volume. We are purely on shared storage so no local
> storage is involved.
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 7:37 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> Are you live migrating a VM, or migrating a volume of a stopped VM to a
> different primary storage?
>
> If it is a running VM, is the VM allocated in a shared storage or local
> storage?
>
> On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <
> rfarring...@remitdata.com>
> wrote:
>
> > The slow transfer is related to the storage we are trying to migrate off
> > of.  We are capable of getting about 350mbps off the disks but when we
> are
> > moving volumes that are greater than about 500GB we end up racing the
> clock
> > and hoping that the migration finishes before the job times out.   It
> would
> > be awesome to be able to manage that timeout and I know there are a ton
> of
> > settings I just don't know about and am hoping someone might be able to
> > point me in the right direction.
> >
> >
> > 
> > From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> > Sent: Monday, October 12, 2015 6:40 PM
> > To: users@cloudstack.apache.org
> > Subject: [Questionable]  Re: Timeout with live migration
> >
> > I would first check your NICs' speed and load, the amount of RAM
> allocated
> > for the migrating VM and than check the hypervisor log files.
> >
> > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
> > jan.arve.nyg...@gmail.com>
> > wrote:
> >
> > > What version are you running? Check if the copy.volume.wait setting is
> > set
> > > to 7200 and increase it. If not you could also check
> > > job.cancel.threshold.minutes and job.expire.minutes.
> > >
> > > -Jan-Arve
> > >
> > > 2015-10-13 0:46 GMT+02:00 Ryan Farrington :
> > >
> > > > We are experiencing a failure in cloudstack waiting for an async job
> > > > performing a live migration of a volume to finish. I've copied the
> > > relevant
> > > > log entries below.We acknowledge that the migration will take a few
> > hours
> > > > based on the volume of the data and we are looking for a way to
> > increase
> > > > the timeout of 7200 seconds into something we know we can work with.
> > > >
> > > >
> > > > 2015-10-12 00:19:36,043 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > > > (Job-Executor-62:ctx-802065a9 ctx-bb27a168) Failed to send command,
> due
> > > to
> > > > Agent:27, com.cloud.exception.OperationTimedoutException: Commands
> > > > 835325398 to Host 27 timed out after 7200
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Rafael Weingärtner
> >
>
>
>
> --
> Rafael Weingärtner
>



-- 
Rafael Weingärtner


Re: [Questionable] Re: Timeout with live migration

2015-10-12 Thread Rafael Weingärtner
Now I understand what you are doing. I am familiar with that concept (live
migration of a VM within a cluster, with the VHD being moved from one SR to
another).

I just got confused when I read "live migration of volumes" (a volume does
not run by itself, so that is why I asked for some more information).

Looking at the source code this is the variable used to control the timeout:
"long timeout = (_migratewait) * 1000L;"

The value of "_migratewait" is taken from this parameter:
value = (String) params.get("migratewait");
_migratewait = NumbersUtil.parseInt(value, 3600);

Therefore, the name of the parameter to configure is "migratewait", and its
default value is 3600 (seconds, since it is multiplied by 1000 to get
milliseconds).
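
To make the two quoted lines easier to follow, here is a minimal standalone
sketch of the same logic. It is not the CloudStack source: the class name and
the parseIntOrDefault helper are hypothetical stand-ins (I am assuming
NumbersUtil.parseInt falls back to the default when the value is missing or
not a number), and only the "migratewait" key and the 3600 default come from
the code quoted above.

```java
import java.util.HashMap;
import java.util.Map;

class MigrateWaitSketch {
    // Hypothetical stand-in for NumbersUtil.parseInt(value, default).
    // ASSUMPTION: a missing or malformed value yields the default.
    static int parseIntOrDefault(String value, int defaultValue) {
        if (value == null) {
            return defaultValue;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }

    // Mirrors "long timeout = (_migratewait) * 1000L;": seconds to milliseconds.
    static long timeoutMillis(Map<String, Object> params) {
        String value = (String) params.get("migratewait");
        int migrateWait = parseIntOrDefault(value, 3600);
        return migrateWait * 1000L;
    }

    public static void main(String[] args) {
        Map<String, Object> params = new HashMap<>();
        // Parameter not set: the default of 3600 seconds applies.
        System.out.println(timeoutMillis(params));
        // Parameter set to 36000 seconds (10 hours).
        params.put("migratewait", "36000");
        System.out.println(timeoutMillis(params));
    }
}
```

The point of the sketch is only that the configured value is in seconds and
that an unset parameter silently becomes one hour.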


BTW1: I think that is a terrible parameter name. We should refactor it;
could you open a Jira ticket for that?

BTW2: the error message you posted does not seem to be related to the
migration timeout, because in the code, if the copy times out, the message
would be:
"Async " + timeout/1000 + " seconds timeout for task " + task.toString()

Maybe it throws a "Types.BadAsyncResult(msg)" that gets translated into the
message you saw, or the message might not be related to the problem itself
and you just thought that it was.


Does that help?


On Mon, Oct 12, 2015 at 10:00 PM, Ryan Farrington  wrote:

> Hypervisor:  XenServer
>
> We are moving a data volume from one storage onto another without shutting
> down the VM cause that would just be silly and a triplication of effort
> with the whole copying to secondary storage and then back off again. The
> volume is staying in the same cluster just moving to a different Primary
> storage (or SR in the XenServer vernacular)
>
> If you are familiar with ESX this is a "Storage VMotion" where as in
> XenServer it is called "Storage XenMotion".
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 7:53 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> what do you mean with livre migrating data volume ?!
> I understand a live migration of a VM, but volumes...
>
> do you mean live migrating a VM that has a volume attached?
> are you migrating that volume to a different cluster? or just a different
> storage in the same cluster?
> What hypervisor are you using ?
>
>
> On Mon, Oct 12, 2015 at 9:47 PM, Ryan Farrington <
> rfarring...@remitdata.com>
> wrote:
>
> > Live migrating a data volume. We are purely on shared storage so no local
> > storage is involved.
> >
> > 
> > From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> > Sent: Monday, October 12, 2015 7:37 PM
> > To: users@cloudstack.apache.org
> > Subject: [Questionable]  Re: Timeout with live migration
> >
> > Are you live migrating a VM, or migrating a volume of a stopped VM to a
> > different primary storage?
> >
> > If it is a running VM, is the VM allocated in a shared storage or local
> > storage?
> >
> > On Mon, Oct 12, 2015 at 9:17 PM, Ryan Farrington <
> > rfarring...@remitdata.com>
> > wrote:
> >
> > > The slow transfer is related to the storage we are trying to migrate
> off
> > > of.  We are capable of getting about 350mbps off the disks but when we
> > are
> > > moving volumes that are greater than about 500GB we end up racing the
> > clock
> > > and hoping that the migration finishes before the job times out.   It
> > would
> > > be awesome to be able to manage that timeout and I know there are a ton
> > of
> > > settings I just don't know about and am hoping someone might be able to
> > > point me in the right direction.
> > >
> > >
> > > 
> > > From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> > > Sent: Monday, October 12, 2015 6:40 PM
> > > To: users@cloudstack.apache.org
> > > Subject: [Questionable]  Re: Timeout with live migration
> > >
> > > I would first check your NICs' speed and load, the amount of RAM
> > allocated
> > > for the migrating VM and than check the hypervisor log files.
> > >
> > > On Mon, Oct 12, 2015 at 8:19 PM, Jan-Arve Nygård <
> > > jan.arve.nyg...@gmail.com>
> > > wrote:
> > >
> > > > What version are you running? Check if the copy.volume.wait setting
> is
> > > set
> > > > to 7200 and increase it. If not you could also check
> > > > job.cancel.threshold.minutes and job.expire.minutes.
> > > >
> > > > -Jan-Arve
> > > >
> > > > 2015-10-13 0:46 GMT+02:00 Ryan Farrington  >:
> > > >
> > > > > We are experiencing a failure in cloudstack waiting for an async
> job
> > > > > performing a live migration of a volume to finish. I've copied the
> > > > relevant
> > > > > log entries below.We acknowledge that the migration will take a few
> > > hours
> > > > > based on the volume of the data and we are looking for a way to
> > > increase
> > > > > the timeout of 7200 seconds into something we know we can work
> with.
> > > > >
> > > > 

Re: Timeout with live migration

2015-10-12 Thread Rafael Weingärtner
There is your problem: there are currently two distinct values controlling
those async jobs.
Change that value and everything will work for you.
Can you open a Jira ticket?

On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington  wrote:

> wait is currently configured to be 3600
>
>
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 9:46 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> I found something odd,
> can you check the parameter called "wait", what value is it using ?
>
> On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <
> rfarring...@remitdata.com
> > wrote:
>
> > Yes the parameter was set long ago and the management server has been
> > restarted numerous time over the past few days as we played with other
> > parameters to no effect.
> >
> > After looking at the log a little more does the "Failed to send command,
> > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200" mean that the migration start
> > command is being sent in some kind of synchronous mode and not returning
> > control back to the job manager?
> >
> >
> >
> >
> > 
> > From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> > Sent: Monday, October 12, 2015 8:46 PM
> > To: users@cloudstack.apache.org
> > Subject: [Questionable]  Re: Timeout with live migration
> >
> > I thought you using the command  “migrateVirtualMachineWithVolume” but it
> > seems that you are using “migrateVolume” command from ACS's API.
> >
> >
> > For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
> > 3600, means 1 hour of timeout.
> >
> > For the “migrateVolume” is the same, they both end up in
> >
> >
> “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
> > and in that method the parameter is the same.
> >
> >
> > If your parameter is set to 36000 (10 hours) I do not see why you are
> > getting the exception after 2 hours.
> >
> > Did you restart the management servers after you changed the parameter?
> >
> > On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
> > rfarring...@remitdata.com
> > > wrote:
> >
> > > Here is the full log, including the stack for the exception, that we
> get
> > > at the 2 hour mark. as for the migratewait it is set to 36000 which
> > should
> > > be 10 hours.
> > >
> > > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
> > > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
> > > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting
> > some
> > > more time because this is the current command
> > > 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
> > > com.cloud.exception.OperationTimedoutException in error code list for
> > > exceptions
> > > 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out
> > on
> > > Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via:
> 38(xen-nc-bc2b7),
> > > Ver: v1, Flags: 100111,
> > >
> >
> [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
> > > }
> > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857:
> Cancelling.
> > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
> > > commands found
> > > 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due
> > to
> > > Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > > 996939857 to Host 38 timed out after 7200
> > > 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
> > > com.cloud.utils.exception.CloudRuntimeException: Failed to send
> command,
> > > due to Agent:38, com.cloud.exception.OperationTimedoutException:
> Commands
> > > 996939857 to Host 38 timed out after 7200
> > > at
> > >
> >
> org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
> > > at
> > >
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
> > > at
> > >
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
> > > at
> > >
> >
> 

Re: Timeout with live migration

2015-10-12 Thread Rafael Weingärtner
I found something odd.
Can you check the parameter called "wait"? What value is it using?

On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington  wrote:

> Yes the parameter was set long ago and the management server has been
> restarted numerous time over the past few days as we played with other
> parameters to no effect.
>
> After looking at the log a little more does the "Failed to send command,
> due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> 996939857 to Host 38 timed out after 7200" mean that the migration start
> command is being sent in some kind of synchronous mode and not returning
> control back to the job manager?
>
>
>
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 8:46 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> I thought you using the command  “migrateVirtualMachineWithVolume” but it
> seems that you are using “migrateVolume” command from ACS's API.
>
>
> For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
> 3600, means 1 hour of timeout.
>
> For the “migrateVolume” is the same, they both end up in
>
> “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
> and in that method the parameter is the same.
>
>
> If your parameter is set to 36000 (10 hours) I do not see why you are
> getting the exception after 2 hours.
>
> Did you restart the management servers after you changed the parameter?
>
> On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
> rfarring...@remitdata.com
> > wrote:
>
> > Here is the full log, including the stack for the exception, that we get
> > at the 2 hour mark. as for the migratewait it is set to 36000 which
> should
> > be 10 hours.
> >
> > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
> > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
> > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting
> some
> > more time because this is the current command
> > 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
> > com.cloud.exception.OperationTimedoutException in error code list for
> > exceptions
> > 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out
> on
> > Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7),
> > Ver: v1, Flags: 100111,
> >
> [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
> > }
> > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling.
> > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
> > commands found
> > 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due
> to
> > Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200
> > 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
> > com.cloud.utils.exception.CloudRuntimeException: Failed to send command,
> > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200
> > at
> >
> org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
> > at
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
> > at
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
> > at
> >
> org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
> > at
> >
> org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
> > at
> >
> com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
> > at
> >
> com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
> > at
> >
> com.cloud.storage.VolumeApiServiceImpl.migrateVolume(VolumeApiServiceImpl.java:1622)
> > at sun.reflect.GeneratedMethodAccessor335.invoke(Unknown Source)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at 

Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
wait is currently configured to be 3600




From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 9:46 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

I found something odd,
can you check the parameter called "wait", what value is it using ?

On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington  wrote:

> Yes the parameter was set long ago and the management server has been
> restarted numerous time over the past few days as we played with other
> parameters to no effect.
>
> After looking at the log a little more does the "Failed to send command,
> due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> 996939857 to Host 38 timed out after 7200" mean that the migration start
> command is being sent in some kind of synchronous mode and not returning
> control back to the job manager?
>
>
>
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 8:46 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> I thought you using the command  “migrateVirtualMachineWithVolume” but it
> seems that you are using “migrateVolume” command from ACS's API.
>
>
> For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
> 3600, means 1 hour of timeout.
>
> For the “migrateVolume” is the same, they both end up in
>
> “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
> and in that method the parameter is the same.
>
>
> If your parameter is set to 36000 (10 hours) I do not see why you are
> getting the exception after 2 hours.
>
> Did you restart the management servers after you changed the parameter?
>
> On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
> rfarring...@remitdata.com
> > wrote:
>
> > Here is the full log, including the stack for the exception, that we get
> > at the 2 hour mark. as for the migratewait it is set to 36000 which
> should
> > be 10 hours.
> >
> > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
> > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
> > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting
> some
> > more time because this is the current command
> > 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
> > com.cloud.exception.OperationTimedoutException in error code list for
> > exceptions
> > 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out
> on
> > Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via: 38(xen-nc-bc2b7),
> > Ver: v1, Flags: 100111,
> >
> [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
> > }
> > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Cancelling.
> > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
> > commands found
> > 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due
> to
> > Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200
> > 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
> > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
> > com.cloud.utils.exception.CloudRuntimeException: Failed to send command,
> > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200
> > at
> >
> org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:116)
> > at
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.migrateVolumeToPool(AncientDataMotionStrategy.java:382)
> > at
> >
> org.apache.cloudstack.storage.motion.AncientDataMotionStrategy.copyAsync(AncientDataMotionStrategy.java:421)
> > at
> >
> org.apache.cloudstack.storage.motion.DataMotionServiceImpl.copyAsync(DataMotionServiceImpl.java:70)
> > at
> >
> org.apache.cloudstack.storage.volume.VolumeServiceImpl.migrateVolume(VolumeServiceImpl.java:931)
> > at
> >
> com.cloud.storage.VolumeApiServiceImpl.liveMigrateVolume(VolumeApiServiceImpl.java:1680)
> > at
> >
> com.cloud.storage.VolumeApiServiceImpl.orchestrateMigrateVolume(VolumeApiServiceImpl.java:1666)
> > at
> >
> 

RE: [Questionable] Re: Timeout with live migration

2015-10-12 Thread Ryan Farrington
Yes, I can open JIRA tickets. What would you like me to do?

I'll be happy to change the "wait" parameter. Should I assume it needs to be
half of the value I want it to be?
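
For reference, a rough sketch of the arithmetic behind that question. Two
things here are assumptions, not confirmed behaviour: that "350mbps" means
megabits per second (decimal units), and that the effective timeout is twice
the "wait" value, which is only inferred from seeing 7200 in the log while
"wait" is 3600. The class and method names are made up for illustration.

```java
class WaitEstimateSketch {
    // Seconds needed to copy sizeGb gigabytes at rateMbps megabits per second.
    // 1 GB = 8000 megabits in decimal units.
    static double transferSeconds(double sizeGb, double rateMbps) {
        return sizeGb * 8_000.0 / rateMbps;
    }

    // ASSUMPTION: effective timeout = multiplier * wait (multiplier 2 is a guess
    // based on wait=3600 and the observed 7200 second timeout).
    static long requiredWait(double transferSeconds, int multiplier) {
        return (long) Math.ceil(transferSeconds / multiplier);
    }

    public static void main(String[] args) {
        double seconds = transferSeconds(500, 350);
        // Estimated copy time for a 500 GB volume, in seconds.
        System.out.println(seconds);
        // Smallest "wait" that would cover it under the doubling assumption.
        System.out.println(requiredWait(seconds, 2));
    }
}
```

Under those assumptions a 500 GB volume needs a bit over three hours, so it
cannot finish inside 7200 seconds, and any real setting would want a generous
margin on top of the computed minimum.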




From: Rafael Weingärtner [rafaelweingart...@gmail.com]
Sent: Monday, October 12, 2015 10:12 PM
To: users@cloudstack.apache.org
Subject: [Questionable]  Re: Timeout with live migration

There is your problem, there are currently two distinct values conrolling
those async jobs.
Change that value and everything will work for u.
Can you open a jira ticket?

On Mon, Oct 12, 2015 at 11:51 PM, Ryan Farrington  wrote:

> wait is currently configured to be 3600
>
>
>
> 
> From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> Sent: Monday, October 12, 2015 9:46 PM
> To: users@cloudstack.apache.org
> Subject: [Questionable]  Re: Timeout with live migration
>
> I found something odd,
> can you check the parameter called "wait", what value is it using ?
>
> On Mon, Oct 12, 2015 at 10:54 PM, Ryan Farrington <
> rfarring...@remitdata.com
> > wrote:
>
> > Yes the parameter was set long ago and the management server has been
> > restarted numerous time over the past few days as we played with other
> > parameters to no effect.
> >
> > After looking at the log a little more does the "Failed to send command,
> > due to Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > 996939857 to Host 38 timed out after 7200" mean that the migration start
> > command is being sent in some kind of synchronous mode and not returning
> > control back to the job manager?
> >
> >
> >
> >
> > 
> > From: Rafael Weingärtner [rafaelweingart...@gmail.com]
> > Sent: Monday, October 12, 2015 8:46 PM
> > To: users@cloudstack.apache.org
> > Subject: [Questionable]  Re: Timeout with live migration
> >
> > I thought you using the command  “migrateVirtualMachineWithVolume” but it
> > seems that you are using “migrateVolume” command from ACS's API.
> >
> >
> > For the code I debugged “migrateVirtualMachineWithVolume”, the parameter
> > 3600, means 1 hour of timeout.
> >
> > For the “migrateVolume” is the same, they both end up in
> >
> >
> “com.cloud.hypervisor.xen.resource.XenServer610Resource.execute(MigrateVolumeCommand)”,
> > and in that method the parameter is the same.
> >
> >
> > If your parameter is set to 36000 (10 hours) I do not see why you are
> > getting the exception after 2 hours.
> >
> > Did you restart the management servers after you changed the parameter?
> >
> > On Mon, Oct 12, 2015 at 10:31 PM, Ryan Farrington <
> > rfarring...@remitdata.com
> > > wrote:
> >
> > > Here is the full log, including the stack for the exception, that we
> get
> > > at the 2 hour mark. as for the migratewait it is set to 36000 which
> > should
> > > be 10 hours.
> > >
> > > 2015-10-12 18:41:20,137 DEBUG [c.c.a.m.DirectAgentAttache]
> > > (DirectAgent-323:ctx-6d42edd7) Seq 31-1023875267: Executing request
> > > 2015-10-12 18:41:20,457 DEBUG [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Waiting
> > some
> > > more time because this is the current command
> > > 2015-10-12 18:41:20,457 INFO  [c.c.u.e.CSExceptionErrorCode]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Could not find exception:
> > > com.cloud.exception.OperationTimedoutException in error code list for
> > > exceptions
> > > 2015-10-12 18:41:20,465 WARN  [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: Timed out
> > on
> > > Seq 38-996939857:  { Cmd , MgmtId: 42756806312036, via:
> 38(xen-nc-bc2b7),
> > > Ver: v1, Flags: 100111,
> > >
> >
> [{"com.cloud.agent.api.storage.MigrateVolumeCommand":{"volumeId":808,"volumePath":"0cd3ec8c-9fa9-4caf-8380-1a85cdfd0958","pool":{"id":246,"uuid":"VNX_PR5_LUN2003","host":"localhost","path":"/VNX_PR5_LUN2003","port":0,"type":"PreSetup"},"attachedVmName":"i-34-311-VM","wait":0}}]
> > > }
> > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857:
> Cancelling.
> > > 2015-10-12 18:41:20,465 DEBUG [c.c.a.m.AgentAttache]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Seq 38-996939857: No more
> > > commands found
> > > 2015-10-12 18:41:20,465 DEBUG [o.a.c.s.RemoteHostEndPoint]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) Failed to send command, due
> > to
> > > Agent:38, com.cloud.exception.OperationTimedoutException: Commands
> > > 996939857 to Host 38 timed out after 7200
> > > 2015-10-12 18:41:20,471 DEBUG [o.a.c.s.m.AncientDataMotionStrategy]
> > > (Job-Executor-63:ctx-f7b6817d ctx-c6b92515) copy failed
> > > com.cloud.utils.exception.CloudRuntimeException: Failed to send
> command,
> > > due to Agent:38, com.cloud.exception.OperationTimedoutException:
> Commands
> > > 996939857 to Host 38 timed out after 7200
> > > at
> > >
> >
>