Re: Help! Jobs stuck in pending state

2019-01-24 Thread Suresh Kumar Anaparti
Hi Alireza,

Details of the tables are below, to the best of my knowledge. @Dev, please
correct me if any detail is wrong.

- sync_queue and sync_queue_item - used for handling the entity (VM, host,
etc.) queues and concurrency control. Mainly, all the VM sync jobs pass
through this queuing.
- async_job - holds all the async jobs and the related placeholder VM async
jobs (if any).
- vm_work_job - extension of the placeholder VM async job in async_job; it
holds the VM id and the job stage.
- op_ha_work - holds the VM work items used to perform HA on the VMs,
scheduled or cancelled based on the VM state.
- op_lock - used to acquire a lock on a record in the given table (key:
 + ) for a transaction by a running thread in the
Management Server. The lock is released once the transaction completes, and
the corresponding record is deleted.
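
As a read-only first step before touching any of these tables, one can simply count what is in them. The following is a minimal Python sketch, assuming an in-memory SQLite stand-in rather than the real MySQL `cloud` database. The table names are the ones listed above; the single `id` column and the `count_rows` helper are made up for illustration.

```python
import sqlite3

# Table names as listed above; the one-column layout is a stand-in,
# not the real CloudStack schema.
JOB_TABLES = ["sync_queue", "sync_queue_item", "async_job",
              "vm_work_job", "op_ha_work", "op_lock"]


def count_rows(conn, tables):
    """Return {table: row count} without modifying anything."""
    counts = {}
    for table in tables:
        # Table names come from the fixed list above, never from user input.
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts


conn = sqlite3.connect(":memory:")
for table in JOB_TABLES:
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
# Two toy rows standing in for leftover work jobs.
conn.executemany("INSERT INTO vm_work_job (id) VALUES (?)", [(57262,), (57268,)])

row_counts = count_rows(conn, JOB_TABLES)
print(row_counts)
```

Against a real deployment the same `SELECT COUNT(*)` statements would be run through a MySQL client instead, ideally after taking a database backup.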

Hope this helps!

-Suresh

On Thu, Jan 24, 2019 at 12:49 AM Alireza Eskandari 
wrote:

> Dear Suresh and Andrei
> Thanks for your help.
> I have upgrade CloudStack from 4.9.3 to 4.11.2 but the problem still
> persists.
> Then I inspect database tables and I found that these three tables could be
> the root cause:
> - op_ha_work
> - op_lock
> - vm_work_job
> So I delete all records in those tables and problem solved.
> The content of those tables are submitted as a comment in the bug report in
> jira:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> Suresh, could you tell me more about the role of those tables in CS?
> I think CS had been more sensitive about concurrent jobs. Previous versions
> works better.
> Regards

Re: Help! Jobs stuck in pending state

2019-01-23 Thread Alireza Eskandari
Dear Suresh and Andrei,
Thanks for your help.
I upgraded CloudStack from 4.9.3 to 4.11.2, but the problem still
persisted.
Then I inspected the database tables and found that these three tables could
be the root cause:
- op_ha_work
- op_lock
- vm_work_job
So I deleted all records in those tables and the problem was solved.
The contents of those tables are posted as a comment on the bug report in
Jira:
https://issues.apache.org/jira/browse/CLOUDSTACK-10401
Suresh, could you tell me more about the role of those tables in CS?
I think CS has become more sensitive about concurrent jobs. Previous versions
worked better.
Regards

On Wed, Jan 23, 2019 at 9:43 PM Suresh Kumar Anaparti <
sureshkumar.anapa...@gmail.com> wrote:

> Hi Alireza,
>
> *sync_queue *table is the actual VM sync queue which holds a queue id for
> each VM (*sync_objtype*: VmWorkJobQueue, *sync_objid*: ) and the VM
> jobs would reside in *sync_queue_item* table against that queue id. Only
> one running job is allowed per VM queue (*queue_size_limit*: 1 in
> *sync_queue* table). The active/running job would have the *queue_proc_id*,
> *queue_proc_number* and *queue_proc_time* set in the *sync_queue_item*
> table
> and the rest jobs with that queue id would be waiting for active job to
> complete. So, to delete pending jobs, records in the *sync_queue_item
> *table
> has to be cleared for the respective VMs, not the *sync_queue *table.
>
> I think, in your case, snapshots is taking long time and other jobs in that
> VM are pending for long time as they are in queue waiting for snapshot job
> to complete. What are the config values set for
> "job.cancel.threshold.minutes", "job.expire.minutes" and
> "volume.snapshot.job.cancel.threshold"? Are the jobs cancelled after the
> threshold time?
>
> Thanks,
> Suresh

Re: Help! Jobs stuck in pending state

2019-01-23 Thread Suresh Kumar Anaparti
Hi Alireza,

The *sync_queue* table is the actual VM sync queue; it holds a queue id for
each VM (*sync_objtype*: VmWorkJobQueue, *sync_objid*: the VM instance id),
and the VM jobs reside in the *sync_queue_item* table against that queue id.
Only one running job is allowed per VM queue (*queue_size_limit*: 1 in the
*sync_queue* table). The active/running job has *queue_proc_id*,
*queue_proc_number* and *queue_proc_time* set in the *sync_queue_item* table,
and the rest of the jobs with that queue id wait for the active job to
complete. So, to delete pending jobs, the records in the *sync_queue_item*
table have to be cleared for the respective VMs, not those in the
*sync_queue* table.
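
The queue layout described above can be sketched with toy tables. This is only an illustration under stated assumptions: an in-memory SQLite database instead of the real MySQL schema, a link column called `queue_id` (the name is assumed), "not set" modelled as NULL, and hypothetical helper names. The other column names are the ones mentioned above. Only the waiting items of one VM queue are deleted; the active item and the *sync_queue* row are left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sync_queue (
    id INTEGER PRIMARY KEY,
    sync_objtype TEXT,
    sync_objid INTEGER,
    queue_size_limit INTEGER
);
CREATE TABLE sync_queue_item (
    id INTEGER PRIMARY KEY,
    queue_id INTEGER,            -- assumed name of the link to sync_queue.id
    queue_proc_number INTEGER,   -- assumed NULL while the item is waiting
    queue_proc_time TEXT
);
-- Toy rows: one queue for VM 748 with one active and two waiting items.
INSERT INTO sync_queue VALUES (16390, 'VmWorkJobQueue', 748, 1);
INSERT INTO sync_queue_item VALUES (1, 16390, 1, '2019-01-22 16:00:00');
INSERT INTO sync_queue_item VALUES (2, 16390, NULL, NULL);
INSERT INTO sync_queue_item VALUES (3, 16390, NULL, NULL);
""")

# Shared FROM/WHERE clause: waiting items in the VM's work-job queue.
WAITING_ITEMS = """
    FROM sync_queue_item
    WHERE queue_proc_number IS NULL
      AND queue_id IN (SELECT id FROM sync_queue
                       WHERE sync_objtype = 'VmWorkJobQueue' AND sync_objid = ?)
"""


def waiting_item_ids(conn, vm_id):
    """List the ids of queue items that have not been picked up yet."""
    rows = conn.execute("SELECT id" + WAITING_ITEMS + " ORDER BY id", (vm_id,))
    return [row[0] for row in rows]


def delete_waiting_items(conn, vm_id):
    """Delete only the waiting items; returns the number of rows removed."""
    with conn:  # commits on success, rolls back on error
        return conn.execute("DELETE" + WAITING_ITEMS, (vm_id,)).rowcount


pending_before = waiting_item_ids(conn, 748)
deleted = delete_waiting_items(conn, 748)
remaining = conn.execute("SELECT COUNT(*) FROM sync_queue_item").fetchone()[0]
print(pending_before, deleted, remaining)
```

Listing the ids first and deleting second keeps the destructive step reviewable, which matters when the same statements are adapted for a production database.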

I think, in your case, the snapshots are taking a long time, and the other
jobs for that VM stay pending for a long time because they are queued behind
the snapshot job, waiting for it to complete. What are the config values set
for "job.cancel.threshold.minutes", "job.expire.minutes" and
"volume.snapshot.job.cancel.threshold"? Are the jobs cancelled after the
threshold time?

Thanks,
Suresh

On Wed, Jan 23, 2019 at 7:14 PM Andrei Mikhailovsky
 wrote:

> Hi
>
> I've had this issue a few times in 2018 and managed to get it fixed pretty
> easily, although had spent a number of hours initially trying to figure out
> WTF is going on. This issue looks like one of those artefacts that creeped
> up in one of the versions released in 2018 and hasn't been addressed by the
> dev team.
>
> The way I fixed it was similar to what has been recommended earlier.
> However, the difference was that I am sure I've looked at more tables than
> just the two suggested. Basically, I've stopped the management server,
> created the sql backup, connected to the sql db and listed all tables.
> Grepped for the words like job/schedule/queue/sync. After that I've went
> through all the tables and pretty much removed all the past / active /
> awaiting execution jobs. I have started by looking at the vm related jobs
> (the vm that I've tried to start but wasn't able to). This has worked once,
> but the second time I had to remove a lot more jobs which relate to other
> vms. After that I've started the management server and all went well from
> there.
>
> What I have also noticed is that my snapshot jobs (I use KVM and Ceph)
> seem to be blocking jobs on the hypervisor hosts which are running these
> snapshots. So, if I am trying to perform various vm related jobs on a host
> server which is currently running a snapshot process, that job will not be
> executed until the snapshot process is done. I've tested this countless
> number of times and it's still the case. Again, this issued appeared in one
> of the 2018 releases as I've never seen between 2012 - 2017.
>
> Both issues are annoying as hell!
>
> Cheers
>
> ----- Original Message -----
> > From: "Alireza Eskandari" 
> > To: "dev" 
> > Sent: Wednesday, 23 January, 2019 12:40:48
> > Subject: Re: Help! Jobs stuck in pending state
>
> > I'm following this issue in github:
> > https://github.com/apache/cloudstack/issues/3104
> > Please leave your comments
> > Thanks
> >
> > On Wed, Jan 23, 2019 at 12:39 PM Wei ZHOU  wrote:
> >
> >> Hi Alireza,
> >>
> >> could you try again after restarting mgt server ?
> >>
> >> -Wei
> >>
> >> On Wed, Jan 23, 2019 at 6:22 AM Alireza Eskandari  wrote:
> >>
> >> > First I deleted two jobs which was existed in  vm_work_job table and
> its
> >> > related entry in  sync_queue table but it doesn't help.
> >> > Then I delete all the entries in sync_queue tables and again no
> success.
> >> > Any idea?
> >> >
> >> > On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU 
> wrote:
> >> >
> >> > > If you know the instance id and mysql password, it should work after
> >> > > removing some records in mysql.
> >> > >
> >> > > ```
> >> > > set @id=X;
> >> > >
> >> > > delete from vm_work_job where vm_instance_id=@id;
> >> > > delete from sync_queue where sync_objid=@id;
> >> > > ```
> >> > >
> >> > > On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:
> >> > >
> >> > > > Hi guys
> >> > > > I have opened a bug in jira about my problem in CS:
> >> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> >> > > > CloudStack doesn't process jobs! My cloud in totally unusable.
> >> > > > Thanks in advance for you help.
> >> > > >
> >> > >
> >> >
>


Re: Help! Jobs stuck in pending state

2019-01-23 Thread Andrei Mikhailovsky
Hi

I've had this issue a few times in 2018 and managed to get it fixed pretty
easily, although I had spent a number of hours initially trying to figure out
WTF was going on. This issue looks like one of those artefacts that crept into
one of the versions released in 2018 and hasn't been addressed by the dev team.

The way I fixed it was similar to what has been recommended earlier. However,
the difference was that I am sure I looked at more tables than just the two
suggested. Basically, I stopped the management server, created an SQL backup,
connected to the SQL db and listed all tables, then grepped for words like
job/schedule/queue/sync. After that I went through all the matching tables and
pretty much removed all the past / active / awaiting-execution jobs. I started
by looking at the VM-related jobs (for the VM that I had tried to start but
wasn't able to). This worked once, but the second time I had to remove a lot
more jobs, which related to other VMs. After that I started the management
server and all went well from there.
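
The "list all tables and grep for keywords" step can be sketched as follows. This is a minimal Python illustration, assuming an in-memory SQLite stand-in whose table names are taken from this thread; the `job_related_tables` helper is hypothetical. On the real MySQL database the table list would come from `SHOW TABLES` instead of `sqlite_master`.

```python
import sqlite3

# Keywords mentioned above.
KEYWORDS = ("job", "schedule", "queue", "sync")


def job_related_tables(conn, keywords=KEYWORDS):
    """Return the names of tables whose name contains any of the keywords."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return [name for name in names if any(word in name for word in keywords)]


conn = sqlite3.connect(":memory:")
# Stand-in tables; only the names matter for this sketch.
for table in ("async_job", "sync_queue", "sync_queue_item",
              "vm_work_job", "op_lock", "op_ha_work"):
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")

matches = job_related_tables(conn)
print(matches)
```

Note that a pure keyword match does not pick up op_lock or op_ha_work, which were also cleared elsewhere in this thread, so the keyword list may need extending.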

What I have also noticed is that my snapshot jobs (I use KVM and Ceph) seem to
be blocking jobs on the hypervisor hosts which are running these snapshots. So,
if I try to perform various VM-related jobs on a host server which is currently
running a snapshot process, those jobs will not be executed until the snapshot
process is done. I've tested this countless times and it's still the case.
Again, this issue appeared in one of the 2018 releases, as I had never seen it
between 2012 and 2017.

Both issues are annoying as hell!

Cheers

----- Original Message -----
> From: "Alireza Eskandari" 
> To: "dev" 
> Sent: Wednesday, 23 January, 2019 12:40:48
> Subject: Re: Help! Jobs stuck in pending state

> I'm following this issue in github:
> https://github.com/apache/cloudstack/issues/3104
> Please leave your comments
> Thanks
> 
> On Wed, Jan 23, 2019 at 12:39 PM Wei ZHOU  wrote:
> 
>> Hi Alireza,
>>
>> could you try again after restarting mgt server ?
>>
>> -Wei
>>
>> On Wed, Jan 23, 2019 at 6:22 AM Alireza Eskandari  wrote:
>>
>> > First I deleted two jobs which was existed in  vm_work_job table and its
>> > related entry in  sync_queue table but it doesn't help.
>> > Then I delete all the entries in sync_queue tables and again no success.
>> > Any idea?
>> >
>> > On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU  wrote:
>> >
>> > > If you know the instance id and mysql password, it should work after
>> > > removing some records in mysql.
>> > >
>> > > ```
>> > > set @id=X;
>> > >
>> > > delete from vm_work_job where vm_instance_id=@id;
>> > > delete from sync_queue where sync_objid=@id;
>> > > ```
>> > >
>> > > On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:
>> > >
>> > > > Hi guys
>> > > > I have opened a bug in jira about my problem in CS:
>> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
>> > > > CloudStack doesn't process jobs! My cloud in totally unusable.
>> > > > Thanks in advance for you help.
>> > > >
>> > >
>> >


Re: Help! Jobs stuck in pending state

2019-01-23 Thread Alireza Eskandari
I'm following this issue on GitHub:
https://github.com/apache/cloudstack/issues/3104
Please leave your comments there.
Thanks

On Wed, Jan 23, 2019 at 12:39 PM Wei ZHOU  wrote:

> Hi Alireza,
>
> could you try again after restarting mgt server ?
>
> -Wei
>
> On Wed, Jan 23, 2019 at 6:22 AM Alireza Eskandari  wrote:
>
> > First I deleted two jobs which was existed in  vm_work_job table and its
> > related entry in  sync_queue table but it doesn't help.
> > Then I delete all the entries in sync_queue tables and again no success.
> > Any idea?
> >
> > On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU  wrote:
> >
> > > If you know the instance id and mysql password, it should work after
> > > removing some records in mysql.
> > >
> > > ```
> > > set @id=X;
> > >
> > > delete from vm_work_job where vm_instance_id=@id;
> > > delete from sync_queue where sync_objid=@id;
> > > ```
> > >
> > > On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:
> > >
> > > > Hi guys
> > > > I have opened a bug in jira about my problem in CS:
> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > > > CloudStack doesn't process jobs! My cloud in totally unusable.
> > > > Thanks in advance for you help.
> > > >
> > >
> >
>


Re: Help! Jobs stuck in pending state

2019-01-23 Thread Wei ZHOU
Hi Alireza,

Could you try again after restarting the management server?

-Wei

On Wed, Jan 23, 2019 at 6:22 AM Alireza Eskandari  wrote:

> First I deleted two jobs which was existed in  vm_work_job table and its
> related entry in  sync_queue table but it doesn't help.
> Then I delete all the entries in sync_queue tables and again no success.
> Any idea?
>
> On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU  wrote:
>
> > If you know the instance id and mysql password, it should work after
> > removing some records in mysql.
> >
> > ```
> > set @id=X;
> >
> > delete from vm_work_job where vm_instance_id=@id;
> > delete from sync_queue where sync_objid=@id;
> > ```
> >
> > On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:
> >
> > > Hi guys
> > > I have opened a bug in jira about my problem in CS:
> > > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > > CloudStack doesn't process jobs! My cloud in totally unusable.
> > > Thanks in advance for you help.
> > >
> >
>


Re: Help! Jobs stuck in pending state

2019-01-22 Thread Anurag Awasthi
Hi Alireza,

Could you elaborate on how you instantiated the jobs and on anything specific
that went wrong in between? Deleting rows directly through SQL statements is
usually very risky, and the first attempt should go through whatever API
support is available.

Also, you might want to use the GitHub issues page
(https://github.com/apache/cloudstack/issues) to raise an issue, as I think
most people active on the project refer to the issues list on that page.

Best Regards,
Anurag

On 1/23/19, 10:52 AM, "Alireza Eskandari"  wrote:

First I deleted two jobs which was existed in  vm_work_job table and its
related entry in  sync_queue table but it doesn't help.
Then I delete all the entries in sync_queue tables and again no success.
Any idea?



On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU  wrote:

> If you know the instance id and mysql password, it should work after
> removing some records in mysql.
>
> ```
> set @id=X;
>
> delete from vm_work_job where vm_instance_id=@id;
> delete from sync_queue where sync_objid=@id;
> ```
>
> On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:
>
> > Hi guys
> > I have opened a bug in jira about my problem in CS:
> > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > CloudStack doesn't process jobs! My cloud in totally unusable.
> > Thanks in advance for you help.
> >
>




Re: Help! Jobs stuck in pending state

2019-01-22 Thread Alireza Eskandari
First I deleted the two jobs that existed in the vm_work_job table and their
related entries in the sync_queue table, but it didn't help.
Then I deleted all the entries in the sync_queue tables, and again no success.
Any idea?

On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU  wrote:

> If you know the instance id and mysql password, it should work after
> removing some records in mysql.
>
> ```
> set @id=X;
>
> delete from vm_work_job where vm_instance_id=@id;
> delete from sync_queue where sync_objid=@id;
> ```
>
> On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:
>
> > Hi guys
> > I have opened a bug in jira about my problem in CS:
> > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > CloudStack doesn't process jobs! My cloud in totally unusable.
> > Thanks in advance for you help.
> >
>


Re: Help! Jobs stuck in pending state

2019-01-22 Thread Alireza Eskandari
Here is my query on those tables:

MySQL [cloud]> select * from vm_work_job;
+-------+----------+----------+----------------+
| id    | step     | vm_type  | vm_instance_id |
+-------+----------+----------+----------------+
| 57262 | Prepare  | Instance |            691 |
| 57268 | Starting | Instance |            748 |
+-------+----------+----------+----------------+
2 rows in set (0.00 sec)


MySQL [cloud]> SELECT * FROM cloud.sync_queue;
+-------+----------------+------------+-------------------+---------------------+---------------------+------------+------------------+
| id    | sync_objtype   | sync_objid | queue_proc_number | created             | last_updated        | queue_size | queue_size_limit |
+-------+----------------+------------+-------------------+---------------------+---------------------+------------+------------------+
|     4 | VmWorkJobQueue |          1 |                 3 | 2017-08-28 12:24:09 | 2017-08-28 12:24:42 |          0 |                1 |
|     7 | VmWorkJobQueue |          2 |                 4 | 2017-08-28 12:24:10 | 2017-08-28 13:18:54 |          0 |                1 |
|    19 | VmWorkJobQueue |          3 |                 4 | 2017-08-28 12:44:09 | 2017-08-29 11:31:09 |          0 |                1 |
|    34 | VmWorkJobQueue |          4 |                 2 | 2017-08-29 11:03:28 | 2017-08-29 11:24:59 |          0 |                1 |
.
.
.
| 16360 | VmWorkJobQueue |        745 |                 2 | 2019-01-22 07:06:48 | 2019-01-22 08:06:56 |          0 |                1 |
| 16369 | VmWorkJobQueue |        746 |                 2 | 2019-01-22 11:01:45 | 2019-01-22 12:03:54 |          0 |                1 |
| 16378 | VmWorkJobQueue |        747 |                 2 | 2019-01-22 13:30:48 | 2019-01-22 14:32:54 |          0 |                1 |
| 16390 | VmWorkJobQueue |        748 |                 1 | 2019-01-22 15:48:53 | 2019-01-22 16:12:53 |          0 |                1 |
+-------+----------------+------------+-------------------+---------------------+---------------------+------------+------------------+
740 rows in set (0.01 sec)




On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU  wrote:

> If you know the instance id and mysql password, it should work after
> removing some records in mysql.
>
> ```
> set @id=X;
>
> delete from vm_work_job where vm_instance_id=@id;
> delete from sync_queue where sync_objid=@id;
> ```
>
> On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:
>
> > Hi guys
> > I have opened a bug in jira about my problem in CS:
> > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > CloudStack doesn't process jobs! My cloud in totally unusable.
> > Thanks in advance for you help.
> >
>


Re: Help! Jobs stuck in pending state

2019-01-22 Thread Wei ZHOU
If you know the instance id and mysql password, it should work after
removing some records in mysql.

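```
-- Replace X with the id of the stuck VM instance.
SET @id = X;

-- Remove the VM's leftover work jobs and its sync queue entry.
DELETE FROM vm_work_job WHERE vm_instance_id = @id;
DELETE FROM sync_queue WHERE sync_objid = @id;
```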
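
Because these deletes are destructive, it can help to run them inside one transaction and look at the affected row counts before committing. Below is a minimal Python sketch of that idea, assuming an in-memory SQLite stand-in with toy rows; the `clear_vm_jobs` helper and its `dry_run` flag are hypothetical, and only the two tables and the columns used in the statements above are modelled.

```python
import sqlite3


def clear_vm_jobs(conn, vm_id, dry_run=True):
    """Run both deletes in one transaction.

    Returns (work_jobs_deleted, queues_deleted). With dry_run=True the
    transaction is rolled back, so nothing is actually removed.
    """
    cur = conn.cursor()
    try:
        work = cur.execute(
            "DELETE FROM vm_work_job WHERE vm_instance_id = ?", (vm_id,)).rowcount
        queues = cur.execute(
            "DELETE FROM sync_queue WHERE sync_objid = ?", (vm_id,)).rowcount
    except Exception:
        conn.rollback()
        raise
    if dry_run:
        conn.rollback()
    else:
        conn.commit()
    return work, queues


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vm_work_job (id INTEGER PRIMARY KEY, vm_instance_id INTEGER)")
conn.execute("CREATE TABLE sync_queue (id INTEGER PRIMARY KEY, sync_objid INTEGER)")
# Toy rows for two VMs (ids 691 and 748).
conn.executemany("INSERT INTO vm_work_job VALUES (?, ?)", [(1, 691), (2, 748)])
conn.executemany("INSERT INTO sync_queue VALUES (?, ?)", [(10, 691), (11, 748)])
conn.commit()

preview = clear_vm_jobs(conn, 748, dry_run=True)
rows_after_preview = conn.execute("SELECT COUNT(*) FROM vm_work_job").fetchone()[0]
applied = clear_vm_jobs(conn, 748, dry_run=False)
rows_after_apply = conn.execute("SELECT COUNT(*) FROM vm_work_job").fetchone()[0]
print(preview, rows_after_preview, applied, rows_after_apply)
```

In a MySQL session the same effect can be had by wrapping the statements in `START TRANSACTION;` and ending with `ROLLBACK;` or `COMMIT;`, provided the tables use a transactional storage engine.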

On Tue, Jan 22, 2019 at 10:59 PM Alireza Eskandari  wrote:

> Hi guys
> I have opened a bug in jira about my problem in CS:
> https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> CloudStack doesn't process jobs! My cloud in totally unusable.
> Thanks in advance for you help.
>


Help! Jobs stuck in pending state

2019-01-22 Thread Alireza Eskandari
Hi guys,
I have opened a bug in Jira about my problem in CS:
https://issues.apache.org/jira/browse/CLOUDSTACK-10401
CloudStack doesn't process jobs! My cloud is totally unusable.
Thanks in advance for your help.