Re: [galaxy-dev] Galaxy dropping jobs?

2013-11-06 Thread Nikolai Vazov

Hi again,

The loop (as explained below) did the job :)

Nikolay


On 2013-11-05, I wrote:

Thank you very much, Nate,

1.
I have put in a fix: a loop that runs the JobStatus check up to 5 times, 
every 60 seconds, and only then throws an exception like the one below.
It turned out that all the connect failures happen at the same time - at 
slurm log rotation time at 3 am. Hopefully this helps :)
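
Here is a minimal sketch of that retry loop, assuming the drmaa-python 
bindings Galaxy uses; the function name and parameters are mine, and the 
exact exception class slurm-drmaa raises for a connect failure is an 
assumption (in our logs it surfaces as "code 1: slurm_load_jobs error"):

  import time
  import drmaa

  def job_status_with_retry(session, external_job_id, retries=5, delay=60):
      # Poll the DRMAA job status, tolerating transient controller outages.
      for attempt in range(1, retries + 1):
          try:
              return session.jobStatus(external_job_id)
          except (drmaa.errors.DrmCommunicationException,
                  drmaa.errors.InternalException):
              # slurmctld can be briefly unreachable, e.g. during the
              # 3 am log rotation; wait and try again.
              if attempt == retries:
                  raise  # retries exhausted: let the runner see the failure
              time.sleep(delay)

With retries=5 and delay=60 the check rides out a controller outage of 
roughly five minutes before giving up, which covers the rotation window 
we saw.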


2.
Our slurm conf keeps the info about each job for 5 min. But looking at 
the code, it seems that in the case you describe below, there will be an 
"InvalidJobException" leading to the "Job finished" state. Am I wrong?


Anyway, I'll let you know if the loop does the job.

Thanks again,

Nikolay


On 2013-11-04 15:57, Nate Coraor wrote:

Hi Nikolay,
With slurm, the following change that I backed out should fix the 
problem:


https://bitbucket.org/galaxy/galaxy-central/diff/lib/galaxy/jobs/runners/drmaa.py?diff2=d46b64f12c52&at=default
Although I do believe that if Galaxy doesn't read the completion
state before slurm "forgets" about the job (MinJobAge in slurm.conf),
this change could result in the job becoming permanently stuck in the
running state.
I should have some enhancements to the DRMAA runner for slurm coming
soon that would prevent this.
--nate
On Oct 31, 2013, at 5:27 AM, Nikolai Vazov wrote:


Hi,

I discovered a weird issue in the job behaviour: Galaxy is running a 
long job on a cluster (more than 24 h); about 15 hours in, it loses the 
connection with SLURM on the cluster and throws the following message:

[root@galaxy-prod01 galaxy-dist]# grep 3715200 paster.log
galaxy.jobs.runners.drmaa INFO 2013-10-30 10:51:54,149 (555) queued as 3715200
galaxy.jobs.runners.drmaa DEBUG 2013-10-30 10:51:55,149 (555/3715200) state change: job is queued and active
galaxy.jobs.runners.drmaa DEBUG 2013-10-30 10:52:13,516 (555/3715200) state change: job is running
galaxy.jobs.runners.drmaa INFO 2013-10-31 03:29:33,090 (555/3715200) job left DRM queue with following message: code 1: slurm_load_jobs error: Unable to contact slurm controller (connect failure),job_id: 3715200
Is there a timeout in Galaxy for contacting slurm? Meanwhile, the job 
itself is still running fine on the cluster ...

Thanks for the help, it's really urgent :)
Nikolay

--
Nikolay Vazov, PhD
Research Computing Centre - http://hpc.uio.no
USIT, University of Oslo


--
Nikolay Vazov, PhD
Research Computing Centre - http://hpc.uio.no
USIT, University of Oslo
___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
 http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
 http://galaxyproject.org/search/mailinglists/



