Re: [galaxy-dev] condor jobs

2015-01-30 Thread John Chilton
On Fri, Jan 30, 2015 at 9:24 AM, Shrum, Donald C  wrote:
> Hi all,
>
> I’ve configured one of our tools to submit jobs to our condor cluster.  I
> can see the job is routed to the condor runner:
>
> ==> handler4.log <==
> galaxy.jobs.handler DEBUG 2015-01-30 09:14:58,092 (508) Dispatching to condor runner
> galaxy.jobs DEBUG 2015-01-30 09:14:58,204 (508) Persisting job destination (destination id: condor)
>
> I can see that indeed the job is submitted to the condor cluster:
>
> [root@galaxy galaxy-dist]# condor_q
> -- Submitter: galaxy.local : <10.177.61.90:55265> : galaxy.local
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>   21.0   galaxy          1/30 09:15   0+00:00:02 R  0   0.0  galaxy_509.sh
>
> The job begins to run:
>
> ==> handler4.log <==
> galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:03,827 (508/20) job is now running
> galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:05,183 (508/20) job has completed
>
> Galaxy is almost immediately removing the job working directory:
>
> Here is a snippet of the errors:
>
> ==> handler4.log <==
> galaxy.jobs.runners DEBUG 2015-01-30 09:15:06,372 (508/20) Unable to cleanup /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec: [Errno 2] No such file or directory: '/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec'
> galaxy.jobs DEBUG 2015-01-30 09:15:06,816 setting dataset state to ERROR
> galaxy.datatypes.metadata DEBUG 2015-01-30 09:15:06,996 Failed to cleanup MetadataTempFile temp files from /panfs/storage.local/galaxy-data/job_working_directory/000/508/metadata_out_HistoryDatasetAssociation_717_zrzVqh: No JSON object could be decoded
>
> Is it possible galaxy is attempting to query condor and see if the job is
> running, not finding anything and deciding that the job is not running and
> bailing out?
>
> I’ve reconstructed the process step by step using the logs but I have not
> been able to see exactly where the condor_submit command is shown so I can
> try to submit the same job manually.
>
> Does anyone have a suggestion for debugging this?

The two most relevant files would be:

https://bitbucket.org/galaxy/galaxy-central/src/tip/lib/galaxy/jobs/runners/condor.py
https://bitbucket.org/galaxy/galaxy-central/src/tip/lib/galaxy/jobs/runners/util/condor/__init__.py

The condor job logic strikes me as pretty brittle - it has always
worked for me when I have tested it, but given how it is reading logs
I would imagine very small changes to condor might cause it to fail.
So one thing to check is summarize_condor_log in
lib/galaxy/jobs/runners/util/condor/__init__.py - to see if that logic
matches the way your condor produces log files.
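
To check what your condor actually writes for a given job, something
along these lines may help. It is only a rough sketch and not the code
in summarize_condor_log: it assumes the usual condor user log layout,
where each event starts with a three-digit event code followed by
(cluster.proc.subproc). The function name, the regex and the sample log
are all made up for illustration.

```python
import re

# Event codes as they usually appear in a condor user log.
EVENT_NAMES = {
    "000": "submitted",
    "001": "executing",
    "004": "evicted",
    "005": "terminated",
    "009": "aborted",
}

# Assumed event header shape: "005 (021.000.000) 01/30 09:15:04 Job terminated."
EVENT_LINE = re.compile(r"^(\d{3}) \((\d+)\.\d+\.\d+\)")


def events_for_cluster(log_text, cluster_id):
    """Return the ordered list of event names logged for one cluster id."""
    seen = []
    for line in log_text.splitlines():
        match = EVENT_LINE.match(line)
        # Compare as integers so "21" matches the zero-padded "021".
        if match and int(match.group(2)) == int(cluster_id):
            seen.append(EVENT_NAMES.get(match.group(1), match.group(1)))
    return seen


# Invented sample log for illustration only, not taken from this thread.
SAMPLE_LOG = """\
000 (021.000.000) 01/30 09:15:00 Job submitted from host: <10.0.0.1:9618>
...
001 (021.000.000) 01/30 09:15:02 Job executing on host: <10.0.0.2:9618>
...
005 (021.000.000) 01/30 09:15:04 Job terminated.
\t(1) Normal termination (return value 0)
...
"""

print(events_for_cluster(SAMPLE_LOG, "21"))
print(events_for_cluster(SAMPLE_LOG, "20"))
```

If the log written for your job shows no events for the id Galaxy is
tracking, or the event lines look different from the shape assumed
here, that would point at the log parsing.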

Given the symptoms - it might also be worth just sleeping for 5
seconds in condor_submit in
lib/galaxy/jobs/runners/util/condor/__init__.py and then verifying
that Galaxy is actually parsing the correct external id.

Here is an untested diff that adds the sleep and some more log
statements that might help:
https://gist.github.com/jmchilton/d0afd7242370642d5b43
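
Separately from that diff, here is a small standalone sketch of the
external id check, in case it is useful for testing by hand. It assumes
condor_submit prints its usual "N job(s) submitted to cluster M." line;
the regex and function name are hypothetical and are not what Galaxy
itself uses.

```python
import re

# Assumed condor_submit output line: "1 job(s) submitted to cluster 21."
CLUSTER_RE = re.compile(r"submitted to cluster (\d+)\.")


def parse_cluster_id(submit_output):
    """Return the cluster id as a string, or None if it cannot be found."""
    match = CLUSTER_RE.search(submit_output)
    return match.group(1) if match else None


# Invented sample output for illustration only.
sample_output = "Submitting job(s).\n1 job(s) submitted to cluster 21.\n"
print(parse_cluster_id(sample_output))
```

The id this returns for a manual submission can be compared against
the ID column of condor_q and against the second number in the
(job/external id) pairs in the handler log.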

If you are able to fix the problem - please let us know how so we can
fix it upstream.

-John

>
> Thanks,
> Don
> Florida State University
> Research Computing Center
>
> ___
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   https://lists.galaxyproject.org/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/

[galaxy-dev] condor jobs

2015-01-30 Thread Shrum, Donald C
Hi all,

I've configured one of our tools to submit jobs to our condor cluster.  I can 
see the job is routed to the condor runner:

==> handler4.log <==
galaxy.jobs.handler DEBUG 2015-01-30 09:14:58,092 (508) Dispatching to condor runner
galaxy.jobs DEBUG 2015-01-30 09:14:58,204 (508) Persisting job destination (destination id: condor)

I can see that indeed the job is submitted to the condor cluster:
[root@galaxy galaxy-dist]# condor_q
-- Submitter: galaxy.local : <10.177.61.90:55265> : galaxy.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  21.0   galaxy          1/30 09:15   0+00:00:02 R  0   0.0  galaxy_509.sh


The job begins to run:
==> handler4.log <==
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:03,827 (508/20) job is now running
galaxy.jobs.runners.condor DEBUG 2015-01-30 09:15:05,183 (508/20) job has completed

Galaxy is almost immediately removing the job working directory:

Here is a snippet of the errors:
==> handler4.log <==
galaxy.jobs.runners DEBUG 2015-01-30 09:15:06,372 (508/20) Unable to cleanup /panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec: [Errno 2] No such file or directory: '/panfs/storage.local/opt/galaxy-dist/database/pbs/galaxy_508.ec'
galaxy.jobs DEBUG 2015-01-30 09:15:06,816 setting dataset state to ERROR
galaxy.datatypes.metadata DEBUG 2015-01-30 09:15:06,996 Failed to cleanup MetadataTempFile temp files from /panfs/storage.local/galaxy-data/job_working_directory/000/508/metadata_out_HistoryDatasetAssociation_717_zrzVqh: No JSON object could be decoded


Is it possible galaxy is attempting to query condor and see if the job is 
running, not finding anything and deciding that the job is not running and 
bailing out?

I've reconstructed the process step by step using the logs but I have not been 
able to see exactly where the condor_submit command is shown so I can try to 
submit the same job manually.

Does anyone have a suggestion for debugging this?

Thanks,
Don
Florida State University
Research Computing Center