Re: [galaxy-dev] Galaxy with Univa Grid Engine (UGE) instead of SGE?

2013-01-23 Thread Peter Cock
On Thu, Jan 17, 2013 at 11:14 PM, Peter Cock p.j.a.c...@googlemail.com wrote:


 On Thursday, January 17, 2013, Carlos Borroto wrote:

 On Wed, Jan 16, 2013 at 7:28 AM, Peter Cock p.j.a.c...@googlemail.com
 wrote:
  Renaming the file to replace the colon with (say) an underscore allows
  a manual qsub to work fine with UGE. I've edited Galaxy to avoid the
  colons (patch below) but the submission still fails.
 

 Hi Peter,

 After seeing your email I now wonder if the problem I described
 here[1] and didn't get any answer about it is related to your findings
 while trying UGE.


 [1]http://dev.list.galaxyproject.org/Issue-when-enabling-use-tasked-jobs-with-torque-and-nfs-td4657294.html

 The only major difference I can see between job submission with and
 without the tasked option enabled is a colon in the job name.


 Some overlap, yes - I do normally have BLAST running with task
 splitting, which probably explains the source of the colon.
 I now suspect the colon is only a problem in SGE / UGE at the
 command line using qsub - it may work via the Python API.

Confirmed - on our system using UGE (and most likely on our old SGE
system too), task-split jobs whose shell scripts have a colon in the
filename do work when submitted via the Python API. However, the
colon prevents manually submitting the same script with qsub, which
is something I have found extremely useful for debugging, e.g.

$ qsub sleep:colon.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.

Escaping the colon did not help:

$ qsub sleep\:colon.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.

(Sadly the Univa Grid Engine (UGE) version of qsub does not appear
to have a version switch, so I'm not sure exactly which version
this is.)

For this reason alone, I would prefer that Galaxy avoid colons in
its job filenames - and I suspect there could be other corner cases
with other cluster systems and shared file-systems (as found by
Carlos).
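
For what it's worth, the kind of sanitising I have in mind is only a
few lines; a rough sketch (the helper name is my own invention, not
existing Galaxy code):

import re

def sanitize_job_filename(name):
    # UGE's qsub rejects colons in object names, so play safe and
    # collapse anything outside a conservative character set:
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)

print(sanitize_job_filename("galaxy_331:842.sh"))  # galaxy_331_842.sh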

Regards,

Peter


Re: [galaxy-dev] Galaxy with Univa Grid Engine (UGE) instead of SGE?

2013-01-16 Thread Peter Cock
On Tue, Jan 15, 2013 at 7:02 PM, Peter Cock p.j.a.c...@googlemail.com wrote:
 Hello all,

 Our local Galaxy server had been running happily under SGE, using
 one of the last free releases (not sure exactly which - I could ask).
 Due to concerns about long term maintenance, the SysAdmin has
 moved us to an SGE compatible setup - Univa Grid Engine (UGE).

 However, in at least one respect this is not a drop-in replacement:
 while other cluster usage appears to be working fine, our Galaxy
 installation is not, e.g.

 ...

 Debugging this by attempting a manual submission,

 $ qsub /mnt/galaxy/galaxy-central/database/pbs/galaxy_331:842.sh
 Unable to run job: Colon (':') not allowed in objectname.
 Exiting.

 Renaming the file to replace the colon with (say) an underscore allows
 a manual qsub to work fine with UGE. I've edited Galaxy to avoid the
 colons (patch below) but the submission still fails.

 Additionally, removing the SGE-specific settings in universe_wsgi.ini
 did allow the job to be submitted, but I am still having problems.
 Perhaps I need to fix all the other filenames too (e.g. stdout,
 stderr, error code), or do that in one go by removing the colon from
 the job name?

Part of the problem I am facing involves the SGE/UGE specific
arguments I have defined in universe_wsgi.ini (which still work
fine if I use them with qsub manually).

My original settings looked like this,

[galaxy:tool_runners]
ncbi_blastp_wrapper = drmaa://-V -l hostname=n08-04-008-*|n11-04-048-cortana -pe smp 4/

That worked fine in Galaxy with SGE, and the full string still works
with qsub manually under UGE. However, the -pe smp 4 part no longer
works when the job is submitted through Galaxy under UGE. Simplifying to:

[galaxy:tool_runners]
ncbi_blastp_wrapper = drmaa://-V -pe smp 4/

fails:

galaxy.jobs.handler INFO 2013-01-16 11:49:39,603 (346) Job dispatched
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,346 (346)
submitting file /mnt/galaxy/galaxy-central/database/pbs/galaxy_346.sh
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,347 (346) command
is: blastp -version > /mnt/galaxy/galaxy-central/database/tmp/GALAXY_VERSION_STRING_346;
blastp -query /mnt/galaxy/galaxy-central/database/files/000/dataset_344.dat
  -db /mnt/shared/cluster/blast/galaxy/oomycete_CDS -task blastp
-evalue 0.001 -out
/mnt/galaxy/galaxy-central/database/files/000/dataset_394.dat
-outfmt 6 -num_threads 8
galaxy.jobs.runners.drmaa DEBUG 2013-01-16 11:49:40,347 (346) spec: -pe smp 4
galaxy.jobs.runners.drmaa ERROR 2013-01-16 11:49:40,351 Uncaught
exception queueing job
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 146, in run_next
    self.queue_job( obj )
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 235, in queue_job
    job_id = self.ds.runJob(jt)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: error: no suitable queues

Clearly something is going wrong in passing the option to UGE. Note
this works at the command line:

$ qsub -pe smp 4 /mnt/galaxy/galaxy-central/database/pbs/galaxy_346.sh
Your job 252 (galaxy_346.sh) has been submitted
$ qstat | grep 252
252 0.60500 galaxy_346 galaxy       qw    01/16/2013 11:50:41                               4

If I remove this option, job submission works. Given that Galaxy
passes UGE the 'native spec' as a plain string, I don't think this is
a Galaxy problem; rather, it looks like an incompatibility between
UGE and SGE. I can probably work around this particular issue - there
are other ways to request four processors and/or a whole cluster node.
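
As a quick way to rule Galaxy out, the same native specification can
be handed straight to the drmaa Python bindings; a minimal sketch
along those lines (untested as written - /bin/sleep is just a
harmless stand-in command):

import drmaa
from drmaa.errors import DeniedByDrmException

s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "/bin/sleep"           # harmless stand-in command
jt.args = ["60"]
jt.nativeSpecification = "-V -pe smp 4"   # the spec UGE rejects via Galaxy
try:
    print("Submitted as job %s" % s.runJob(jt))
except DeniedByDrmException as err:
    print("Rejected by the DRM: %s" % err)
finally:
    s.deleteJobTemplate(jt)
    s.exit()

If that also fails with 'no suitable queues', the problem sits
between the DRMAA library and UGE rather than in anything Galaxy is
doing.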

So, to recap, I needed to remove the colons from the job script
filenames (crude patch in my previous email), and tweak my SGE/UGE
settings in the universe_wsgi.ini file.

I would also like to see a clear error message shown to the user
when a DeniedByDrmException is raised during job submission -
currently this is not handled gracefully at all.
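
Something along these lines in the drmaa runner's queue_job would
probably do - a rough, untested sketch, and I am assuming
job_wrapper.fail() is the right way to surface the message:

# in lib/galaxy/jobs/runners/drmaa.py, near the top:
from drmaa.errors import DeniedByDrmException

# ...and inside queue_job(), around the existing runJob call:
try:
    job_id = self.ds.runJob(jt)
except DeniedByDrmException as e:
    # give the user something readable instead of an uncaught traceback
    log.exception("(%s) DRM rejected the job: %s" % (galaxy_id_tag, e))
    job_wrapper.fail("The cluster rejected this job (%s) - please contact your Galaxy administrator." % e)
    self.ds.deleteJobTemplate(jt)
    return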

I've now had some cluster jobs succeed via Galaxy, but it does not
seem to be as reliable as under SGE. Perhaps there is some heavy
IO on the cluster at the moment which may be confusing things...

Peter


[galaxy-dev] Galaxy with Univa Grid Engine (UGE) instead of SGE?

2013-01-15 Thread Peter Cock
Hello all,

Our local Galaxy server had been running happily under SGE, using
one of the last free releases (not sure exactly which - I could ask).
Due to concerns about long term maintenance, the SysAdmin has
moved us to an SGE compatible setup - Univa Grid Engine (UGE).

However, in at least one respect this is not a drop-in replacement:
while other cluster usage appears to be working fine, our Galaxy
installation is not, e.g.

galaxy.jobs.runners.drmaa DEBUG 2013-01-15 17:14:33,660 (331:842)
submitting file
/mnt/galaxy/galaxy-central/database/pbs/galaxy_331:842.sh
galaxy.jobs.runners.drmaa DEBUG 2013-01-15 17:14:33,661 (331:842)
command is: /mnt/galaxy/galaxy-central/extract_dataset_parts.sh
/mnt/galaxy/galaxy-central/database/job_working_directory/000/331/task_0;
blastp -query 
/mnt/galaxy/galaxy-central/database/job_working_directory/000/331/task_0/dataset_344.dat
  -db /var/local/blast/ncbi/nr -task blastp -evalue 0.001 -out
/mnt/galaxy/galaxy-central/database/job_working_directory/000/331/task_0/dataset_373.dat
-outfmt 5 -num_threads 8
galaxy.jobs.runners.drmaa ERROR 2013-01-15 17:14:33,666 Uncaught
exception queueing job
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 146, in run_next
    self.queue_job( obj )
  File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/runners/drmaa.py", line 234, in queue_job
    job_id = self.ds.runJob(jt)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/mnt/galaxy/galaxy-central/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: error: no suitable queues

Debugging this by attempting a manual submission,

$ qsub /mnt/galaxy/galaxy-central/database/pbs/galaxy_331:842.sh
Unable to run job: Colon (':') not allowed in objectname.
Exiting.

Renaming the file to replace the colon with (say) an underscore allows
a manual qsub to work fine with UGE. I've edited Galaxy to avoid the
colons (patch below) but the submission still fails.

Additionally, removing the SGE-specific settings in universe_wsgi.ini
did allow the job to be submitted, but I am still having problems.
Perhaps I need to fix all the other filenames too (e.g. stdout,
stderr, error code), or do that in one go by removing the colon from
the job name?

Has anyone else tried Galaxy under UGE, and do you have any advice?

Thanks,

Peter

-- 

Quick filename hack to avoid colons in job script filenames - might
be better to avoid this in the job name itself?

$ hg diff
diff -r 1bfe2768026a lib/galaxy/jobs/runners/drmaa.py
--- a/lib/galaxy/jobs/runners/drmaa.py  Mon Jan 14 17:21:25 2013 +
+++ b/lib/galaxy/jobs/runners/drmaa.py  Tue Jan 15 18:44:31 2013 +
@@ -191,7 +191,7 @@
         job_name = ''.join( map( lambda x: x if x in ( string.letters + string.digits + '_' ) else '_', job_name ) )

         jt = self.ds.createJobTemplate()
-        jt.remoteCommand = "%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job_wrapper.get_id_tag())
+        jt.remoteCommand = ("%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job_wrapper.get_id_tag())).replace(":", "_")
         jt.jobName = job_name
         jt.outputPath = ":%s" % ofile
         jt.errorPath = ":%s" % efile
@@ -229,6 +229,7 @@

         log.debug( "(%s) submitting file %s" % ( galaxy_id_tag, jt.remoteCommand ) )
         log.debug( "(%s) command is: %s" % ( galaxy_id_tag, command_line ) )
+        log.debug( "(%s) spec: %s" % ( galaxy_id_tag, native_spec ) )
         # runJob will raise if there's a submit problem
         if self.external_runJob_script is None:
             job_id = self.ds.runJob(jt)
@@ -423,7 +424,7 @@
         drm_job_state.ofile = "%s.drmout" % os.path.join(os.getcwd(), job_wrapper.working_directory, job_wrapper.get_id_tag())
         drm_job_state.efile = "%s.drmerr" % os.path.join(os.getcwd(), job_wrapper.working_directory, job_wrapper.get_id_tag())
         drm_job_state.ecfile = "%s.drmec" % os.path.join(os.getcwd(), job_wrapper.working_directory, job_wrapper.get_id_tag())
-        drm_job_state.job_file = "%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job.get_id())
+        drm_job_state.job_file = ("%s/galaxy_%s.sh" % (self.app.config.cluster_files_directory, job.get_id())).replace(":", "_")
         drm_job_state.job_id = str( job_id )
         drm_job_state.runner_url = job_wrapper.get_job_runner_url()
         job_wrapper.command_line = job.get_command_line()
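
On reflection, rather than patching each filename separately, it
might be cleaner to sanitise the id tag once near the top of
queue_job and build every cluster filename from it; roughly (untested,
and the local variable name is my own):

# build every cluster filename from a pre-sanitised tag, so no colon
# (or anything else qsub dislikes) ever reaches UGE:
safe_id_tag = job_wrapper.get_id_tag().replace(":", "_")
jt.remoteCommand = "%s/galaxy_%s.sh" % ( self.app.config.cluster_files_directory, safe_id_tag )
# ...and likewise for the stdout/stderr/exit-code filenames and the
# job_file rebuilt during job recovery.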