Hi,

Our Galaxy instance runs jobs on an SGE cluster using two job handlers. The
SGE cluster uses a Job Submission Verifier (JSV) that rejects any job submission
that specifies a core binding strategy.


When Galaxy starts, the first job we submit works perfectly:

First job submission:

galaxy.jobs.manager DEBUG 2013-04-15 14:29:59,285 (194) Job assigned to handler 'handler0'
galaxy.jobs DEBUG 2013-04-15 14:29:59,934 (194) Working directory for job is: /scratch/nfs/galaxy.crg.es/job_working_directory/000/194
galaxy.jobs.handler DEBUG 2013-04-15 14:29:59,942 dispatching job 194 to drmaa runner
galaxy.jobs.handler INFO 2013-04-15 14:30:00,166 (194) Job dispatched
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:00,468 (194) submitting file /scratch/nfs/galaxy.crg.es/ogs/galaxy_194.sh
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:00,468 (194) command is: python /data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/tools/fastq/fastq_stats.py '/data/www-bi/galaxy.crg.es/files/000/dataset_4.dat' '/data/www-bi/galaxy.crg.es/files/000/dataset_238.dat' 'sanger'
galaxy.jobs.runners.drmaa INFO 2013-04-15 14:30:01,538 (194) queued as 458816
galaxy.jobs.runners.drmaa DEBUG 2013-04-15 14:30:02,115 (194/458816) state change: job is queued and active


# qstat -cb -j 458816
==============================================================
job_number:                 458816
exec_file:                  job_scripts/458816
submission_time:            Mon Apr 15 14:30:01 2013
owner:                      www-bi
uid:                        66401
group:                      www-bi
gid:                        501
sge_o_home:                 /data/www-bi
sge_o_log_name:             www-bi
sge_o_path: /data/galaxy/apache/galaxy.crg.es/htdocs/scripts/galaxy-env/bin:/software/galaxy/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/data/www-bi/bin
sge_o_shell:                /bin/bash
sge_o_workdir: /data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist
sge_o_host:                 galaxy
account:                    sge
stderr_path_list: NONE:galaxy:/scratch/nfs/galaxy.crg.es/job_working_directory/000/194/194.drmerr
reserve:                    y
hard resource_list:         virtual_free=12G,h_rt=21600
mail_list:                  www...@galaxy.crg.es
notify:                     FALSE
job_name:                   g194_fastq_stats_jtaly_crg_es
stdout_path_list: NONE:galaxy:/scratch/nfs/galaxy.crg.es/job_working_directory/000/194/194.drmout
jobshare:                   0
hard_queue_list:            www-el6
env_list:
script_file:                /scratch/nfs/galaxy.crg.es/ogs/galaxy_194.sh
parallel environment:  smp range: 2
verify_suitable_queues:     2
binding:                    set linear:2:0,0
scheduling info:            queue instance "pr-...@fenn.linux.crg.es" dropped because it is overloaded: np_load_avg=1.703333 (= 1.703333 + 0.50 * 0.000000 with nproc=12) >= 1.7
                            queue instance "sh...@node-ib0209bi.linux.crg.es" dropped because it is overloaded: np_load_avg=2.837500 (= 2.837500 + 0.50 * 0.000000 with nproc=8) >= 1.3
                            queue instance "l...@node-ib0209bi.linux.crg.es" dropped because it is overloaded: np_load_avg=2.837500 (= 2.837500 + 0.50 * 0.000000 with nproc=8) >= 1.3


The core binding was added by our JSV script, which is the expected behaviour.


But our second submission fails:

galaxy.jobs.runners.drmaa ERROR 2013-04-15 14:30:56,263 Uncaught exception queueing job
Traceback (most recent call last):
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 144, in run_next
    self.queue_job( obj )
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 232, in queue_job
    job_id = self.ds.runJob(jt)
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/__init__.py", line 331, in runJob
    _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/helpers.py", line 213, in c
    return f(*(args + (error_buffer, sizeof(error_buffer))))
  File "/data/www-bi/apache/galaxy.crg.es/htdocs/galaxy-dist/eggs/drmaa-0.4b3-py2.6.egg/drmaa/errors.py", line 90, in error_check
    raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
DeniedByDrmException: code 17: contact us: x...@xxx.es


If we look at the submitted parameters, as dumped by the JSV:

# cat /tmp/qsub_err.txt
$VAR1 = {
          'w' => 'e',
          'N' => 'g195_fastq_stats_jtaly_crg_es',
          'binding_amount' => '2',
          'CMDNAME' => '/scratch/nfs/galaxy.crg.es/ogs/galaxy_195.sh',
          'binding_type' => 'set',
          'M' => {
                   'www...@galaxy.crg.es' => undef
                 },
          'binding_strategy' => 'linear',
          'l_hard' => {
                        'virtual_free' => '12G',
                        'h_rt' => '6:00:00'
                      },
          'shell' => 'n',
          'pe_min' => '2',
          'USER' => 'www-bi',
          'binding_socket' => '0',
          'e' => {
                   '/scratch/nfs/galaxy.crg.es/job_working_directory/000/195/195.drmerr' => undef
                 },
          'GROUP' => 'www-bi',
          'binding_core' => '0',
          'pe_max' => '2',
          'CMDARGS' => '0',
          'q_hard' => {
                        'www-el6' => undef
                      },
          'pe_name' => 'smp',
          'CLIENT' => 'drmaa',
          'b' => 'y',
          'R' => 'y',
          'VERSION' => '1.0',
          'CONTEXT' => 'client',
          'o' => {
                   '/scratch/nfs/galaxy.crg.es/job_working_directory/000/195/195.drmout' => undef
                 }
        };

The request already contains a core binding strategy (the binding_* entries).


The problem is that the second job submission inherits submission
parameters from the first job and, since the JSV script does not allow the user
to specify a core binding strategy, the job is rejected.
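
To make the rejection rule concrete, here is a minimal Python sketch of the kind of check involved. It is only an illustration and not our actual JSV script: the helper name binding_params and the "binding_" prefix test are assumptions made for this example, and the dict is a subset of the parameters dumped for job 195 above.

```python
# Hypothetical helper (not part of SGE or Galaxy): pick out the core-binding
# entries from a dict of submission parameters like the one dumped above.
BINDING_PREFIX = "binding_"


def binding_params(params):
    """Return the entries whose key starts with 'binding_'."""
    return {k: v for k, v in params.items() if k.startswith(BINDING_PREFIX)}


# Subset of the parameters received for job 195.
job_195 = {
    "N": "g195_fastq_stats_jtaly_crg_es",
    "binding_amount": "2",
    "binding_type": "set",
    "binding_strategy": "linear",
    "binding_socket": "0",
    "binding_core": "0",
    "pe_name": "smp",
    "CLIENT": "drmaa",
}

found = binding_params(job_195)
if found:
    # A non-empty result means the client already asked for core binding,
    # which is the situation in which the job gets rejected.
    print("client-supplied binding parameters:", sorted(found))
```

For a clean submission, such as the first job before the JSV adds its own binding, the result would be empty and the job would go through.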

If we wait some time (600 seconds), a new submission works again.

We are wondering whether anyone can help us understand why the submission parameters are inherited from one job to the next. Maybe the DRMAA session is not properly closed, or the environment is not cleaned up?
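
One thing we could test outside Galaxy is whether the problem still appears when every submission uses a job template that is created and deleted for that job only. Below is a minimal sketch of such a test. The function name submit_with_fresh_template is hypothetical, and the session argument is assumed to behave like an initialized drmaa.Session from the Python drmaa package (createJobTemplate, runJob, deleteJobTemplate). We do not know whether template reuse is the actual cause.

```python
def submit_with_fresh_template(session, script_path, native_spec=None):
    """Submit one script with a job template that lives only for this call.

    `session` is assumed to behave like an initialized drmaa.Session.
    Returns whatever job id the session's runJob() returns.
    """
    jt = session.createJobTemplate()
    try:
        jt.remoteCommand = script_path
        if native_spec:
            # Extra scheduler options, passed through unchanged.
            jt.nativeSpecification = native_spec
        return session.runJob(jt)
    finally:
        # Release the template even if runJob() raises, so that nothing set
        # on it is reused by a later call to this function.
        session.deleteJobTemplate(jt)
```

If two consecutive calls with the same session still show the binding parameters in the second request, the inherited state would have to live somewhere other than the job template, for example in the DRMAA session or in the client library.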

Thank you for your help

Best

Jean-François

$hg summary
parent: 8795:9fd7fe0c5712
 merge from stable
branch: default
commit: 1 modified, 59 unknown
update: (current)

--
#####################################
Jean-François Taly
Bioinformatician

Bioinformatics Core Facility
http://biocore.crg.cat
CRG - Centre de Regulació Genòmica (Room 439)
Parc de Recerca Biomèdica de Barcelona (PRBB)
Doctor Aiguader, 88
08003 Barcelona
Spain

email: jean-francois.t...@crg.eu
phone: +34 93 316 0202
fax: +34 93 316 0099
#####################################




