Hi Marco,

Thanks for all of the details; they make a big difference when
troubleshooting.  Sorry for the delay in responding.

Can you ensure that you don't have any of the pbs_* options set in
universe_wsgi.ini?  I noticed that there are stagein/stageout attributes
set on the job in your qstat -f output, which suggests that file staging
is enabled.
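
One quick way to check is to list every uncommented pbs_* line in the
file. The following is only a minimal sketch: the function name
active_pbs_options is made up for illustration, and the option names and
values in the sample text are illustrative rather than copied from your
configuration.

```python
def active_pbs_options(ini_text):
    """Return (name, value) pairs for uncommented pbs_* options."""
    found = []
    for raw in ini_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments and section headers.
        if not line or line[0] in "#;[":
            continue
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name.startswith("pbs_"):
            found.append((name, value.strip()))
    return found


# Illustrative sample only (hypothetical values, not a real config).
sample = """\
[app:main]
some_other_option = value
#pbs_application_server =
pbs_dataset_server = galaxy1
pbs_stage_path = /mnt/equallogic1/galaxy/tmp
"""

for name, value in active_pbs_options(sample):
    print(name, "=", value)
```

To run it against the real file, pass open("universe_wsgi.ini").read()
instead of the sample. With a shared filesystem, the list should
probably come back empty.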

You may also need to set $usecp in mom_priv/config on your execution
hosts to prevent pbs_mom from trying to rcp/scp the error and output
files back to galaxy1.  $usecp instructs pbs_mom to consider a given path
on the execution host to be the same filesystem as on the submission host,
so the files can be copied locally instead.  For example:

$usecp *:/mnt/equallogic1 /mnt/equallogic1
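
To illustrate what that line is meant to do, here is a rough sketch of
the mapping. It is not Torque's actual implementation: apply_usecp is a
hypothetical helper, and the matching is simplified to a host wildcard
plus a plain path-prefix test.

```python
from fnmatch import fnmatch


def apply_usecp(rule, destination):
    """Map 'host:/path' to a local path if the rule matches, else None.

    rule is the text after '$usecp', e.g.
    '*:/mnt/equallogic1 /mnt/equallogic1'.
    """
    source, local_prefix = rule.split()
    host_pattern, remote_prefix = source.split(":", 1)
    host, path = destination.split(":", 1)
    # Simplified check: wildcard match on host, prefix match on path.
    if fnmatch(host, host_pattern) and path.startswith(remote_prefix):
        return local_prefix + path[len(remote_prefix):]
    return None


rule = "*:/mnt/equallogic1 /mnt/equallogic1"
dest = "galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.e"
print(apply_usecp(rule, dest))
```

A destination that matches is treated as a local path that can be
written with a plain copy; one that does not match (the helper returns
None) would still go through rcp/scp.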


Marco Moretto wrote:
> Hi all,
> first of all, thanks to the Galaxy Team for this really useful software.
> Actually I don't really know if my problem is related to Galaxy or to
> Torque/Maui, but I didn't find any solution looking in both the Torque and
> Maui user lists, so I hope that some of you with more experience could give
> me some good advice. I'm trying to set up Galaxy in a small local virtual
> environment in order to test it. I started with 2 virtual ubuntu servers
> called galaxy1 and galaxy2. On galaxy1 I successfully installed Galaxy,
> Apache, Torque and Maui. I'm using Postgres as the DBMS. It is installed on
> another "real" DB server. The virtual server galaxy2 is used as node. Galaxy
> is working like a charm locally but when I try to use Torque problems arise.
> Torque alone works correctly. That means that I can submit a job with qsub
> and everything works. The 2 virtual servers (galaxy1 and galaxy2) share a
> directory (through NFS) in which I installed Galaxy following the "unified
> method" from the documentation.
> Now, as I said, Galaxy alone works, Torque/Maui alone works. When I put the
> two together nothing works.
> As a test I upload (using local runner) a gff file. Then I try to make a
> filter to the gff using "Filter and sort -> Extract features". When I run
> this tool the corresponding job on the Torque queue runs forever in Hold
> state. I report some output from diagnose programs:
> The diagnose -j reports the following:
> Name                  State Par Proc QOS     WCLimit R  Min     User  Group  Account  QueuedTime  Network  Opsys   Arch    Mem   Disk  Procs   Class Features
> 29                     Hold DEF    1 DEF     1:00:00 0    1   galaxy galaxy        -    00:02:36   [NONE] [NONE] [NONE]    >=0    >=0    NC0 [batch:1] [NONE]
> While the showq command reports
> ACTIVE JOBS--------------------
>      0 Active Jobs       0 of    1 Processors Active (0.00%)
> IDLE JOBS----------------------
> 0 Idle Jobs
> BLOCKED JOBS----------------
> 29                   galaxy       Hold     1     1:00:00  Wed May  4 03:56:40
> The checkjob reports:
> checking job 29
> State: Hold
> Creds:  user:galaxy  group:galaxy  class:batch  qos:DEFAULT
> WallTime: 00:00:00 of 1:00:00
> SubmitTime: Wed May  4 03:56:40
>   (Time Queued  Total: 00:03:07  Eligible: 00:00:01)
> The qstat -f reports
> Job Id: 33.galaxy1.research.intra.ismaa.it
>     Job_Name = 27_extract_features1_marco.more...@iasma.it
>     Job_Owner = gal...@galaxy1.research.intra.ismaa.it
>     job_state = W
>     queue = batch
>     server = galaxy1.research.intra.ismaa.it
>     ctime = Wed May  4 04:56:36 2011
>     Error_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.e
>     exec_host = galaxy2/0
>     exec_port = 15003
>     Execution_Time = Wed May  4 05:26:41 2011
>     mtime = Wed May  4 04:56:37 2011
>     Output_Path = galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.o
>     qtime = Wed May  4 04:56:36 2011
>     Resource_List.neednodes = 1
>     Resource_List.nodect = 1
>     Resource_List.nodes = 1
>     Resource_List.walltime = 01:00:00
>     stagein = /mnt/equallogic1/galaxy/tmp/dataset_18.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_18.dat,
>         /mnt/equallogic1/galaxy/tmp/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
>     stageout = /mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat@galaxy1:/mnt/equallogic1/galaxy/galaxy-dist/database/files/000/dataset_30.dat
>     substate = 37
>     Variable_List = PBS_O_QUEUE=batch,
>         PBS_O_HOST=galaxy1.research.intra.ismaa.it
>     euser = galaxy
>     egroup = galaxy
>     hashname = 33.galaxy1.research.intra.ismaa.it
>     queue_rank = 33
>     queue_type = E
> StartDate: -00:03:06  Wed May  4 03:56:41
> Total Tasks: 1
> Req[0]  TaskCount: 1  Partition: DEFAULT
> Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
> Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
> IWD: [NONE]  Executable:  [NONE]
> Bypass: 0  StartCount: 1
> PartitionMask: [ALL]
> PE:  1.00  StartPriority:  1
> cannot select job 29 for partition DEFAULT (non-idle state 'Hold')
> and finally the tracejob reports
> /var/spool/torque/server_priv/accounting/20110504: Permission denied
> /var/spool/torque/mom_logs/20110504: No such file or directory
> /var/spool/torque/sched_logs/20110504: No such file or directory
> Job: 33.galaxy1.research.intra.ismaa.it
> 05/04/2011 04:56:36  S    enqueuing into batch, state 1 hop 1
> 05/04/2011 04:56:36  S    Job Queued at request of gal...@galaxy1.research.intra.ismaa.it, owner = gal...@galaxy1.research.intra.ismaa.it, job name = 27_extract_features1_marco.more...@iasma.it, queue = batch
> 05/04/2011 04:56:37  S    Job Run at request of gal...@galaxy1.research.intra.ismaa.it
> 05/04/2011 04:56:41  S    Email 's' to gal...@galaxy1.research.intra.ismaa.it failed: Child process 'sendmail -f adm gal...@galaxy1.research.intra.ismaa.it' returned 127 (errno 10:No child processes)
> The only thing clear to me is that after submission the scheduler puts the
> job in a Hold state, but I cannot understand why. I also tried running the
> Galaxy-generated sh script with qsub. From the Galaxy log:
> galaxy.jobs.runners.pbs DEBUG 2011-05-04 04:56:36,748 (27) submitting file /mnt/equallogic1/galaxy/galaxy-dist/database/pbs/27.sh
> I copied the 27.sh script and ran it with the command:
> qsub 27.sh
> And the job runs correctly. So what is not clear to me is whether the
> problem is related to Torque/Maui or to the way in which the job is
> submitted from Galaxy.
> Sorry for the very long e-mail and thank you very much for any help.
> ---
> Marco

> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
