Re: [galaxy-dev] jobs submitted to a cluster

2014-06-12 Thread Evan Bollig
Hey Donny,

What is the value of keep_completed on your queue (from qmgr -c 'p
s')? Could it be that your spool is flushing completed jobs
immediately? I ran into issues the other day with libdrmaa requiring
at least keep_completed = 60 seconds to properly detect completed jobs
and clean up after itself.
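
As a rough sketch, you can check and raise it like this (the queue name comes from your log; 600 is just an example value, not a recommendation):

```shell
# Print the server/queue configuration and look for keep_completed
qmgr -c 'print server' | grep -i keep_completed

# If unset or very low, let completed jobs linger long enough for
# DRMAA polling to observe the finished state
qmgr -c 'set queue genacc_q keep_completed = 600'
```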

Cheers,

-E


-Evan Bollig
Research Associate | Application Developer | User Support Consultant
Minnesota Supercomputing Institute
599 Walter Library
612 624 1447
e...@msi.umn.edu
boll0...@umn.edu


On Thu, Jun 12, 2014 at 7:36 AM, Shrum, Donald C dcsh...@admin.fsu.edu wrote:
 I've set up Galaxy to submit jobs to our HPC cluster as the logged-in user.  I 
 used the drmaa python module to submit the jobs to our moab server.
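
 For reference, a minimal sketch of that kind of submission with the python 
 drmaa module (the script path is hypothetical; the native specification 
 mirrors the one in the log below):

```python
# Hedged sketch of a drmaa submission; requires libdrmaa configured
# for the local DRM. Script path is illustrative only.
import drmaa

s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = '/path/to/galaxy_15.sh'
    jt.nativeSpecification = '-N galaxyjob -l nodes=1,walltime=2:00 -q genacc_q'
    job_id = s.runJob(jt)
    print('queued as', job_id)
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```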

 It appears that the working directory for a submitted job is being removed by 
 galaxy prior to the job completing on the cluster.

 I can see a working directory is created in the logs:
 galaxy.jobs DEBUG 2014-06-12 08:21:03,786 (15) Working directory for job is: 
 /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15

 I've confirmed the directory is created by watching the file system; within 
 about two seconds of being created, the folder is deleted.
 [root@admin 000]# watch -d ls -lR
 Every 2.0s: ls -lR                                 Thu Jun 12 08:21:06 2014
 total 64
 drwxrwxrwx 2 dcshrum dcshrum 4096 Jun 12 08:21 15


 I see the job sent via DRMAA:
 galaxy.jobs.handler DEBUG 2014-06-12 08:21:03,795 (15) Dispatching to drmaa 
 runner
 galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,566 (15) submitting file 
 /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh
 galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,566 (15) native 
 specification is: -N galaxyjob -l nodes=1,walltime=2:00 -q genacc_q
 galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:05,892 (15) submitting with 
 credentials: dcshrum [uid: 232706]
 galaxy.jobs.runners.drmaa INFO 2014-06-12 08:21:06,196 (15) queued as 
 7570705.moab.local

 The job fails:
 galaxy.jobs.runners.drmaa DEBUG 2014-06-12 08:21:06,698 
 (15/7570705.moab.local) state change: job finished, but failed
 galaxy.jobs.runners DEBUG 2014-06-12 08:21:07,124 (15/7570705.moab.local) 
 Unable to cleanup 
 /panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh:
  [Errno 2] No such file or directory: 
 '/panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.sh'


 I can see the same error in my moab log:
 *** error from copy
 /bin/cp: cannot create regular file 
 `/panfs/storage.local/software/galaxy-dist/database/job_working_directory/000/15/galaxy_15.o':
  No such file or directory
 *** end error output


 Any idea as to why galaxy removes the working directory?  Is there a setting 
 in the job_conf.xml that would resolve this?

 Thanks for any pointers.

 Donny
 FSU Research Computing Center


 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
   http://lists.bx.psu.edu/

 To search Galaxy mailing lists use the unified search at:
   http://galaxyproject.org/search/mailinglists/



Re: [galaxy-dev] jobs submitted to a cluster

2014-06-12 Thread Shrum, Donald C
It's set to 600 seconds, so I don't think that is the issue... Is there some 
sort of wait time to set in job_conf.xml?



Re: [galaxy-dev] jobs submitted to a cluster

2014-06-12 Thread Evan Bollig
job_conf.xml is outside my area of knowledge. Better to wait and see what the
others can tell us.

-E
-Evan Bollig
Research Associate | Application Developer | User Support Consultant
Minnesota Supercomputing Institute
599 Walter Library
612 624 1447
e...@msi.umn.edu
boll0...@umn.edu




Re: [galaxy-dev] jobs submitted to a cluster

2014-06-12 Thread John Chilton
My guess is that Galaxy is deleting the directory because it believes the
job is in error, due to some communication problem while polling
your DRM via DRMAA - Galaxy thinks the job has failed before it has
even run.

You can set

cleanup_job = never

in universe_wsgi.ini's app:main section to instruct Galaxy to not
delete the working directory. I suspect this will allow the DRM to
finish running your job - but Galaxy is still going to fail it since
it cannot properly detect its status.
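
A minimal sketch of where that setting lives (the section name is from the stock config; the surrounding comment is illustrative):

```ini
# universe_wsgi.ini
[app:main]
# Never delete job working directories, even for failed jobs -
# useful while debugging premature cleanup.
cleanup_job = never
```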

Can you confirm?

-John







Re: [galaxy-dev] jobs submitted to a cluster

2014-06-12 Thread Shrum, Donald C
Hi John,

That did the trick.  I have some other problem, but I don't think it's Galaxy 
from here.

Thanks again for the reply.

Donny
