Re: [galaxy-dev] Error: Job output not returned from cluster

2012-04-30 Thread Nate Coraor
On Apr 25, 2012, at 8:25 AM, Louise-Amélie Schmitt wrote:

 Hi,
 
 Thanks a lot, it actually helped. It is not exactly as straightforward in 
 drmaa.py, but somehow I managed.
 
 However, it was not the problem. For some reason, the user needs to write 
 files from the node to job_working_directory/00X// and the latter is not 
 world-writable. I had to chmod everything to 777 to make it work. Did I 
 miss something?

That doesn't seem right.  The subdirectory for your job underneath 
job_working_directory/00X/XXX/ should be chowned to the real user before the 
job is submitted and then chowned back to the galaxy user once the job is 
complete.
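
For illustration, the chown-before-submit pattern described above could be sketched like this (a minimal sketch; `chown_cmd` and `run_as_real_user` are hypothetical names, not Galaxy's actual API):

```python
import subprocess

def chown_cmd(path, user):
    # Recursively hand the job working directory to `user`.
    # Run via sudo so the galaxy user is allowed to change ownership.
    return ["sudo", "chown", "-R", user, path]

def run_as_real_user(job_dir, real_user, galaxy_user, submit):
    # Give the directory to the real user, submit, and always
    # give it back to the galaxy user once the job is done.
    subprocess.check_call(chown_cmd(job_dir, real_user))
    try:
        return submit()
    finally:
        subprocess.check_call(chown_cmd(job_dir, galaxy_user))
```

With this in place the job working directory never needs to be world-writable.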

--nate

 
 Best,
 L-A
 
 
 
 On 24/04/2012 15:17, Alban Lermine wrote:
 Hi L-A,
 
 I run Galaxy as the real user on our cluster with PBS (free version).
 
 We first configured LDAP authentication so that each email account maps to
 a unix account (we just cut off the @curie.fr).
 Then I modified pbs.py (in
 GALAXY_DIR/galaxy-dist/lib/galaxy/jobs/runners).
 
 I simply disconnected the PBS submission through the python library and
 replaced it with a system call (just like sending jobs to the cluster from
 the command line); here is the code used:
 
  galaxy_job_id = job_wrapper.job_id
  log.debug( "(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
  log.debug( "(%s) command is: %s" % ( galaxy_job_id, command_line ) )
 
  # Submit the job with a system call instead of the python PBS library -
  # permits running jobs as the real user with a "sudo -u" command prefix
 
  galaxy_job_idSTR = str(job_wrapper.job_id)
  galaxy_tool_idSTR = str(job_wrapper.tool.id)
  galaxy_job_name = galaxy_job_idSTR + "_" + galaxy_tool_idSTR + "_" + job_wrapper.user
  torque_options = runner_url.split("/")
  queue = torque_options[3]
  ressources = torque_options[4]
  user_mail = job_wrapper.user.split("@")
  username = user_mail[0]
 
  torque_cmd = "sudo -u " + username + " echo \"" + command_line + "\" | qsub" \
      + " -o " + ofile + " -e " + efile + " -M " + job_wrapper.user \
      + " -N " + galaxy_job_name + " -q " + queue + " " + ressources
 
  submit_pbs_job = os.popen(torque_cmd)
 
  job_id = submit_pbs_job.read().rstrip("\n")
 
  # Original job launcher
  #job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
 
  pbs.pbs_disconnect(c)
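
As a side note, `os.popen` is deprecated; the same submission can be expressed with `subprocess`. A sketch under the same variable names (the helpers `build_torque_cmd` and `submit` are hypothetical, not part of the original patch):

```python
import subprocess

def build_torque_cmd(username, command_line, ofile, efile, email, job_name, queue, ressources):
    # Same shell pipeline as above, assembled in one place for readability
    return ('sudo -u ' + username + ' echo "' + command_line + '" | qsub'
            + ' -o ' + ofile + ' -e ' + efile + ' -M ' + email
            + ' -N ' + job_name + ' -q ' + queue + ' ' + ressources)

def submit(torque_cmd):
    # check_output raises on a non-zero qsub exit instead of silently
    # returning an empty job id
    out = subprocess.check_output(torque_cmd, shell=True)
    return out.decode().rstrip("\n")
```

The returned string is the PBS job id, exactly as the `os.popen(...).read()` version produced.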
 
 The second thing I did was to wait for the error and output files from torque
 in the finish_job function (otherwise I never received the output, which seems
 to be your problem); here is the code used:
 
 def finish_job( self, pbs_job_state ):
     """
     Get the output/error for a finished job, pass to `job_wrapper.finish`
     and cleanup all the PBS temporary files.
     """
     ofile = pbs_job_state.ofile
     efile = pbs_job_state.efile
     job_file = pbs_job_state.job_file
 
     # collect the output
     try:
         # With the qsub system call, we need to wait for efile and ofile
         # to be created at the end of the job execution before reading them
         while not os.path.isfile(efile):
             time.sleep( 1 )
         while not os.path.isfile(ofile):
             time.sleep( 1 )
 
         # Back to original code
         ofh = file(ofile, "r")
         efh = file(efile, "r")
         stdout = ofh.read( 32768 )
         stderr = efh.read( 32768 )
     except:
         stdout = ''
         stderr = 'Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.'
         log.debug(stderr)
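
One caveat about the busy-wait above: if the cluster never writes the files, finish_job spins forever. A bounded variant could look like this (hypothetical helper, not part of the original patch; the 120 s default is an arbitrary assumption, and `isfile`/`sleep` are injectable only to keep the helper testable):

```python
import os
import time

def wait_for_files(paths, timeout=120, interval=1,
                   isfile=os.path.isfile, sleep=time.sleep):
    # Poll until every path exists; give up after `timeout` seconds.
    waited = 0
    while not all(isfile(p) for p in paths):
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True
```

finish_job could then log the existing "Job output not returned" message when wait_for_files returns False, instead of hanging.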
 
 * The last step is to allow the galaxy user to run sudo.
 
 
 Hope this helps you find your problem.
 
 See you,
 
 Alban
 
 
 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
 
 http://lists.bx.psu.edu/
 




Re: [galaxy-dev] Error: Job output not returned from cluster

2012-04-25 Thread Louise-Amélie Schmitt

Hi,

Thanks a lot, it actually helped. It is not exactly as straightforward in 
drmaa.py, but somehow I managed.


However, it was not the problem. For some reason, the user needs to 
write files from the node to job_working_directory/00X// and the 
latter is not world-writable. I had to chmod everything to 777 to make 
it work. Did I miss something?


Best,
L-A



On 24/04/2012 15:17, Alban Lermine wrote:

Hi L-A,

I run Galaxy as the real user on our cluster with PBS (free version).

We first configured LDAP authentication so that each email account maps to
a unix account (we just cut off the @curie.fr).
Then I modified pbs.py (in
GALAXY_DIR/galaxy-dist/lib/galaxy/jobs/runners).

I simply disconnected the PBS submission through the python library and
replaced it with a system call (just like sending jobs to the cluster from
the command line); here is the code used:

 galaxy_job_id = job_wrapper.job_id
 log.debug( "(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
 log.debug( "(%s) command is: %s" % ( galaxy_job_id, command_line ) )

 # Submit the job with a system call instead of the python PBS library -
 # permits running jobs as the real user with a "sudo -u" command prefix

 galaxy_job_idSTR = str(job_wrapper.job_id)
 galaxy_tool_idSTR = str(job_wrapper.tool.id)
 galaxy_job_name = galaxy_job_idSTR + "_" + galaxy_tool_idSTR + "_" + job_wrapper.user
 torque_options = runner_url.split("/")
 queue = torque_options[3]
 ressources = torque_options[4]
 user_mail = job_wrapper.user.split("@")
 username = user_mail[0]

 torque_cmd = "sudo -u " + username + " echo \"" + command_line + "\" | qsub" \
     + " -o " + ofile + " -e " + efile + " -M " + job_wrapper.user \
     + " -N " + galaxy_job_name + " -q " + queue + " " + ressources

 submit_pbs_job = os.popen(torque_cmd)

 job_id = submit_pbs_job.read().rstrip("\n")

 # Original job launcher
 #job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)

 pbs.pbs_disconnect(c)

The second thing I did was to wait for the error and output files from torque
in the finish_job function (otherwise I never received the output, which seems
to be your problem); here is the code used:

def finish_job( self, pbs_job_state ):
    """
    Get the output/error for a finished job, pass to `job_wrapper.finish`
    and cleanup all the PBS temporary files.
    """
    ofile = pbs_job_state.ofile
    efile = pbs_job_state.efile
    job_file = pbs_job_state.job_file

    # collect the output
    try:
        # With the qsub system call, we need to wait for efile and ofile
        # to be created at the end of the job execution before reading them
        while not os.path.isfile(efile):
            time.sleep( 1 )
        while not os.path.isfile(ofile):
            time.sleep( 1 )

        # Back to original code
        ofh = file(ofile, "r")
        efh = file(efile, "r")
        stdout = ofh.read( 32768 )
        stderr = efh.read( 32768 )
    except:
        stdout = ''
        stderr = 'Job output not returned by PBS: the output datasets were deleted while the job was running, the job was manually dequeued or there was a cluster error.'
        log.debug(stderr)

* The last step is to allow the galaxy user to run sudo.


Hope this helps you find your problem.

See you,

Alban




Re: [galaxy-dev] Error: Job output not returned from cluster

2012-04-24 Thread Louise-Amélie Schmitt
At first we thought it could be an ssh issue but submitting jobs and 
getting the output back isn't a problem when I do it from my personal 
user manually, so it's really related to Galaxy. We're using PBS Pro btw.


And I'm still at a loss... :(

L-A

On 23/04/2012 15:42, zhengqiu cai wrote:

I am having the same problem when I use condor as the scheduler instead of sge.

Cai

--- On Monday, 23 April 2012, Louise-Amélie Schmitt <louise-amelie.schm...@embl.de> wrote:


From: Louise-Amélie Schmitt <louise-amelie.schm...@embl.de>
Subject: [galaxy-dev] Error: Job output not returned from cluster
To: galaxy-dev@lists.bx.psu.edu
Date: Monday, 23 April 2012, 5:09 PM
Hello everyone,

I'm still trying to set up the job submission as the real
user, and I get a mysterious error. The job obviously runs
somewhere and when it ends it is in error state and displays
the following message: Job output not returned from
cluster

In the Galaxy log I have the following lines when the job
finishes running:

galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,509
(1455/9161620.pbs-master2.embl.de) state change: job
finished, but failed
galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,511 Job
output not returned from cluster
galaxy.jobs DEBUG 2012-04-23 10:36:41,547 finish(): Moved
/g/funcgen/galaxy-dev/database/job_working_directory/001/1455/galaxy_dataset_2441.dat
to
/g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat
galaxy.jobs DEBUG 2012-04-23 10:36:41,755 job 1455 ended
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,755
Cleaning up external metadata files
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,768
Failed to cleanup MetadataTempFile temp files from
/g/funcgen/galaxy-dev/database/job_working_directory/001/1455/metadata_out_HistoryDatasetAssociation_1606_npFIJM:
No JSON object could be decoded: line 1 column 0 (char 0)

The
/g/funcgen/galaxy-dev/database/job_working_directory/001/1455/
directory is empty and
/g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat
exists but is empty.

Any ideas about what can go wrong there? Any lead would be
immensely appreciated!

Thanks,
L-A





Re: [galaxy-dev] Error: Job output not returned from cluster

2012-04-24 Thread Alban Lermine
On 24/04/2012 14:53, Louise-Amélie Schmitt wrote:
 At first we thought it could be an ssh issue but submitting jobs and
 getting the output back isn't a problem when I do it from my personal
 user manually, so it's really related to Galaxy. We're using PBS Pro btw.

 And I'm still at a loss... :(

 L-A

 On 23/04/2012 15:42, zhengqiu cai wrote:
 I am having the same problem when I use condor as the scheduler
 instead of sge.

 Cai

 --- On Monday, 23 April 2012, Louise-Amélie
 Schmitt <louise-amelie.schm...@embl.de> wrote:

 From: Louise-Amélie Schmitt <louise-amelie.schm...@embl.de>
 Subject: [galaxy-dev] Error: Job output not returned from cluster
 To: galaxy-dev@lists.bx.psu.edu
 Date: Monday, 23 April 2012, 5:09 PM
 Hello everyone,

 I'm still trying to set up the job submission as the real
 user, and I get a mysterious error. The job obviously runs
 somewhere and when it ends it is in error state and displays
 the following message: Job output not returned from
 cluster

 In the Galaxy log I have the following lines when the job
 finishes running:

 galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,509
 (1455/9161620.pbs-master2.embl.de) state change: job
 finished, but failed
 galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,511 Job
 output not returned from cluster
 galaxy.jobs DEBUG 2012-04-23 10:36:41,547 finish(): Moved
 /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/galaxy_dataset_2441.dat

 to
 /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat
 galaxy.jobs DEBUG 2012-04-23 10:36:41,755 job 1455 ended
 galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,755
 Cleaning up external metadata files
 galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,768
 Failed to cleanup MetadataTempFile temp files from
 /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/metadata_out_HistoryDatasetAssociation_1606_npFIJM:

 No JSON object could be decoded: line 1 column 0 (char 0)

 The
 /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/
 directory is empty and
 /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat
 exists but is empty.

 Any ideas about what can go wrong there? Any lead would be
 immensely appreciated!

 Thanks,
 L-A

Hi L-A,

I run Galaxy as the real user on our cluster with PBS (free version).

We first configured LDAP authentication so that each email account maps to
a unix account (we just cut off the @curie.fr).
Then I modified pbs.py (in
GALAXY_DIR/galaxy-dist/lib/galaxy/jobs/runners).

I simply disconnected the PBS submission through the python library and
replaced it with a system call (just like sending jobs to the cluster from
the command line); here is the code used:

 galaxy_job_id = job_wrapper.job_id
 log.debug( "(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
 log.debug( "(%s) command is: %s" % ( galaxy_job_id, command_line ) )

 # Submit the job with a system call instead of the python PBS library -
 # permits running jobs as the real user with a "sudo -u" command prefix

 galaxy_job_idSTR = str(job_wrapper.job_id)
 galaxy_tool_idSTR = str(job_wrapper.tool.id)
 galaxy_job_name = galaxy_job_idSTR + "_" + galaxy_tool_idSTR + "_" + job_wrapper.user
 torque_options = runner_url.split("/")
 queue = torque_options[3]
 ressources = torque_options[4]
 user_mail = job_wrapper.user.split("@")
 username = user_mail[0]

 torque_cmd = "sudo -u " + username + " echo \"" + command_line + "\" | qsub" \
     + " -o " + ofile + " -e " + efile + " -M " + job_wrapper.user \
     + " -N " + galaxy_job_name + " -q " + queue + " " + ressources

 submit_pbs_job = os.popen(torque_cmd)

 job_id = submit_pbs_job.read().rstrip("\n")

 # Original job launcher
 #job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)

 pbs.pbs_disconnect(c)

The second thing I did was to wait for the error and output files from torque
in the finish_job function (otherwise I never received the output, which seems
to be your problem); here is the code used:

def finish_job( self, pbs_job_state ):
    """
    Get the output/error for a finished job, pass to `job_wrapper.finish`
    and cleanup all the PBS temporary files.
    """
    ofile = pbs_job_state.ofile
    efile = pbs_job_state.efile
    job_file = pbs_job_state.job_file

    # collect the output
    try:
        # With the qsub system call, we need to wait for efile and ofile
        # to be created at the end of the job execution before reading them
        while not os.path.isfile(efile):
            time.sleep( 1 )
        while not os.path.isfile(ofile):
            time.sleep( 1 )

[galaxy-dev] Error: Job output not returned from cluster

2012-04-23 Thread Louise-Amélie Schmitt

Hello everyone,

I'm still trying to set up the job submission as the real user, and I 
get a mysterious error. The job obviously runs somewhere, and when it 
ends it is in the error state and displays the following message: "Job 
output not returned from cluster"


In the Galaxy log I have the following lines when the job finishes running:

galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,509 
(1455/9161620.pbs-master2.embl.de) state change: job finished, but failed
galaxy.jobs.runners.drmaa DEBUG 2012-04-23 10:36:41,511 Job output not 
returned from cluster
galaxy.jobs DEBUG 2012-04-23 10:36:41,547 finish(): Moved 
/g/funcgen/galaxy-dev/database/job_working_directory/001/1455/galaxy_dataset_2441.dat 
to /g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat

galaxy.jobs DEBUG 2012-04-23 10:36:41,755 job 1455 ended
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,755 Cleaning up 
external metadata files
galaxy.datatypes.metadata DEBUG 2012-04-23 10:36:41,768 Failed to 
cleanup MetadataTempFile temp files from 
/g/funcgen/galaxy-dev/database/job_working_directory/001/1455/metadata_out_HistoryDatasetAssociation_1606_npFIJM: 
No JSON object could be decoded: line 1 column 0 (char 0)


The /g/funcgen/galaxy-dev/database/job_working_directory/001/1455/ 
directory is empty and 
/g/funcgen/galaxy-dev/database/files/002/dataset_2441.dat exists but is 
empty.


Any ideas about what can go wrong there? Any lead would be immensely 
appreciated!


Thanks,
L-A
