Re: [galaxy-dev] Specifying number of requested cores to Galaxy DRMAA

2011-05-19 Thread Leandro Hermida
Hi Louise-Amelie,

Thank you for the post reference, this is exactly what I was looking for.
For us, for example, when I want to execute a tool that is a Java command,
the JVM will typically use multiple cores as it's running.  You
said with TORQUE it will crash when there aren't enough resources when the
job is submitted.  I wonder if you can do the same thing we have done here
with LSF?  With LSF you can configure a maximum server load for each node,
and if the submitted jobs push the node load above this threshold (e.g. more
cores requested than available) LSF will temporarily suspend jobs (using
some kind of heuristics) so that the load stays below the threshold, and
unsuspend them as resources become available.  So for us things will just run
slower when we cannot pass the requested number of cores to LSF.

I would think maybe there is a way with TORQUE to have it achieve the same
thing so jobs don't crash when resources requested are more than available?

regards,
Leandro

2011/5/19 Louise-Amélie Schmitt louise-amelie.schm...@embl.de

  Hi,

 In a previous message, I explained how I handled multithreaded jobs;
 perhaps you can modify the corresponding files for drmaa in a similar way:

 On 04/26/2011 11:26 AM, Louise-Amélie Schmitt wrote:

 Just one little fix on line 261:
 261 if ( len(l) > 1 and l[0] == job_wrapper.tool.id ):

 Otherwise it pathetically crashes when non-multithreaded jobs are
 submitted. Sorry about that.

 Regards,
 L-A

 Le mardi 19 avril 2011 à 14:33 +0200, Louise-Amélie Schmitt a écrit :

  Hello everyone,

 I'm using TORQUE with Galaxy, and we noticed that if a tool is
 multithreaded, the number of needed cores is not communicated to pbs,
 leading to job crashes if the required resources are not available when
 the job is submitted.

 I therefore modified the code a little, as follows, in
 lib/galaxy/jobs/runners/pbs.py

 256 # define PBS job options
 257 attrs.append( dict( name = pbs.ATTR_N, value = str( "%s_%s_%s" % ( job_wrapper.job_id, job_wrapper.tool.id, job_wrapper.user ) ) ) )
 258 mt_file = open('tool-data/multithreading.csv', 'r')
 259 for l in mt_file:
 260     l = string.split(l)
 261     if ( l[0] == job_wrapper.tool.id ):
 262         attrs.append( dict( name = pbs.ATTR_l, resource = 'nodes', value = '1:ppn='+str(l[1]) ) )
 263         attrs.append( dict( name = pbs.ATTR_l, resource = 'mem', value = str(l[2]) ) )
 264         break
 265 mt_file.close()
 266 job_attrs = pbs.new_attropl( len( attrs ) + len( pbs_options ) )

 (sorry it didn't come out very well due to line breaking)

 The csv file contains a list of the multithreaded tools, each line
 containing:
 tool id\tnumber of threads\tmemory needed\n

 And it works fine, the jobs wait for their turn properly, but
 information is duplicated. Perhaps there would be a way to include
 something similar in galaxy's original code (if it is not already the
 case, I may not be up-to-date) without duplicating data.
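
 The lookup described above can be sketched on its own, outside Galaxy. This
 is only an illustration, assuming the tab-separated layout given above (tool
 id, number of threads, memory needed); the function names and the sample tool
 ids are hypothetical and are not part of Galaxy or of the pbs module. It also
 skips blank or incomplete lines, which is the case the line 261 fix guards
 against.

```python
import io


def load_thread_requirements(handle):
    """Map tool id -> (threads, memory) from a whitespace/tab-separated file.

    Blank or incomplete lines are skipped instead of raising IndexError.
    """
    requirements = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete "tool id, threads, memory" record
        requirements[fields[0]] = (int(fields[1]), fields[2])
    return requirements


def pbs_resources(tool_id, requirements):
    """Return (resource, value) pairs for a tool, or [] if it is not listed."""
    if tool_id not in requirements:
        return []  # single-threaded tool: request nothing extra
    threads, memory = requirements[tool_id]
    return [("nodes", "1:ppn=%d" % threads), ("mem", memory)]


# Hypothetical file contents, including a blank line in the middle.
sample = io.StringIO("bowtie_wrapper\t4\t8gb\n\nbwa_wrapper\t8\t16gb\n")
table = load_thread_requirements(sample)
print(pbs_resources("bowtie_wrapper", table))
print(pbs_resources("some_single_threaded_tool", table))
```

 Reading the file once into a dictionary, rather than on every submission,
 is a design choice of this sketch and not something the patch above does.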

 I hope that helps :)

 Best regards,
 L-A

 ___
 The Galaxy User list should be used for the discussion of
 Galaxy analysis and other features on the public server
 at usegalaxy.org.  Please keep all replies on the list by
 using reply all in your mail client.  For discussion of
 local Galaxy instances and the Galaxy source code, please
 use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

 To manage your subscriptions to this and other Galaxy lists,
 please use the interface at:

   http://lists.bx.psu.edu/



 On 05/19/2011 12:03 PM, Leandro Hermida wrote:

 Hi,

 When Galaxy is configured to use the DRMAA job runner, is there a way for a
 tool to tell DRMAA the number of cores it would like to request? The
 equivalent of bsub -n X in LSF, where X is the minimum number of cores to
 have available on the node.
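
 For illustration only: DRMAA job templates carry a native specification
 attribute (nativeSpecification in the drmaa Python binding) that passes
 scheduler-specific options through as a string. The sketch below only builds
 such a string; the function name is hypothetical, and how Galaxy would let an
 individual tool set this value is not covered in this thread.

```python
def native_specification(scheduler, cores, memory=None):
    """Build a scheduler-specific option string for a DRMAA job template.

    Hypothetical helper: "lsf" yields the bsub-style "-n X" from the question,
    "torque" yields the qsub-style "-l nodes=1:ppn=X" used in the PBS patch.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if scheduler == "lsf":
        return "-n %d" % cores
    if scheduler == "torque":
        spec = "-l nodes=1:ppn=%d" % cores
        if memory:
            spec += ",mem=%s" % memory
        return spec
    raise ValueError("unsupported scheduler: %r" % scheduler)


print(native_specification("lsf", 4))
print(native_specification("torque", 4, memory="8gb"))
```

 The resulting string would then be assigned to the job template before
 submission, e.g. jt.nativeSpecification = native_specification("lsf", 4).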

 best,
 leandro


 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:

   http://lists.bx.psu.edu/




Re: [galaxy-dev] Specifying number of requested cores to Galaxy DRMAA

2011-05-19 Thread Louise-Amélie Schmitt

Hi again Leandro

Well I might not have been really clear, perhaps I should have re-read 
the mail before posting it :)


The thing is, the issue was not Torque starting jobs when there
were not enough resources available, but rather Torque believing that each
job needed fewer resources than it did (e.g. always assuming
the jobs were single-threaded even if the actual tools needed more than
one core). If Torque is properly notified of the needed resources, it
will dispatch the jobs or make them wait accordingly (since it knows the
nodes' limits and load), like your LSF does.


This hack is not very sexy but it just notifies Torque of the cores 
needed by every multithreaded tool, so it doesn't run a multithreaded 
job when there's only one core available in the chosen node.


Hope that helps :)

Regards,
L-A



Re: [galaxy-dev] Specifying number of requested cores to Galaxy DRMAA

2011-05-19 Thread Leandro Hermida
Hi Louise,

I see, thank you for the response. Maybe there was some confusion: the
feature I was trying to explain with LSF is that you *don't* need to tell it
the required resources for a job, and it will still be able to run all the
submitted jobs on a node without crashing even if the jobs submitted need
e.g. 10 more cores than are available (that is, 10 more cores than LSF
thought they needed).  LSF will just temporarily suspend jobs in mid-run on a
node to keep the load down, but nothing will ever crash even if you are
running jobs that require 20 CPUs and you only have 2.  I thought maybe there
was a way to do this with TORQUE.  If LSF or TORQUE is explicitly passed the
resources needed, then it will never need to temporarily suspend jobs because
it will pick a node with those resources free.  That being said, your method
is more efficient for this reason: it will pick the right node with the cores
available instead of picking a node with maybe just one core available and
then running the multithreaded job slower because it has to periodically
suspend it as it is running.

Also, I wonder, do you run any Java command-line tools via Galaxy? I can't
seem to find out exactly how many cores the JVM needs during execution
or how to limit it to a certain maximum; its CPU usage just jumps around
from 50% to over 400%.
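
One possible contributor to this kind of fluctuation, offered as an assumption
rather than a diagnosis, is the JVM's parallel garbage collection, whose thread
count the HotSpot option -XX:ParallelGCThreads limits. The sketch below only
assembles such a command line; the function name and the jar path are
hypothetical, and the option does not limit threads the tool starts itself.

```python
def java_command(jar_path, gc_threads, extra_args=()):
    """Assemble a java command line with a capped number of GC threads.

    Hypothetical helper: this limits garbage-collector threads only, not
    threads created by the application code.
    """
    if gc_threads < 1:
        raise ValueError("gc_threads must be at least 1")
    command = ["java", "-XX:ParallelGCThreads=%d" % gc_threads, "-jar", jar_path]
    command.extend(extra_args)
    return command


# Hypothetical tool jar and arguments.
print(" ".join(java_command("my_tool.jar", 2, ["--input", "reads.fastq"])))
```

On Linux, pinning the whole process to a fixed set of cores (for example with
taskset -c) is another option, since it bounds every thread of the process.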

regards,
Leandro


Re: [galaxy-dev] Specifying number of requested cores to Galaxy DRMAA

2011-05-19 Thread Louise-Amélie Schmitt

Hi again

Oh I see... Not the best way to deal with the resources indeed, but
better than nothing.


I do run Java tools via Galaxy but I haven't paid attention to this 
issue, so I can't really help. It's a homemade tool that only has one 
class so I guess it's not worth the effort in my case. But if you find 
the answer, I'd be interested too.


Good luck,
L-A

