[galaxy-dev] A Galaxy environment that can support up to 35 users?

2012-09-17 Thread Dan Sullivan
Hi, Galaxy Developers,

Is anybody out there managing a Galaxy environment that was designed and/or has
been tested to support 35 concurrent users?  The reason why I am asking this is 
because we [the U of C] have a training session coming up this Thursday, and 
the environment we have deployed needs to support this number of users.  We 
have put the server under as much stress as possible with 6 users, and
Galaxy has performed fine; however, it has proven somewhat challenging to do
load testing for all 35 concurrent users prior to the workshop.  I can't help 
but feel we are rolling the dice a little bit as we've never put the server 
under anything close to this load level, so I figured I would try to dot my i's 
by sending an email to this list.

Here are the configuration changes that are currently implemented (in terms of 
trying to performance tune and web scale our galaxy server):

1) Enabled proxy load balancing with six web front-ends (the number six pulled 
from Galaxy wiki) (Apache):

<Proxy balancer://galaxy>
    BalancerMember http://127.0.0.1:8080
    BalancerMember http://127.0.0.1:8081
    BalancerMember http://127.0.0.1:8082
    BalancerMember http://127.0.0.1:8083
    BalancerMember http://127.0.0.1:8084
    BalancerMember http://127.0.0.1:8085
</Proxy>

2) Rewrite static URLs for static content (Apache):

RewriteRule ^/static/style/(.*) /group/galaxy/galaxy-dist/static/uchicago_cri_august_2012_style/blue/$1 [L]
RewriteRule ^/static/scripts/(.*) /group/galaxy/galaxy-dist/static/scripts/packed/$1 [L]
RewriteRule ^/static/(.*) /group/galaxy/galaxy-dist/static/$1 [L]
RewriteRule ^/robots.txt /group/galaxy/galaxy-dist/static/robots.txt [L]
RewriteRule ^(.*) balancer://galaxy$1 [P]

3) Enabled compression and caching (Apache):
<Location "/">
    SetOutputFilter DEFLATE
    SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
    SetEnvIfNoCase Request_URI \.(?:t?gz|zip|bz2)$ no-gzip dont-vary
</Location>
<Location "/static">
    ExpiresActive On
    ExpiresDefault "access plus 6 hours"
</Location>

4) Configured web scaling (universe_wsgi.ini) :
a) six web server processes (threadpool_workers = 7)
b) a single job manager (threadpool_workers = 5)
c) two job handlers (threadpool_workers = 5)
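
For reference, the relevant universe_wsgi.ini pieces look roughly like this (only
the first web server section is shown; ports and section names follow the usual
Galaxy web-scaling layout and are illustrative):

[server:web0]
use = egg:Paste#http
port = 8080
host = 127.0.0.1
use_threadpool = True
threadpool_workers = 7

# web1 through web5 are identical, on ports 8081-8085; manager, handler0 and
# handler1 get their own [server:...] sections with threadpool_workers = 5

[app:main]
job_manager = manager
job_handlers = handler0,handler1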

5) Configured a pbs_mom external job runner (our cluster), and commented out 
the default tool runners (to use pbs)  (we are not using the other tools for 
the workshop).

#ucsc_table_direct1 = local:///
#ucsc_table_direct_archaea1 = local:///
#ucsc_table_direct_test1 = local:///
#upload1 = local:///
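
For completeness, the runner side of this looks roughly as follows (a sketch; the
actual runner URL encodes our PBS server and queue, so treat it as illustrative):

# universe_wsgi.ini (sketch)
start_job_runners = pbs
default_cluster_job_runner = pbs:///
# with the [galaxy:tool_runners] overrides above commented out, those tools
# simply fall back to the pbs default instead of running locally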

6)  Changed the following database parameters (universe_wsgi.ini):
database_engine_option_pool_size = 10
database_engine_option_max_overflow = 20 

7) Disabled the developer settings (universe_wsgi.ini):
debug = False
use_interactive = False
#filter-with = gzip

The server I have is a VM with the following resources:

2GB of RAM
4 CPU cores

I feel that it is also worthwhile to mention that users will not be downloading 
datasets during the workshop, so as of now, the implementation of XSendFile 
as specified in the Apache Proxy documentation is not of immediate concern.

Does anybody see any glaring mistakes where they think this configuration might
fall short with respect to capacity planning for an environment of 35 
concurrent users, or additional tuning that could potentially assist in 
ensuring the availability of the server during the workshop?   Thank-you so 
much for your opinion(s), and please wish us luck this Thursday :-)

Dan Sullivan


Re: [galaxy-dev] A Galaxy environment that can support up to 35 users?

2012-09-17 Thread Assaf Gordon
Hello Dan,

Couple of lessons we learned from setting up similar workshop-galaxies:

Dan Sullivan wrote, On 09/17/2012 01:04 PM:
 Hi, Galaxy Developers,
 
 Is anybody out there managing a Galaxy environment that was designed and/or
 has been tested to support 35 concurrent users?  The reason why I am asking 
 this is because we [the U of C] have a training session coming up this 
 Thursday, and the environment we have deployed needs to support this number 
 of users.  We have put the server under as much stress as possible with 6
 users, and Galaxy has performed fine; however, it has proven somewhat
 challenging to do load testing for all 35 concurrent users prior to the 
 workshop.  I can't help but feel we are rolling the dice a little bit as 
 we've never put the server under anything close to this load level, so I 
 figured I would try to dot my i's by sending an email to this list.
 
 Here are the configuration changes that are currently implemented (in terms 
 of trying to performance tune and web scale our galaxy server):
 
 1) Enabled proxy load balancing with six web front-ends (the number six 
 pulled from Galaxy wiki) (Apache):
 
When configured correctly, 3 or 4 web-front-ends seemed sufficient.
(When configured incorrectly, it doesn't matter how many you have, performance
will suffer :) ).

Given that you only have 4 CPUs/cores for your machine, having six front-ends 
seems too much.
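
(If you do trim it down, keep in mind that the Apache balancer members and the
[server:webN] processes you actually start have to stay in sync; e.g., with four
front-ends, something like:

<Proxy balancer://galaxy>
    BalancerMember http://127.0.0.1:8080
    BalancerMember http://127.0.0.1:8081
    BalancerMember http://127.0.0.1:8082
    BalancerMember http://127.0.0.1:8083
</Proxy>

with only web0 through web3 defined and started on the Galaxy side.)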

 2) Rewrite static URLs for static content (Apache):
 3) Enabled compression and caching (Apache):

This might sound obvious, but test that it actually works (e.g. check in the
apache logs that the static files were served by apache, not by galaxy).
Typos and other minor incompatibilities can cause the URLs to be served by 
galaxy, which will waste resources.
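
One cheap way to check (illustrative, and it relies on the environment variables
that mod_proxy_balancer exports) is to log which balancer worker, if any, handled
each request:

# Requests Apache serves directly (e.g. /static/...) will log worker=-
# while requests proxied to a Galaxy process will log the backend URL.
LogFormat "%h \"%r\" %>s worker=%{BALANCER_WORKER_NAME}e" galaxy_routing
CustomLog logs/galaxy_routing_log galaxy_routing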

 4) Configured web scaling (universe_wsgi.ini) :
   a) six web server processes (threadpool_workers = 7)
   b) a single job manager (threadpool_workers = 5)
 c) two job handlers (threadpool_workers = 5)

Again, with a system of only 4 CPUs, you might overload your server.

 5) Configured a pbs_mom external job runner (our cluster), and commented out 
 the default tool runners (to use pbs)  (we are not using the other tools for 
 the workshop).
 
 #ucsc_table_direct1 = local:///
 #ucsc_table_direct_archaea1 = local:///
 #ucsc_table_direct_test1 = local:///
 #upload1 = local:///
 

Unless your workshop is *tightly* scripted, you can't really tell which tools
users will use.
If this is an introduction to galaxy, users will experiment with some tools 
(even if you don't tell them to).

(also, I'm not sure if those data import tools can run on your cluster node).


 6)  Changed the following database parameters (universe_wsgi.ini):
   database_engine_option_pool_size = 10
   database_engine_option_max_overflow = 20 

Assuming you're using PostgreSQL (and you shouldn't use anything else, in 
practice), add the following:
   database_engine_option_server_side_cursors = True

And I would set pool_size to 50 and max_overflow to 100 - it seems excessive,
but under the load of 20 users hammering at Galaxy at the same time in a short
time window, I got database connection pool errors within 10 minutes.
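
Putting that together, the database section of universe_wsgi.ini would look
roughly like this (the connection string is just a placeholder for your own):

# universe_wsgi.ini sketch
database_connection = postgres://galaxy:CHANGEME@localhost:5432/galaxy
database_engine_option_server_side_cursors = True
database_engine_option_pool_size = 50
database_engine_option_max_overflow = 100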
 
 The server I have is a VM with the following resources:
 
 2GB of RAM
 4CPU Cores
 

IMHO, that's too little memory and CPUs.

Ball-park figures for our servers:

Memory-wise:
each web-front-end python process takes ~300MB (and you plan for 6 of them),
and you also have 3 more python processes (1 job manager + 2 job handlers) -
so the six front-ends alone come to roughly 1.8GB, already close to your 2GB
before the other python processes, PostgreSQL, and Apache get anything.

CPU-wise:
In addition to the 9 python processes, you will have several PostgreSQL
processes, a few apache threads, and some other system processes running.
Even when each python process doesn't run at full capacity (i.e. 100% CPU), your
system already sounds overloaded.
When jobs are running (at least on our system), the job-handlers consume some
CPU time just monitoring the jobs.
When users submit large workflows with many jobs, the job-handlers take 100%
CPU for a short time.
With all of the above combined, I would say 4 CPUs sound a bit weak.


 I feel that it is also worthwhile to mention that users will not be 
 downloading datasets during the workshop, so as of now, the implementation of 
 XSendFile as specified in the Apache Proxy documentation is not of 
 immediate concern.
 

IMHO, this is a wrong assumption.
You cannot fully control what users in a workshop are doing.
Imagine that just two of your 35 users click (even accidentally) on the
download icon and start downloading a big file - if downloads are handled by
the python processes, then immediately two of your six web front-ends are
busy and can't serve other users.

Also - regardless of how big the downloaded files are, Apache+XSendFile will be
more efficient at sending files to the user (than python), and with just 4 CPUs
you definitely want to conserve as many resources as possible.
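
Wiring it up is roughly two pieces (a sketch, assuming mod_xsendfile is
installed; the dataset path is a placeholder for wherever your file storage
actually lives):

# Apache
XSendFile on
XSendFilePath /group/galaxy/galaxy-dist/database/files

# universe_wsgi.ini
apache_xsendfile = True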


 Does anybody see any 

Re: [galaxy-dev] A Galaxy environment that can support up to 35 users?

2012-09-17 Thread Dan Sullivan
Hi, Assaf,

Thank-you for your very detailed, thorough, and thoughtful reply.  I have
responses to some of your points; my comments are in-line:

On Sep 17, 2012, at 1:11 PM, Assaf Gordon gor...@cshl.edu wrote:

 Hello Dan,
 
 Couple of lessons we learned from setting up similar workshop-galaxies:
 
 Dan Sullivan wrote, On 09/17/2012 01:04 PM:
 Hi, Galaxy Developers,
 
 Is anybody out there managing a Galaxy environment that was designed and/or
 has been tested to support 35 concurrent users?  The reason why I am asking 
 this is because we [the U of C] have a training session coming up this 
 Thursday, and the environment we have deployed needs to support this number 
 of users.  We have put the server under as much stress as possible with 6
 users, and Galaxy has performed fine; however, it has proven somewhat
 challenging to do load testing for all 35 concurrent users prior to the 
 workshop.  I can't help but feel we are rolling the dice a little bit as 
 we've never put the server under anything close to this load level, so I 
 figured I would try to dot my i's by sending an email to this list.
 
 Here are the configuration changes that are currently implemented (in terms 
 of trying to performance tune and web scale our galaxy server):
 
 1) Enabled proxy load balancing with six web front-ends (the number six 
 pulled from Galaxy wiki) (Apache):
 
 When configured correctly, 3 or 4 web-front-ends seemed sufficient.
 (When configured incorrectly, it doesn't matter how many you have, 
 performances will suffer :) ).
 
 Given that you only have 4 CPUs/cores for your machine, having six front-ends 
 seems too much.

Since we are not running on bare-metal hardware, I can definitely increase
memory and CPU count on the Galaxy VM.  I am going to increase these to 8 cores
w/ 8GB of RAM for the purpose of the workshop, based on the rough numbers you
provided.

 
 2) Rewrite static URLs for static content (Apache):
 3) Enabled compression and caching (Apache):
 
 This might sound obvious, but test that it actually works (e.g. check in the
 apache logs that the static files were served by apache, not by galaxy).
 Typos and other minor incompatibilities can cause the URLs to be served by 
 galaxy, which will waste resources.
 
This is a very good idea.  Thank-you.

 4) Configured web scaling (universe_wsgi.ini) :
  a) six web server processes (threadpool_workers = 7)
  b) a single job manager (threadpool_workers = 5)
c) two job handlers (threadpool_workers = 5)
 
 Again, with a system of only 4 CPUs, you might overload your server.

As I said, I am going to increase the CPU core count to 8 based on your 
recommendations.
 
 5) Configured a pbs_mom external job runner (our cluster), and commented out 
 the default tool runners (to use pbs)  (we are not using the other tools for 
 the workshop).
 
 #ucsc_table_direct1 = local:///
 #ucsc_table_direct_archaea1 = local:///
 #ucsc_table_direct_test1 = local:///
 #upload1 = local:///
 
 
 Unless your workshop is *tightly* scripted, you can't really tell which tool 
 users will use.
 If this is an introduction to galaxy, users will experiment with some tools 
 (even if you don't tell them to).
 
 (also, I'm not sure if those data import tools can run on your cluster node).

Based on some limited testing, these data import tools can run on our cluster 
node.  We have NAT configured with outbound HTTP from the cluster.  I think 
we're alright on this one, although I will report back if I find any new 
meaningful lessons learned using this configuration.
 
 
 6)  Changed the following database parameters (universe_wsgi.ini):
  database_engine_option_pool_size = 10
  database_engine_option_max_overflow = 20 
 
 Assuming you're using PostgreSQL (and you shouldn't use anything else, in 
 practice), add the following:
   database_engine_option_server_side_cursors = True
 
 And I would set pool_size to 50 and max_overflow to 100 - seems 
 excessive, but under the load of 20 users hammering at galaxy at the same 
 time in a short time window,
 I got the database connection pool size errors within 10 minutes.   

This is good information from your experience.  Thank-you for sharing this.  I 
will implement this as you suggested.
 
 The server I have is a VM with the following resources:
 
 2GB of RAM
 4CPU Cores
 
 
 IMHO, that's too little memory and CPUs.
 
 A ball-park figures for our servers:
 
 memory-wise:
 each web-front-end python process takes ~300MB (and you plan for 6 of them).
 and you also have 3 more python processes (1 job manager + 2 job handlers).
 
 CPU-wise:
 In addition to 9 python processes, you will have several PostgreSQL 
 processes, few apache threads, and some other system processes running.
 Even when each python process doesn't run at full capacity (ie. 100% CPU), 
 your system already sounds overloaded.
 When jobs are running (at least on our system) the job-handlers consume some 
 CPU time by just monitoring