[galaxy-dev] A Galaxy environment that can support up to 35 users?
Hi, Galaxy Developers,

Is anybody out there managing a Galaxy environment that was designed and/or has been tested to support 35 concurrent users? I am asking because we [the U of C] have a training session coming up this Thursday, and the environment we have deployed needs to support this number of users. We have put the server under as much stress as possible with 6 users, and Galaxy has performed fine; however, it has proven somewhat challenging to load test all 35 concurrent users prior to the workshop. I can't help but feel we are rolling the dice a little bit, as we've never put the server under anything close to this load level, so I figured I would try to dot my i's by sending an email to this list.

Here are the configuration changes that are currently implemented (in terms of trying to performance tune and web scale our Galaxy server):

1) Enabled proxy load balancing with six web front-ends (the number six pulled from the Galaxy wiki) (Apache):

    <Proxy balancer://galaxy/>
        BalancerMember http://127.0.0.1:8080
        BalancerMember http://127.0.0.1:8081
        BalancerMember http://127.0.0.1:8082
        BalancerMember http://127.0.0.1:8083
        BalancerMember http://127.0.0.1:8084
        BalancerMember http://127.0.0.1:8085
    </Proxy>

2) Rewrite static URLs for static content (Apache):

    RewriteRule ^/static/style/(.*) /group/galaxy/galaxy-dist/static/uchicago_cri_august_2012_style/blue/$1 [L]
    RewriteRule ^/static/scripts/(.*) /group/galaxy/galaxy-dist/static/scripts/packed/$1 [L]
    RewriteRule ^/static/(.*) /group/galaxy/galaxy-dist/static/$1 [L]
    RewriteRule ^/robots.txt /group/galaxy/galaxy-dist/static/robots.txt [L]
    RewriteRule ^(.*) balancer://galaxy$1 [P]

3) Enabled compression and caching (Apache):

    <Location "/">
        SetOutputFilter DEFLATE
        SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
        SetEnvIfNoCase Request_URI \.(?:t?gz|zip|bz2)$ no-gzip dont-vary
    </Location>
    <Location "/static">
        ExpiresActive On
        ExpiresDefault "access plus 6 hours"
    </Location>

4) Configured web scaling (universe_wsgi.ini):
   a) six web server processes (threadpool_workers = 7)
   b) a single job manager (threadpool_workers = 5)
   c) two job handlers (threadpool_workers = 5)

5) Configured a pbs_mom external job runner (our cluster), and commented out the default tool runners so that they use pbs (we are not using the other tools for the workshop):

    #ucsc_table_direct1 = local:///
    #ucsc_table_direct_archaea1 = local:///
    #ucsc_table_direct_test1 = local:///
    #upload1 = local:///

6) Changed the following database parameters (universe_wsgi.ini):

    database_engine_option_pool_size = 10
    database_engine_option_max_overflow = 20

7) Disabled the developer settings (universe_wsgi.ini):

    debug = False
    use_interactive = False
    #filter-with = gzip

The server I have is a VM with the following resources:

    2GB of RAM
    4 CPU cores

I feel that it is also worthwhile to mention that users will not be downloading datasets during the workshop, so as of now, the implementation of XSendFile as specified in the Apache Proxy documentation is not of immediate concern.

Does anybody see any glaring mistakes where they think this configuration might fall short with respect to capacity planning for an environment of 35 concurrent users, or additional tuning that could potentially assist in ensuring the availability of the server during the workshop?

Thank you so much for your opinion(s), and please wish us luck this Thursday :-)

Dan Sullivan

___
Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
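Editorial sketch: the post above notes that load testing all 35 concurrent users by hand is hard to arrange. One rough way to approximate it is to drive the web tier from 35 threads at once. This is a minimal sketch using only the Python standard library; the names fetch_url, simulate_users and summarize are illustrative and are not part of Galaxy. It only requests pages, so it says nothing about job submission or database load.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_url(url, timeout=30):
    """Fetch one URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
        return response.status


def simulate_users(fetch, n_users=35, requests_per_user=10):
    """Call `fetch` repeatedly from n_users threads running at the same time.

    Returns a flat list of (succeeded, elapsed_seconds) tuples, one per request.
    """
    def one_user(_user_index):
        timings = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                fetch()
                succeeded = True
            except Exception:
                # Any error (timeout, refused connection, HTTP error) counts as a failure.
                succeeded = False
            timings.append((succeeded, time.perf_counter() - start))
        return timings

    with ThreadPoolExecutor(max_workers=n_users) as pool:
        per_user = list(pool.map(one_user, range(n_users)))
    return [timing for user_timings in per_user for timing in user_timings]


def summarize(results):
    """Reduce the raw timings to request count, failure count and slowest request."""
    failures = sum(1 for succeeded, _ in results if not succeeded)
    slowest = max(elapsed for _, elapsed in results)
    return {"requests": len(results), "failures": failures, "slowest_s": slowest}
```

To use it, pass a callable that hits the server, for example summarize(simulate_users(lambda: fetch_url(url))) where url is the address of the workshop server (not given in this thread). Any failures, or a slowest request of many seconds, would suggest the web front-ends are saturated.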
Re: [galaxy-dev] A Galaxy environment that can support up to 35 users?
Hello Dan,

A couple of lessons we learned from setting up similar workshop Galaxies:

Dan Sullivan wrote, On 09/17/2012 01:04 PM:
> Hi, Galaxy Developers,
>
> Is anybody out there managing a Galaxy environment that was designed and/or has been tested to support 35 concurrent users? I am asking because we [the U of C] have a training session coming up this Thursday, and the environment we have deployed needs to support this number of users. We have put the server under as much stress as possible with 6 users, and Galaxy has performed fine; however, it has proven somewhat challenging to load test all 35 concurrent users prior to the workshop. I can't help but feel we are rolling the dice a little bit, as we've never put the server under anything close to this load level, so I figured I would try to dot my i's by sending an email to this list.
>
> Here are the configuration changes that are currently implemented (in terms of trying to performance tune and web scale our Galaxy server):
>
> 1) Enabled proxy load balancing with six web front-ends (the number six pulled from the Galaxy wiki) (Apache):

When configured correctly, 3 or 4 web front-ends seemed sufficient. (When configured incorrectly, it doesn't matter how many you have, performance will suffer :) ). Given that you only have 4 CPUs/cores for your machine, having six front-ends seems too much.

> 2) Rewrite static URLs for static content (Apache):
> 3) Enabled compression and caching (Apache):

This might sound obvious, but test that it actually works (e.g. check in the Apache logs that the static files were served by Apache, not by Galaxy). Typos and other minor incompatibilities can cause the URLs to be served by Galaxy, which will waste resources.
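Editorial sketch: one way to run the check described above is to look at the other side of the proxy. If Galaxy's own web processes log a request for /static/ or /robots.txt, that request was not served by Apache. The sketch below assumes the Galaxy-side log writes each request line in double quotes, as common access-log formats do; that log shape, the function name and the sample path are assumptions, not something stated in this thread.

```python
import re

# Assumed log shape: the request line appears in double quotes, e.g.
#   ... "GET /static/scripts/galaxy.js HTTP/1.1" 200 ...
REQUEST_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def static_requests_seen(log_lines):
    """Return the static paths found in the given log lines, in order.

    Run this over the Galaxy-side log: every path returned is a static
    request that reached Galaxy instead of being answered by Apache.
    """
    hits = []
    for line in log_lines:
        match = REQUEST_LINE.search(line)
        if not match:
            continue
        path = match.group(1)
        if path.startswith("/static/") or path == "/robots.txt":
            hits.append(path)
    return hits
```

After loading a few pages in a browser, pass the lines of the Galaxy web process log to static_requests_seen. An empty list is the expected result when the rewrite rules are working.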
> 4) Configured web scaling (universe_wsgi.ini):
>    a) six web server processes (threadpool_workers = 7)
>    b) a single job manager (threadpool_workers = 5)
>    c) two job handlers (threadpool_workers = 5)

Again, with a system of only 4 CPUs, you might overload your server.

> 5) Configured a pbs_mom external job runner (our cluster), and commented out the default tool runners (to use pbs) (we are not using the other tools for the workshop).
>
>     #ucsc_table_direct1 = local:///
>     #ucsc_table_direct_archaea1 = local:///
>     #ucsc_table_direct_test1 = local:///
>     #upload1 = local:///

Unless your workshop is *tightly* scripted, you can't really tell which tools users will use. If this is an introduction to Galaxy, users will experiment with some tools (even if you don't tell them to). (Also, I'm not sure if those data import tools can run on your cluster node.)

> 6) Changed the following database parameters (universe_wsgi.ini):
>
>     database_engine_option_pool_size = 10
>     database_engine_option_max_overflow = 20

Assuming you're using PostgreSQL (and you shouldn't use anything else, in practice), add the following:

    database_engine_option_server_side_cursors = True

And I would set pool_size to 50 and max_overflow to 100. That seems excessive, but under the load of 20 users hammering at Galaxy at the same time in a short time window, I got database connection pool size errors within 10 minutes.

> The server I have is a VM with the following resources:
>
>     2GB of RAM
>     4 CPU cores

IMHO, that's too little memory and too few CPUs. Some ballpark figures from our servers:

Memory-wise: each web front-end Python process takes ~300MB (and you plan for 6 of them), and you also have 3 more Python processes (1 job manager + 2 job handlers).

CPU-wise: in addition to the 9 Python processes, you will have several PostgreSQL processes, a few Apache threads, and some other system processes running. Even when each Python process doesn't run at full capacity (i.e. 100% CPU), your system already sounds overloaded.
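Editorial sketch: the ballpark figures above can be turned into a quick back-of-the-envelope check. The ~300MB figure is quoted only for web front-end processes; applying it to the job manager and job handlers as well is an assumption made here. The connection ceiling assumes each of the 9 Galaxy processes keeps its own SQLAlchemy pool, which can open up to pool_size + max_overflow connections. That ceiling is a worst case and is worth comparing against PostgreSQL's max_connections setting (100 in a stock install).

```python
WEB_PROCESSES = 6          # web front-ends, from the original post
JOB_PROCESSES = 3          # 1 job manager + 2 job handlers
MB_PER_PROCESS = 300       # ballpark quoted for a web front-end process
VM_RAM_MB = 2 * 1024       # the 2GB VM

# Memory used by the web front-ends alone.
web_memory_mb = WEB_PROCESSES * MB_PER_PROCESS
print(web_memory_mb)               # 1800
print(VM_RAM_MB - web_memory_mb)   # 248 MB left for everything else

# Assumption: job manager and handlers are roughly the same size.
all_python_mb = (WEB_PROCESSES + JOB_PROCESSES) * MB_PER_PROCESS
print(all_python_mb)               # 2700, already more than the 2048 MB available


def max_db_connections(processes, pool_size, max_overflow):
    """Worst-case connection count if every process fills its own pool."""
    return processes * (pool_size + max_overflow)


total_processes = WEB_PROCESSES + JOB_PROCESSES
print(max_db_connections(total_processes, 10, 20))    # 270 with the original settings
print(max_db_connections(total_processes, 50, 100))   # 1350 with the suggested settings
```

On these rough numbers the six web front-ends alone nearly fill the 2GB VM, which matches the conclusion above that the machine is undersized.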
When jobs are running (at least on our system), the job handlers consume some CPU time just by monitoring the jobs. When users submit large workflows with many jobs, the job handlers take 100% CPU for a short time. With all of the above combined, I would say 4 CPUs sounds a bit weak.

> I feel that it is also worthwhile to mention that users will not be downloading datasets during the workshop, so as of now, the implementation of XSendFile as specified in the Apache Proxy documentation is not of immediate concern.

IMHO, this is a wrong assumption. You cannot fully control what users in a workshop are doing. Imagine that only two users out of your 35 click (even accidentally) on the download icon and start downloading a big file. If downloads are handled by the Python processes, then two of your six web front-ends are immediately busy and can't serve other users. Also, regardless of how big the downloaded files are, Apache+XSendFile will be more efficient at sending files to the user (than Python), and with just 4 CPUs you definitely want to conserve as many resources as possible.

> Does anybody see any
Re: [galaxy-dev] A Galaxy environment that can support up to 35 users?
Hi, Assaf,

Thank you for your very detailed, thorough, and thoughtful reply. I have responses to some of the things you said; my comments are in-line.

On Sep 17, 2012, at 1:11 PM, Assaf Gordon gor...@cshl.edu wrote:
> Hello Dan,
>
> A couple of lessons we learned from setting up similar workshop Galaxies:
>
> Dan Sullivan wrote, On 09/17/2012 01:04 PM:
>> Hi, Galaxy Developers,
>>
>> Is anybody out there managing a Galaxy environment that was designed and/or has been tested to support 35 concurrent users? I am asking because we [the U of C] have a training session coming up this Thursday, and the environment we have deployed needs to support this number of users. We have put the server under as much stress as possible with 6 users, and Galaxy has performed fine; however, it has proven somewhat challenging to load test all 35 concurrent users prior to the workshop. I can't help but feel we are rolling the dice a little bit, as we've never put the server under anything close to this load level, so I figured I would try to dot my i's by sending an email to this list.
>>
>> Here are the configuration changes that are currently implemented (in terms of trying to performance tune and web scale our Galaxy server):
>>
>> 1) Enabled proxy load balancing with six web front-ends (the number six pulled from the Galaxy wiki) (Apache):
>
> When configured correctly, 3 or 4 web front-ends seemed sufficient. (When configured incorrectly, it doesn't matter how many you have, performance will suffer :) ). Given that you only have 4 CPUs/cores for your machine, having six front-ends seems too much.

Since we are not running on bare metal hardware, I can definitely increase memory and CPU count on the Galaxy VM. I am going to increase these to 8 cores w/8GB of RAM for the purpose of the workshop, based on the rough numbers you provided.
>> 2) Rewrite static URLs for static content (Apache):
>> 3) Enabled compression and caching (Apache):
>
> This might sound obvious, but test that it actually works (e.g. check in the Apache logs that the static files were served by Apache, not by Galaxy). Typos and other minor incompatibilities can cause the URLs to be served by Galaxy, which will waste resources.

This is a very good idea. Thank you.

>> 4) Configured web scaling (universe_wsgi.ini):
>>    a) six web server processes (threadpool_workers = 7)
>>    b) a single job manager (threadpool_workers = 5)
>>    c) two job handlers (threadpool_workers = 5)
>
> Again, with a system of only 4 CPUs, you might overload your server.

As I said, I am going to increase the CPU core count to 8 based on your recommendations.

>> 5) Configured a pbs_mom external job runner (our cluster), and commented out the default tool runners (to use pbs) (we are not using the other tools for the workshop).
>>
>>     #ucsc_table_direct1 = local:///
>>     #ucsc_table_direct_archaea1 = local:///
>>     #ucsc_table_direct_test1 = local:///
>>     #upload1 = local:///
>
> Unless your workshop is *tightly* scripted, you can't really tell which tools users will use. If this is an introduction to Galaxy, users will experiment with some tools (even if you don't tell them to). (Also, I'm not sure if those data import tools can run on your cluster node.)

Based on some limited testing, these data import tools can run on our cluster node. We have NAT configured with outbound HTTP from the cluster. I think we're alright on this one, although I will report back if I learn any new meaningful lessons from using this configuration.
>> 6) Changed the following database parameters (universe_wsgi.ini):
>>
>>     database_engine_option_pool_size = 10
>>     database_engine_option_max_overflow = 20
>
> Assuming you're using PostgreSQL (and you shouldn't use anything else, in practice), add the following:
>
>     database_engine_option_server_side_cursors = True
>
> And I would set pool_size to 50 and max_overflow to 100. That seems excessive, but under the load of 20 users hammering at Galaxy at the same time in a short time window, I got database connection pool size errors within 10 minutes.

This is good information from your experience. Thank you for sharing it. I will implement this as you suggested.

>> The server I have is a VM with the following resources:
>>
>>     2GB of RAM
>>     4 CPU cores
>
> IMHO, that's too little memory and too few CPUs. Some ballpark figures from our servers:
>
> Memory-wise: each web front-end Python process takes ~300MB (and you plan for 6 of them), and you also have 3 more Python processes (1 job manager + 2 job handlers).
>
> CPU-wise: in addition to the 9 Python processes, you will have several PostgreSQL processes, a few Apache threads, and some other system processes running. Even when each Python process doesn't run at full capacity (i.e. 100% CPU), your system already sounds overloaded.
>
> When jobs are running (at least on our system) the job handlers consume some CPU time by just monitoring