On Mon, Apr 11, 2011 at 12:53 PM, Fritz Ferstl <[email protected]> wrote:
> Well, some aren't ;-)
>
> If you have prepared a large pile of work which breaks down into maybe
> hundreds of thousands of jobs, then you don't want the submission to take
> forever (500K jobs would take about 1 hr at 150 jobs/sec if I'm doing the
> math right in my head). With DRMAA it'd be only 10 minutes ... just assuming
> one client.
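[Fritz's throughput estimate assumes a single persistent DRMAA session. A minimal sketch of such a submission loop, assuming the DRMAA 1.0 C binding (drmaa.h / libdrmaa) that ships with Grid Engine; the job command, argument, and count here are illustrative:]

```c
/* Sketch: submit 1000 jobs over one DRMAA session, reusing a single
 * job template, so there is no per-job fork/exec of qsub and its shell.
 * Requires an SGE installation; compile with: gcc submit.c -ldrmaa */
#include <stdio.h>
#include "drmaa.h"

int main(void)
{
    char err[DRMAA_ERROR_STRING_BUFFER];
    char jobid[DRMAA_JOBNAME_BUFFER];
    const char *args[] = { "30", NULL };   /* illustrative: sleep 30 */
    drmaa_job_template_t *jt = NULL;
    int i;

    if (drmaa_init(NULL, err, sizeof(err) - 1) != DRMAA_ERRNO_SUCCESS) {
        fprintf(stderr, "drmaa_init: %s\n", err);
        return 1;
    }
    drmaa_allocate_job_template(&jt, err, sizeof(err) - 1);
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/sleep",
                        err, sizeof(err) - 1);
    drmaa_set_vector_attribute(jt, DRMAA_V_ARGV, args, err, sizeof(err) - 1);

    /* the session and template persist across all submissions */
    for (i = 0; i < 1000; i++) {
        if (drmaa_run_job(jobid, sizeof(jobid) - 1, jt,
                          err, sizeof(err) - 1) != DRMAA_ERRNO_SUCCESS) {
            fprintf(stderr, "drmaa_run_job: %s\n", err);
            break;
        }
    }
    drmaa_delete_job_template(jt, err, sizeof(err) - 1);
    drmaa_exit(err, sizeof(err) - 1);
    return 0;
}
```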
I understand that, but I believe that if 500K-job submissions are frequent
events, then most likely the cluster has a lot of nodes, and that means they
most likely already have some sort of support contract with Oracle or you
guys :-D

The other thing is that DRMAA job submissions are still not as common as
qsubs. However, I've seen a cluster that only gets jobs via DRMAA, so that
might change in the future.

> It's not only that. Qsub obviously also goes through fork/execs of the Unix
> command shell around it each time a job is submitted... while you can write
> a DRMAA client which submits jobs in a loop.

I believe fork/exec is not that bad compared to sending a packet to a remote
machine, especially with the TCP overhead and the network latency.

Rayson

>
> Cheers,
>
> Fritz
>
> Oh, and another thing I did was -b y vs n, and on Linux it also did not
> make much difference with the local disk.
>
> Rayson
>
> On Mon, Apr 11, 2011 at 12:28 PM, Fritz Ferstl <[email protected]> wrote:
>>
>> Another thing you might want to try is submitting jobs via DRMAA instead
>> of via qsub. You'll get roughly 900 jobs / sec submitted with a single
>> DRMAA client. So this will load the system more than when using qsub.
>>
>> Cheers,
>>
>> Fritz
>>
>> On Fri, Apr 8, 2011 at 10:44 AM, Chris Dagdigian <[email protected]> wrote:
>>
>> - Job submission rate and job "churn". I think DanT said this in a blog
>> post years ago, but if you expect to need 200+ qsubs per second then you
>> are going to need Berkeley spooling.
>>
>> I did some classic spooling benchmarks during the weekend:
>>
>> submitting 1000 jobs with 1 qsub session  - 19 sec
>> submitting 2000 jobs with 2 qsub sessions - 20 sec
>> submitting 4000 jobs with 4 qsub sessions - 31 sec
>>
>> (i.e. each session submits 1000 jobs, and the qsub sessions run in
>> parallel.)
>>
>> I then modified the classic spooling code so that qmaster does not
>> write to the disk when jobs are submitted (which is fine as an
>> experiment; as long as the qmaster is not restarted, jobs are not
>> lost), and got identical results.
>>
>> My conclusion is that Linux caches most of the disk writes and thus
>> I/O performance does not affect qsub performance much. However,
>> even with a journaling filesystem with consistency, there is a small
>> chance that some jobs can be lost if the qmaster crashes during the
>> write operations. On the other hand, hardware resource contention
>> and/or LOCK_GLOBAL contention might be causing the slowdown in the
>> 4-parallel-qsub case. And even when the number of worker_threads and
>> listener_threads was increased, the results were the same.
>>
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/bootstrap.html
>>
>> Hardware: local disk, Thinkpad T510 - 64-bit Linux, 4GB memory, 2
>> cores/4 threads, 2.67GHz
>>
>> I have not benchmarked Berkeley DB spooling, but I believe I will need
>> server hardware to get greater than 200 jobs per second qsub
>> performance.
>>
>> Rayson
>>
>> The same goes for clusters that experience huge amounts of job flows
>> or state changes. I have less experience here, but in these sorts of
>> systems I think binary spooling makes a real difference.
>>
>> My $.02 of course!
>>
>> -chris
>>
>> Mark Suhovecky wrote:
>>
>> OK, I got SGE 6.2u5p1 to build with version 4.4.20 of Berkeley DB,
>> and proceeded to try to install Grid Engine on the master host
>> via inst_sge.
>>
>> At some point it tells me that I should install Berkeley DB
>> on the master host first, so I do "inst_sge -db", which hangs when it
>> tries to start the DB for the first time. Then, because some
>> days I'm not terribly bright, I decide to see if the DB will start
>> at machine reboot. Well, now it hangs when sgedb start
>> runs from init. Still gotta fix that.
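[For reference, the thread counts and spooling method Rayson mentions live in the qmaster's bootstrap file documented at the bootstrap(5) link above. A representative classic-spooling fragment; the paths and values here are illustrative, not a recommendation:]

```
# $SGE_ROOT/$SGE_CELL/common/bootstrap (fragment; illustrative values)
admin_user        sgeadmin
spooling_method   classic
spooling_lib      libspoolc
spooling_params   /opt/sge/default/common;/opt/sge/default/spool/qmaster
listener_threads  4
worker_threads    4
```

[Switching spooling_method to berkeleydb changes spooling_lib and spooling_params accordingly; see bootstrap(5) for the exact forms.]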
>>
>> So let me back up for a minute and ask about Berkeley DB...
>>
>> We currently run sge_6.2u1 on 1250 or so hosts, with "classic"
>> flat file spooling, and it's pretty stable.
>> When we move to SGE 6.2u5p1, we'd like
>> to use the ARCO reporting package, and I'm blithely assuming
>> that I need a DB with an SQL interface to accommodate this.
>>
>> Is that true? Can we use ARCO w/o DB spooling?
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
>> --
>> Fritz Ferstl | CTO and Business Development, EMEA
>> Univa Corporation | The Data Center Optimization Company
>> E-Mail: [email protected] | Phone: +49.9471.200.195 | Mobile:
>> +49.170.819.7390

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
