On Mon, Apr 11, 2011 at 12:53 PM, Fritz Ferstl <[email protected]> wrote:
> Well, some aren't ;-)
>
> If you have prepared a large pile of work which breaks down into maybe
> hundreds of thousands of jobs, then you don't want the submission to take
> forever (500K jobs would take about 1 hr at 150 jobs/sec if I'm doing the
> math right in my head). With DRMAA it'd be only 10 minutes ... just assuming
> one client.
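[Fritz's throughput estimate assumes a single persistent DRMAA session. A minimal sketch of such a submission loop, assuming the DRMAA 1.0 C binding (drmaa.h / libdrmaa) that ships with Grid Engine; the job command, argument, and count here are illustrative:]

```c
/* Sketch: submit 1000 jobs over one DRMAA session, reusing a single
 * job template, so there is no per-job fork/exec of qsub and its shell.
 * Requires an SGE installation; compile with: gcc submit.c -ldrmaa */
#include <stdio.h>
#include "drmaa.h"

int main(void)
{
    char err[DRMAA_ERROR_STRING_BUFFER];
    char jobid[DRMAA_JOBNAME_BUFFER];
    const char *args[] = { "30", NULL };   /* illustrative: sleep 30 */
    drmaa_job_template_t *jt = NULL;
    int i;

    if (drmaa_init(NULL, err, sizeof(err) - 1) != DRMAA_ERRNO_SUCCESS) {
        fprintf(stderr, "drmaa_init: %s\n", err);
        return 1;
    }
    drmaa_allocate_job_template(&jt, err, sizeof(err) - 1);
    drmaa_set_attribute(jt, DRMAA_REMOTE_COMMAND, "/bin/sleep",
                        err, sizeof(err) - 1);
    drmaa_set_vector_attribute(jt, DRMAA_V_ARGV, args, err, sizeof(err) - 1);

    /* the session and template persist across all submissions */
    for (i = 0; i < 1000; i++) {
        if (drmaa_run_job(jobid, sizeof(jobid) - 1, jt,
                          err, sizeof(err) - 1) != DRMAA_ERRNO_SUCCESS) {
            fprintf(stderr, "drmaa_run_job: %s\n", err);
            break;
        }
    }
    drmaa_delete_job_template(jt, err, sizeof(err) - 1);
    drmaa_exit(err, sizeof(err) - 1);
    return 0;
}
```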
I understand that, but I believe that if 500K-job submissions are frequent
events, then most likely the cluster has a lot of nodes, and that means they
most likely already have some sort of support contract with Oracle or you
guys :-D

The other thing is that DRMAA job submissions are still not as common as
qsubs. However, I've seen a cluster that only gets jobs via DRMAA, so that
might change in the future.

> It's not only that. Qsub obviously also goes through fork/execs of the Unix
> command shell around it each time a job is submitted... while you can write
> a DRMAA client which submits jobs in a loop.

I believe fork/exec is not that bad compared to sending a packet to a remote
machine, especially with the TCP overhead and the network latency.

Rayson

>
> Cheers,
>
> Fritz
>
> Oh, and another thing I did was -b y vs n, and on Linux it also did not
> make much difference with the local disk.
>
> Rayson
>
> On Mon, Apr 11, 2011 at 12:28 PM, Fritz Ferstl <[email protected]> wrote:
>>
>> Another thing you might want to try is submitting jobs via DRMAA instead
>> of via qsub. You'll get roughly 900 jobs / sec submitted with a single
>> DRMAA client. So this will load the system more than when using qsub.
>>
>> Cheers,
>>
>> Fritz
>>
>> On Fri, Apr 8, 2011 at 10:44 AM, Chris Dagdigian <[email protected]> wrote:
>>
>> - Job submission rate and job "churn". I think DanT said this in a blog
>> post years ago, but if you expect to need 200+ qsubs per second then you
>> are going to need Berkeley spooling.
>>
>> I did some classic spooling benchmarks during the weekend:
>>
>> submitting 1000 jobs with 1 qsub session  - 19 sec
>> submitting 2000 jobs with 2 qsub sessions - 20 sec
>> submitting 4000 jobs with 4 qsub sessions - 31 sec
>>
>> (i.e. each session submits 1000 jobs, and the qsub sessions run in
>> parallel.)
>>
>> I then modified the classic spooling code so that qmaster does not
>> write to the disk when jobs are submitted (which is fine as an
>> experiment; as long as the qmaster is not restarted, jobs are not
>> lost), and got identical results.
>>
>> My conclusion is that Linux caches most of the disk writes and thus
>> I/O performance does not affect qsub performance much. However,
>> even with a journaling filesystem with consistency, there is a small
>> chance that some jobs can be lost if the qmaster crashes during the
>> write operations. On the other hand, hardware resource contention
>> and/or LOCK_GLOBAL contention might be causing the slowdown in the
>> 4-parallel-qsub case. And even when the number of worker_threads and
>> listener_threads was increased, the results were the same.
>>
>> http://gridscheduler.sourceforge.net/htmlman/htmlman5/bootstrap.html
>>
>> Hardware: local disk, Thinkpad T510 - 64-bit Linux, 4GB memory, 2
>> cores/4 threads, 2.67GHz
>>
>> I have not benchmarked Berkeley DB spooling, but I believe I will need
>> server hardware to get greater than 200 jobs per second qsub
>> performance.
>>
>> Rayson
>>
>> The same goes for clusters that experience huge amounts of job flows
>> or state changes. I have less experience here, but in these sorts of
>> systems I think binary spooling makes a real difference.
>>
>> My $.02 of course!
>>
>> -chris
>>
>> Mark Suhovecky wrote:
>>
>> OK, I got SGE 6.2u5p1 to build with version 4.4.20 of Berkeley DB,
>> and proceeded to try to install Grid Engine on the master host
>> via inst_sge.
>>
>> At some point it tells me that I should install Berkeley DB
>> on the master host first, so I do "inst_sge -db", which hangs when it
>> tries to start the DB for the first time. Then, because some
>> days I'm not terribly bright, I decide to see if the DB will start
>> at machine reboot. Well, now it hangs when sgedb start
>> runs from init. Still gotta fix that.
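[For reference, the thread counts and spooling method Rayson mentions live in the qmaster's bootstrap file documented at the bootstrap(5) link above. A representative classic-spooling fragment; the paths and values here are illustrative, not a recommendation:]

```
# $SGE_ROOT/$SGE_CELL/common/bootstrap (fragment; illustrative values)
admin_user        sgeadmin
spooling_method   classic
spooling_lib      libspoolc
spooling_params   /opt/sge/default/common;/opt/sge/default/spool/qmaster
listener_threads  4
worker_threads    4
```

[Switching spooling_method to berkeleydb changes spooling_lib and spooling_params accordingly; see bootstrap(5) for the exact forms.]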
>>
>> So let me back up for a minute and ask about Berkeley DB...
>>
>> We currently run sge_6.2u1 on 1250 or so hosts, with "classic"
>> flat file spooling, and it's pretty stable.
>> When we move to SGE 6.2u5p1, we'd like
>> to use the ARCO reporting package, and I'm blithely assuming
>> that I need a DB with an SQL interface to accommodate this.
>>
>> Is that true? Can we use ARCO w/o DB spooling?
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
>> --
>> Fritz Ferstl | CTO and Business Development, EMEA
>> Univa Corporation | The Data Center Optimization Company
>> E-Mail: [email protected] | Phone: +49.9471.200.195 | Mobile:
>> +49.170.819.7390

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
