Re: [gridengine users] Evaluating different scheduler strategies by simulating a past workflow

Stuart Barkley Thu, 29 Sep 2011 19:12:36 -0700

On Thu, 29 Sep 2011 at 09:02 -0000, Fabio Martinelli wrote:

> is there a way to apply again a past workflow by using information
> like that stored inside the ARCO DB, the reporting file or the
> accounting file ?
>
> obviously with a shorter time scale and we don't mind about
> memory consumption and I/O, just slots assignment.


You should be able to extract useful information from the reporting
file showing user, group, wall clock, slots and requested resources
for each job which ran.  This data can be given to a grad student in
simulation to explore.
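
As a starting point for that extraction, here is a minimal Python
sketch that reads the colon-separated accounting file instead of the
reporting file and keeps only the fields a slot-assignment simulation
would need.  The field positions are assumptions taken from the
accounting(5) man page and should be checked against the installed
version; the names JobRecord and parse_accounting are made up for
this example.

```python
from collections import namedtuple

# Hypothetical record type holding only what a slot simulation needs.
JobRecord = namedtuple(
    "JobRecord", "owner group submit start end wallclock slots")

# ASSUMED 0-based field positions in the colon-separated accounting
# file, per accounting(5).  Verify against your SGE version.
F_GROUP, F_OWNER = 2, 3
F_SUBMIT, F_START, F_END = 8, 9, 10
F_WALLCLOCK, F_SLOTS = 13, 34


def parse_accounting(lines):
    """Turn accounting-file lines into a list of JobRecord tuples."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip lines too short to contain the slots field.
        if len(fields) <= F_SLOTS:
            continue
        records.append(JobRecord(
            owner=fields[F_OWNER],
            group=fields[F_GROUP],
            submit=int(fields[F_SUBMIT]),
            start=int(fields[F_START]),
            end=int(fields[F_END]),
            wallclock=float(fields[F_WALLCLOCK]),
            slots=int(fields[F_SLOTS]),
        ))
    return records
```

Typical use would be something like
parse_accounting(open(path_to_accounting_file)), where the path depends
on the local installation.  Requested resources are not extracted here;
they would have to come from the category field or the reporting file.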

I don't think this is what you want, but last year at SC10 this was
"Best Paper" in MTAGS.  Tim Armstrong was the presenter.  It analyzed
utilization and scheduling of autodock for best performance.

    http://www.cs.iit.edu/~iraicu/MTAGS10/paper10.pdf
    http://www.cs.iit.edu/~iraicu/MTAGS10/paper10-slides.pdf

(As I recall) They didn't actually run any jobs, only used data from
previous job runs in simulation.  They also only simulated a single
user, not competing users for limited resources.

> basically we need to understand if the actual scheduler policy is
> "fair", at least according to our personal concept of "fairness"; so
> far it's unfair and as a reaction we should tune some scheduler
> parameters and observe the Sun Grid Engine behavior but as jobs take
> hours or days to complete this tuning process is simply too long to
> manage.

Defining "fair" can be tricky.  It can help when the scheduler has a
defined fair share capability.  You can just enable it and say your
policy is the default policy.

We are only starting to see multiple users with waiting jobs on our
clusters.  I would like to see some better use case documentation
(with complete configuration information) to help us to define our
needs more completely.  A known working and documented configuration
is preferable to a bunch of managers/scientists sitting in a
conference room dreaming up something internally inconsistent and
unimplementable.

Usually we just have one or two users with array jobs in the queue.
Last week we had a case where we actually had 5 users with queued jobs
and a couple of users complained when their jobs didn't start right away.
Watching the jobs in the queue was instructive and it wasn't clear if
the scheduling did the right thing.

Specifically, a large array job using single slots seemed to block an
8-slot job from starting.  The array tasks were completing on
individual nodes, but new 1-slot tasks kept getting started.  At one
point it looked like SGE might have started to drain a node in
preparation for the 8-slot job, but I think that was an incorrect
observation.
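
The blocking effect described above can be reproduced with a toy
slot-only simulation.  The Python sketch below is not a model of SGE's
scheduler: it assumes a single 8-slot node, whole-unit time steps, a
fixed priority order, and made-up job durations.  Its "reserve" flag is
a crude stand-in for resource reservation (nothing behind a waiting job
may start), not SGE's actual reservation mechanism.  The function name
simulate and all job names are hypothetical.

```python
def simulate(jobs, total_slots, reserve):
    """Replay jobs on one pool of slots; return {name: start_time}.

    jobs: list of (name, slots, duration) in priority order, all
          queued at t=0.
    reserve: if True, once a job does not fit, nothing queued behind
             it may start (crude draining).  If False, any later job
             that fits is started greedily.
    """
    for name, slots, _ in jobs:
        if slots > total_slots:
            raise ValueError("job %s can never fit" % name)
    queue = list(jobs)
    running = []            # (end_time, slots) of running jobs
    starts = {}
    t = 0
    while queue:
        # Release slots of jobs that have finished by time t.
        running = [(end, s) for end, s in running if end > t]
        free = total_slots - sum(s for _, s in running)
        still_waiting = []
        blocked = False
        for name, slots, duration in queue:
            if not blocked and slots <= free:
                starts[name] = t
                running.append((t + duration, slots))
                free -= slots
            else:
                still_waiting.append((name, slots, duration))
                if reserve:
                    blocked = True
        queue = still_waiting
        t += 1
    return starts


# Made-up workload: 8 array tasks with staggered run times fill the
# node, then an 8-slot job, then 20 more 1-slot array tasks.
workload = [("a%d" % i, 1, i + 1) for i in range(8)]
workload.append(("big", 8, 4))
workload += [("b%d" % i, 1, 8) for i in range(20)]

greedy = simulate(workload, total_slots=8, reserve=False)
drained = simulate(workload, total_slots=8, reserve=True)
print("8-slot job start, greedy:  ", greedy["big"])
print("8-slot job start, reserved:", drained["big"])
```

With these invented numbers the 8-slot job starts at t=28 under the
greedy rule, only after the array job has run dry, and at t=8 when
nothing is allowed to start behind it.  Feeding in job records from the
accounting or reporting file instead of the made-up workload would be
the next step toward replaying a real workload.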

> just to cite a concrete solution, I never tried this Moab Simulator
> http://www.adaptivecomputing.com/resources/docs/mwm/6-1/Content/topics/analyzing/simulations.html
> but the 'simulation' concept and the related tools seem to be there,
> so I wonder what about {Sun} Grid Engine.

We are running Moab on one cluster but have never tried this.  It
looks good from a sales/marketing position, but looked awkward to use.
Having a simulation tool would be a good idea, but I'm not sure how
practical it would be.

You might also need to worry about jobs with flexible resource
requests (for example jobs which request a range of slots).  The
granted resources you collect from the reporting file may not be what
a different simulation would produce.

Users will also adapt their behavior as you change the system
parameters.  I have read that users will often attempt to game
whatever fairness policy you implement.  They will find the loopholes,
bugs and other features which work in their favor.

Note to Univa: Providing better, complete and current documentation
for the existing SGE is important.  This includes fully worked example
configurations for common current installations.  We have some money
and are willing to buy (and have had discussions with you).  I do
demand proper documentation for a purchased product.  When this also
helps the open source community it is good public relations (including
being payback for the incorporation of open source into your product).

Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone