Added: 
uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part5/ducc-pops-component-rm.tex
URL: 
http://svn.apache.org/viewvc/uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part5/ducc-pops-component-rm.tex?rev=1727979&view=auto
==============================================================================
--- 
uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part5/ducc-pops-component-rm.tex
 (added)
+++ 
uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part5/ducc-pops-component-rm.tex
 Mon Feb  1 17:36:08 2016
@@ -0,0 +1,1109 @@
+% 
+% Licensed to the Apache Software Foundation (ASF) under one
+% or more contributor license agreements.  See the NOTICE file
+% distributed with this work for additional information
+% regarding copyright ownership.  The ASF licenses this file
+% to you under the Apache License, Version 2.0 (the
+% "License"); you may not use this file except in compliance
+% with the License.  You may obtain a copy of the License at
+% 
+%   http://www.apache.org/licenses/LICENSE-2.0
+% 
+% Unless required by applicable law or agreed to in writing,
+% software distributed under the License is distributed on an
+% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+% KIND, either express or implied.  See the License for the
+% specific language governing permissions and limitations
+% under the License.
+% 
+
+
+This chapter provides architectural and implementation details for the DUCC
+Resource Manager, referred to as the ``RM''.
+\section{Introduction}
+
+The DUCC Resource Manager is responsible for apportioning cluster resources to 
the
+collection of ``work'' to be run by users.  Work is classified into several 
categories.  As
+exposed in the public interface, these categories are:
+
+\begin{description}
+  \item[Fair-Share Job] This is a UIMA/AS job, consisting of a minimum of two 
processes and 
+    a potential maximum of as many processes as physically fit on a cluster.  
The work
+    executed by the processes is parallel, enabling the RM to expand or 
contract
+    the job by allocating or deallocating processes as needed to balance the 
load.
+
+    Load is balanced using a weighted fair-share policy in which all users are 
apportioned an
+    equitable amount of the cluster resources, where ``cluster resources'' 
refers only to real
+    memory on the cluster nodes.
+
+  \item[Service Instance] This is any arbitrary process that DUCC manages as a 
``service''.
+    Services are registered with the Service Manager and may be comprised of 
multiple physical
+    processes.  (See the DuccBook for details of DUCC Service Management.)  
The RM schedules these
+    processes as singletons, using a non-preemptive policy (FIXED\_SHARE or 
RESERVE).
+
+  \item[Arbitrary Process or ``Managed Reservation''] These are singleton 
processes of any type, scheduled
+    using a FIXED\_SHARE policy.
+
+  \item[Fixed-Share Job] This is a UIMA-AS job scheduled with a non-preemptable 
+    (i.e., FIXED\_SHARE) policy.
+
+  \item[Reservation] This is a request for a dedicated full machine.
+\end{description}
+
+The RM is a memory scheduler only.  The use case which justifies DUCC is 
+UIMA-AS jobs, each of which consists of a variable number of parallel processes, 
+each of which requires large amounts of memory, usually on the order of 16GB or 
+more.  Memory requirements completely overwhelm other resource requirements, so 
+that jobs scheduled by their declared memory sizes usually get sufficient 
+amounts of other resources such as CPU.
+
+\section{Vocabulary}
+    In order to understand the RM it is necessary to understand some of the 
+    terminology it uses.
+
+    \begin{description}
+    \item[quantum] This is the smallest memory size of an allocation, defined 
in multiples of GB.  It
+      is defined globally in {\em ducc.properties} and may be overridden in 
{\em ducc.classes} for
+      top-level nodepools.  See the DuccBook for more details.
+      
+      Note that although DUCC defines a quantum, most of the RM does not
+      use quantum at all; instead it generalizes quantum into {\em qshare}, 
{\em nshare},
+      and {\em order} as defined below.  When a schedule is returned to the 
Orchestrator, the
+      allocations, in terms of quanta, are translated back to memory 
allocations using this
+      configured {\em quantum}.
+      
+    \item[qshare] This is an abstract memory allocation representing exactly 
one {\em quantum}.  Memory
+      allocations are made in terms of some multiple of {\em qshares}.
+
+    \item[nshare] This is an allocation which consists of one or more 
co-located {\em qshares}.  When 
+      exposed outside of RM this is usually thought of as a ``process''.  It 
means, literally,
+      ``n qshares''.  
+      
+        Be careful: an {\em nshare} is NOT a process; it is an allocation that 
+        can be put to any use, or to no use at all.  The RM does not care what 
+        an {\em nshare} is used for.
+
+      \item[order] This is a number which refers to the number of {\em qshares} 
+        associated with some entity such as a machine, a job, a process, or an 
+        {\em nshare}.  An 
``order 3'' machine is a
+        machine whose memory can be counted as ``three qshares''.  An ``order 
3'' job is a job whose
+        processes must be allocated as ``three qshares'' each, or one ``order 
three'' nshare.
+
+        All {\em qshares} are of order 1, but we don't know how much that is 
without knowing
+        the {\em quantum}.  
+
+        {\em Order} is an abstraction of the {\em quantum}.  Knowing the {\em 
order}
+        of any entity does not tell one how large that entity is.  It does 
tell one
+        how big that entity is in relation to other entities.
+
+        Note that {\em order} is NOT an allocation; it is not a {\em qshare} 
or an {\em nshare}.
+        It is a number which describes the size of things, without saying how 
big that size
+        is or what that thing is.
+
+        Order is used throughout the RM to index arrays and is one of the most 
fundamental
+        concepts in the RM architecture.
+      \end{description}
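+
+      As a concrete illustration of these terms, the following worked example
+      assumes a hypothetical quantum of 15GB; the real value is whatever is
+      configured in {\em ducc.properties} or {\em ducc.classes}.
+\begin{verbatim}
+    quantum = 15GB                    (one qshare == 15GB)
+
+    job declaring 45GB per process:
+        order = 45 / 15 = 3           (each process is one order-3 nshare,
+                                       i.e. 3 co-located qshares)
+
+    machine with 48GB allocatable:
+        order = 48 / 15 = 3           (rounded DOWN; the machine can hold
+                                       3 qshares)
+\end{verbatim}
+      Knowing that both the job and the machine are ``order 3'' says nothing
+      about their absolute sizes; it does say that exactly one of the job's
+      nshares fits on the machine.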
+      
+\section{Architectural Overview}
+    Throughout this section, please refer to the diagram in 
+    \hyperref[fig:rm-structure]{Figure~\ref{fig:rm-structure}}.
+    The diagram shows the normal flow through the scheduler, from the time an 
+    Orchestrator publication arrives to the time the RM publishes its schedule.
+
+\subsection{The Onion}
+    At the highest level of abstraction, the RM consists of three layers as 
seen in the 
+    \hyperref[fig:rm-structure]{Figure ~\ref{fig:rm-structure}} below.  It can 
be thought of as a three-layer onion:
+    \begin{description}
+    \item[JobManagerConverter] This is the outer layer.  This layer 
communicates with the
+      ``outside'' world: the DUCC Orchestrator and DUCC agents.  It is conceived 
+      of as an ``impedance matcher'' that converts incoming messages to the RM's 
+      internal structures, and
+      converts the RM's internal structures into structures recognized by the 
outside world (the
+      DUCC Orchestrator). It is possible to replace this layer without 
affecting the RM proper; for
+      example, to create an isolated simulator for RM development.
+    \item[Scheduler] This is the middle layer.  It communicates on one hand 
with the {\em JobManagerConverter}
+      to receive and send data outside of RM and on the other, with the inner 
layer, the {\em NodepoolScheduler}.
+     \item[NodepoolScheduler] This is the inner layer and is the ``scheduler 
proper''.  Its input is
+       a set of allocation requests and its output is a set of node 
assignments for each request.  These are
+       passed up through the {\em Scheduler} and again up to the {\em 
JobManagerConverter} for publication.
+    \end{description}
+    
+\subsection{Nodes, Machines,  and Node Management}
+    The resources scheduled against are nodes.  On each physical node is a 
DUCC Agent which
+    publishes a regular ``heartbeat''.  The ``heartbeat'' contains data 
describing the
+    characteristics of the node used for scheduling: the memory size, the node 
name, the 
+    node IP.  
+
+    If a node fails, its ``heartbeat'' stops arriving.  After some number of 
missed
+    heartbeats, the RM considers the node to be unusable; it will stop 
scheduling work
+    to that node and attempt to evict existing work from the node.  Work that 
+    can be rescheduled (fair-share jobs and services) is rescheduled on working 
+    nodes.
+
+    It is possible to remove a healthy node from scheduling by using the 
+    {\em vary\_off} utility.  This causes the RM to stop scheduling work to the 
+    specified node, and it causes fair-share work to be evicted and rescheduled 
+    elsewhere.
+
+    The RM component responsible for managing nodes is {\em NodePool.java}.  
As each
+    node heartbeat arrives, the {\em NodePool} is notified and, if it is the 
+    first such notification, creates an object the RM calls a {\em Machine} to 
+    represent the remote resource.
+
+    The RM implements multiple nodepools in a nested or tree structure.  There 
is one NodePool
+    object for each configured nodepool.  The NodePools are structured in a 
self-organizing
+    tree (that is, none of the Java collection classes are used to organize 
multiple nodepools).
+    Most of the methods inside the NodePool module are recursive, aware of 
their parent
+    NodePools and child NodePools.
+
+\subsection{Jobs}
+    The RM is mostly unaware of the purpose for which allocations are made.  
It uses a
+    single structure, the {\em RmJob} to represent all work.  There are a 
small number
+    of minor switches in the code in deference to specific types of work 
(UIMA-AS jobs
+    vs reservations, for example) but these are not sufficient to justify a 
more elaborate
+    object structure.
+
+\subsection{Shares and share-order}
+    All RM allocations are in terms of {\em shares}. A share represents some 
portion
+    of the real memory on a node.  The smallest {\em share} that can be 
+    allocated is
+    called a {\em quantum share}.  The size of the {\em share quantum} is 
declared
+    in {\em ducc.properties} and can be overridden for {\em top-level 
nodepools} in
+    {\em ducc.classes} (see the DuccBook for the details of quantum 
configuration).
+
+    In the RM, a single quantum share is called a {\em Qshare}.  Multiple 
co-located {\em qshares}
+    may be allocated for any given request.  A multi-qshare allocation is 
+    called an {\em nshare}.  An
+    {\em nshare} always corresponds to one of:
+    \begin{itemize}
+      \item A single process of a FAIR\_SHARE job,
+      \item A single process of a FIXED\_SHARE job,
+      \item A single service instance process,
+      \item A single AP (or ``managed reservation''),
+      \item A single unmanaged reservation.
+    \end{itemize}
+
+    Thus a job may be allocated multiple {\em nshares}.  {\em Nshares} are 
usually exposed outside
+    of RM under the term {\em process}.  (There is a difference: an {\em 
nshare} refers to an
+    abstract allocation; a {\em process} refers to some physical process 
running on a node.  RM only
+    needs to deal in abstractions.)
+    
+    The term {\em share-order} is used to refer to the number of {\em qshares} 
associated with
+    an entity.  Entities with associated {\em share-order} include
+    \begin{itemize}
+      \item Jobs.  The {\em share-order} of a job is the number of {\em 
qshares} required
+        to make a single {\em nshare}.  It is a translation of the job's 
declared memory
+        requirements into the number of share quanta required per process.
+      \item Nodes (Machines).  The {\em share-order} of a machine is the 
number of
+        {\em qshares} that can be allocated on that machine.  It is a 
translation of the
+        machine's real memory into the number of share quanta that can be 
allocated to the machine.
+      \item Shares.  The order of a {\em share} is the number of {\em qshares} 
represented
+        by the {\em Share} object.  Note this implies that a {\em Share} 
object always
+        represents a single {\em Nshare}.
+    \end{itemize}
+    
+    All of the important algorithms in the RM involve managing incoming work 
and
+    resources by their ``order''.
+
+    A Job's memory specification is converted to {\em share-order} thus:
+\begin{verbatim}
+    share_order = job_declared_memory / share_quantum
+    if ( (job_declared_memory % share_quantum) > 0 )
+        share_order = share_order + 1
+\end{verbatim}
+    Note that a job's share order is always rounded UP.    
+    
+    A Machine's announced memory is converted to {\em share-order} thus:
+\begin{verbatim}
+    share_order = (int) (allocatable_mem / share_quantum);
+\end{verbatim}
+    Note that a machine's share order is always rounded DOWN.
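+
+    The two conversions can be sketched in Java as follows.  This is an
+    illustration of the arithmetic above, not the RM's actual code; the method
+    and variable names are hypothetical.
+\begin{verbatim}
+    // Convert a job's declared memory (GB) to share order: round UP.
+    static int jobShareOrder(int declaredMemoryGB, int quantumGB)
+    {
+        int order = declaredMemoryGB / quantumGB;
+        if ( declaredMemoryGB % quantumGB > 0 ) {
+            order++;                     // a partial quantum costs a full qshare
+        }
+        return order;
+    }
+
+    // Convert a machine's allocatable memory (GB) to share order: round DOWN.
+    static int machineShareOrder(long allocatableMemGB, int quantumGB)
+    {
+        return (int) (allocatableMemGB / quantumGB);  // integer division truncates
+    }
+\end{verbatim}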
+
+\subsection{Users}
+    Every job that arrives has an associated user.  The RM maintains a map of 
all users and
+    maintains a two-way map of jobs-to-users.  Note that every job has exactly 
one {\em user} 
+    but that every user may have an arbitrary number of {\em jobs}.  Thus, a 
user may
+    be associated with work of different {\em order}s and running in different 
{\em classes}
+    under differing policies.  The structure defined in {\em User.java} 
maintains all
+    necessary records as used by the scheduler.
+
+\subsection{RM Structure Schematic}
+    \begin{figure}[H]
+      \centering
+      \includegraphics[width=5.5in]{images/ducc-internals/rm-structure.png}
+      \caption{RM Structure}
+      \label{fig:rm-structure}
+    \end{figure}
+
+\section{Outer Layer: JobManagerConverter}
+
+    The {\em JobManagerConverter} is the outermost layer of the RM.  It is 
intended as a
+    buffer or ``impedance matcher'' to protect the RM from the ``outside 
world''.  It is
+    also intended to be replaceable as needed. It communicates with the middle 
layer through
+    an interface.   Any entity that (correctly) uses this interface may act as 
the outer layer.
+
+    This section describes the most important
+    functions of the {\em JobManagerConverter} in detail.  We refer to this as 
the 
+    JMC for brevity.
+
+    The primary function of the JMC is to receive incoming work in the form of 
Orchestrator
+    publications and convert them into a set of discrete scheduling events to 
be passed to the inner
+    layers of the RM.
+
+\subsection{Incoming Work}
+    Key methods involved in receiving work and passing it to the next layer 
are described here.
+
+    \paragraph{eventArrives()} receives the {\em DuccWorkMap} from the 
Orchestrator.  
+
+     If the RM is not yet initialized the map is ignored.
+
+     If the RM has been recently reconfigured, all structures in JMC are 
cleared and 
+     state set as if this is the first publication.
+
+     If this is the first publication, we pass the map to the method {\em 
recoverFromOrchestrator}
+     to initialize essential structures for work that has {\em ALREADY} been 
scheduled and is
+     running in other parts of the system.  This step is needed for these 
cases:
+     \begin{itemize}
+       \item DUCC is being started ``warm''.  In this case the Orchestrator 
map may include
+         Reservations, which are permanently scheduled, and must be recovered. 
+       \item The RM may have been stopped (or crashed) and is being restarted.  
+         In this case work of all sorts that was already scheduled must be 
+         recovered.
+       \item The RM may have been dynamically reconfigured.  Dynamic 
reconfiguration requires
+         that all internal structures be reset.  This is equivalent to 
stopping and then
+         restarting the RM.  Work must be recovered.
+     \end{itemize}
+     
+     The incoming map is now saved for the map-difference code.  If this is 
the first publication,
+     RM simply returns.
+
+     All subsequent Orchestrator publications are compared with the previous 
map and
+     all differences are converted to scheduling events. 
+
+     There are three types of events:
+     \begin{description}
+       \item[New work] If the work has never been seen before, it is passed to 
the method
+         {\em jobArrives} for conversion into the RM internal structure {\em 
RmJob}.  The new
+         work is passed to the middle layer {\em Scheduler} via {\em 
signalNewWork()}.
+       \item[Completed Work] If the work is marked {\em completed} by the 
Orchestrator it is
+         removed from the local map and the {\em Scheduler} is signalled via 
{\em signalCompletion()}.
+       \item[Existing Work] The associated {\em process map} for each DuccWork 
object is differenced against the
+         previous map to identify processes which may have
+         completed or otherwise changed state.  The {\em RmJob} is fetched 
from {\em Scheduler} and
+         the state of its shares or the job itself is updated.  If a process 
is completed,
+         the {\em Scheduler} is signalled via {\em signalCompletion()}, 
overloaded on share instead of job.  If
+         at least one process has reached the {\em Running} state the {\em 
RmJob} is notified so
+         the {\em expand-by-doubling} policy can be enacted.
+     \end{description}
+
+     Once the incoming events are processed the middle layer is signaled by 
invoking the method
+     {\em schedule()}.
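+
+     The shape of the differencing logic can be sketched as follows.  This is a
+     simplified illustration with hypothetical helper names; the real code in
+     {\em JobManagerConverter} also handles reservations, per-process state
+     changes, and error cases.
+\begin{verbatim}
+    // For each work item in the incoming Orchestrator publication:
+    for ( IDuccWork w : incomingMap.values() ) {
+        IDuccWork previous = localMap.get(w.getDuccId());
+        if ( previous == null ) {                  // never seen before: new work
+            IRmJob j = jobArrives(w);              // convert to internal RmJob
+            scheduler.signalNewWork(j);
+        } else if ( w.isCompleted() ) {            // marked completed by the OR
+            localMap.remove(w.getDuccId());
+            scheduler.signalCompletion(w.getDuccId());
+        } else {                                   // existing work: difference
+            updateProcesses(previous, w);          // the process maps, signal
+        }                                          // exited shares, etc.
+    }
+    scheduler.schedule();                          // run the scheduling pass
+\end{verbatim}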
+
+\subsection{Outgoing Events}
+     A schedule is returned from {\em Scheduler} in the form of a {\em 
JobManagerUpdate} object.  This
+     object must be translated into an outgoing publication of the form 
expected by the Orchestrator.  The
+     {\em JobManagerUpdate} is passed to the {\em 
JobManagerConvert.createState()} method for conversion.
+
+     The final publication is returned in the form of a {\em RmStateDuccEvent} 
which is then passed to the
+     Camel framework for publication.
+
+\section{Middle Layer: Scheduler}
+
+    The ``middle layer'' is implemented in {\em Scheduler.java}.  This entity 
must conform to the
+    interface {\em ISchedulerMain} to maintain the layered ``onion'' 
architecture.  The ``outer layer''
+    does all its interactions with the scheduler proper through this interface.
+
+    The middle layer is relatively straightforward.  It is the middle-man 
between the {\em JobManagerConverter}
+    and scheduler proper, responsible for initialization, global bookkeeping, 
and dispatching of events
+    to the correct objects.  We'll simply list the important functions
+    and how to find them:
+
+    \begin{description}
+      \item[Initialization] {\em Scheduler.init()} is called from the DUCC 
infrastructure for RM,
+        {\em ResourceManagerComponent}.  RM configuration from {\em 
ducc.properties} is loaded, the
+        class configuration from {\em ducc.classes} is loaded, the RM 
configuration is announced to the
+        log, the database of dynamic RM data is cleared, and the 
``initialized'' flag is set to ``true''.
+      \item[Class configuration] This ({\em initClasses()}) is invoked out of 
{\em init()}.  The class configuration is loaded
+        into the common {\em NodeConfiguration} object and a set of {\em 
ResourceClass} objects is
+        created.  The {\em NodePool} objects are instantiated.
+      \item[Re-configuration] This is implemented in the {\em reconfigure()} 
method.  Most internal structures
+        are cleared and released and {\em init()} invoked as described above.
+      \item[Node Publications] All node publications are passed to the method 
{\em nodeArrives()}. This
+        method does a bit of bookkeeping, works out the {\em order} of the 
node, records the {\em node heartbeat},
+        and passes the node to its NodePool for future scheduling.
+      \item[Run Scheduler] The {\em schedule()} method is invoked from the 
outer layer as described
+        in the previous section.  This method drains incoming events, 
``recovers'' any
+        previously-scheduled work, updates state to reflect processes which 
have exited, and enters
+        new jobs and users into the system.  It then invokes the ``inner 
layer'', the {\em
+          NodePoolScheduler} on each top-level nodepool.  This results in 
creation of the new
+        schedule which is passed back to the outer-layer for publication by 
means of the {\em
+          dispatch()} method.
+      \item[Dispatch] This method ({\em dispatch()}) records the current 
schedule in the log
+        and converts the schedule into a form usable by the {\em 
JobManagerConverter} for publication. The
+        object created here, {\em JobManagerUpdate} is passed up and published.
+      \item[CLI] All CLI methods are handled here, passed in from the outer 
layer from {\em ResourceManagerComponent}.
+        
+    \end{description}
+
+\section{Inner Layer: NodepoolScheduler and NodePool}
+   The {\em NodePoolScheduler} and its helper {\em NodePool} comprise the 
+   ``scheduler proper''.  They are both relatively complex.  This section 
+   discusses their architecture and the general flow of data through them.  
+   Readers are advised to have code listings handy if the goal is to fully 
+   understand the DUCC Scheduler.
+
+   The {\em NodepoolScheduler} is the ``main'' scheduler.  An analogy would be 
that it is the ``frontal cortex''
+   of the brain, doing most of the abstract reasoning required to form a 
schedule.
+
+   The {\em NodePool} is a helper class, responsible for managing physical 
layout of processes ({\em ``nshares''})
+   over physical nodes ({\em Machines}).  It can be thought of as the 
``cerebellum'', controlling the ``arms and legs''
+   of the schedule.
+
+    The scheduling rule {\em ``priority''} is implemented by executing the 
{\em How Much} and {\em
+      What Of} phases once for each priority, starting with the ``best'' 
priority, down to the
+    ``worst'' priority.  At each stage the scheduler attempts to give away all 
of its
+    resources. Each subsequent cycle through this loop will generally have 
fewer resources to
+    allocate until either all work is scheduled, or all resources are 
exhausted, whichever comes
+    first.
+
+    After the first two phases are complete in all passes, all fair-share jobs 
are iterated and any job whose
+    physical allocation exceeds the number of resources counted in the ``How 
Much'' phase has its surplus
+    processes preempted.  (These preempted resources are NOT added back to the 
resource pools until the Orchestrator
+    confirms they have exited; hence they aren't accounted for in the ``what 
of'' phase AT THIS POINT. They
+    will be used once they are known to be free.)
+
+    Finally, the {\em defragmentation} phase is executed.
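+
+    Schematically, a scheduling cycle therefore looks like the following.  This
+    is a sketch of the control flow only, with illustrative names; the actual
+    organization of the code is somewhat different.
+\begin{verbatim}
+    // Priorities are visited "best" first; each pass schedules against
+    // whatever resources the earlier passes left over.
+    for ( int priority : prioritiesBestToWorst ) {
+        howMuch(priority);      // fair-share / fixed / reserve counts
+        whatOf(priority);       // map the counts onto physical nodes
+    }
+    doEvictions();              // preempt surplus fair-share processes
+    defragment();               // rescue "needy" jobs if necessary
+\end{verbatim}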
+
+\subsection{NodepoolScheduler}
+
+    We will use a ``divide and conquer'' approach to describe the 
+    {\em NodepoolScheduler}.  This component consists of 
+    three primary parts:
+    \begin{description}
+      \item[How Much.]  This phase performs the FAIR\_SHARE calculations as 
well as works out the
+        allotments for FIXED\_SHARE and RESERVE requests.  It assumes an 
``ideal'' configuration of
+        nodes with no conflicts and no fragmentation.  There is one {\em How 
Much} method for
+        each of the three scheduling policies ({\em howMuchFairShare(), 
+        howMuchFixed(),} and {\em howMuchReserve()}).
+          
+
+      \item[What Of.] This phase works closely with the {\em NodePool} to try 
to find available
+        space for the abstract allocations produced by the ``How Much'' phase. 
 It is responsible for initiating
+        preemptions but it never preempts a job below the counts from the 
``How Much'' phase.  It preserves
+        a stable layout by never preempting work that is already allocated 
+        unless that work exceeds
+        some user's fair share as determined by the ``counts'' from ``How 
Much''.
+        
+        Note that because it is constrained by the existing layout it may not 
+        always succeed in laying out all work.  If this occurs we must perform 
``Defragmentation''.
+
+        The three relevant methods are {\em whatOfFairShare(), whatOfFixed(),} 
and {\em whatOfReserve()}.
+
+      \item[Defragmentation] After ``What Of'', a pass is made to insure that 
every job is allocated
+        its fair share.  If not, defragmentation is performed to insure that 
at least ``some minimum'' 
+        number of processes is allocated for every job.  This may involve 
+        preemptions of job processes even for users whose allocations are at or 
+        below their fair share.
+    \end{description}
+
+
+    We now describe these three actions in detail.
+
+\subsubsection{How Much}
+
+    For non-preemptive work this is straightforward: the work is assigned 
whatever is asked for UP TO
+    the configured user allotment (see the DuccBook for details of allotment). 
 Non-preemptive work 
+    belonging to users who are at or over their allotment is deferred and not 
passed to further scheduling stages.
+
+    The FAIR\_SHARE algorithm is performed on each of three {\em entities}: 
The ResourceClass, the User, and
+    the RmJob.  Throughout the rest of the discussion the term {\em entity} is 
used to refer to any
+    of these when the specific type of object is not relevant.  (Each of these 
+    entities implements the generalized {\em IEntity} interface.)
+
+    \begin{description}
+      \item[ResourceClass] Every {\em ResourceClass} is asked to provide a 
summary of how many {\em nshares}
+        of each size it could use, assuming unbounded resources, (but 
constrained by RM rules such
+        as initialization cap and expand-by-doubling).  They produce an
+        array, indexed by {\em share order} of the number of processes of each 
order they want allocated.
+
+        To produce this array, the {\em ResourceClass} iterates all jobs 
+        ({\em RmJob} structures) assigned to the class and asks the same question 
+        of the RmJobs: ``in an unbounded world, what is the maximum number of 
+        processes you require?''.  The
+        responding method, {\em RmJob.getJobCaps()}, examines the number of 
work items not-completed, the number
+        of threads per process, and their {\em process\_deployments\_max} to 
produce an initial guess.  It then
+        takes into account ``doubling'' to revise the estimate down.  It then 
uses the process initialization
+        time and average execution time per work-item to again revise the 
estimate down if it appears
+        new allocations would not be used by the time they were made 
available. (This process is described in greater
+        detail below.)
+
+        The short description of what {\em getJobCaps()} does is this: start 
with the largest reasonable request
+        and whittle it down using the constraints of the RM rules to the 
smallest number of processes that
+        is guaranteed to be used, RIGHT NOW.
+
+        The sum of all job caps by ResourceClass, indexed by {\em 
share\_order} is used to create
+        the scheduling {\em demand.}
+        
+        The NodePools are then interrogated to produce a similar array, 
indexed by {\em share
+          order}, of the number of processes they can provide, accounting only 
+        for existing committed allocations.  This produces an idealized view of 
+        the {\em resources}.
+
+        The algorithm implemented in {\em apportionQShares} then performs a 
FAIR\_SHARE allocation of
+        {\em nshares} to every job by matching {\em demand} with {\em 
resources}. We'll describe this allocation in greater detail below.
+
+        At the end of this phase, every {\em ResourceClass} contains a table 
called {\em given\_by\_order} which is 
+        the number of {\em nshares} indexed by {\em share order} to be 
assigned the jobs in the ResourceClass,
+        according to weighted fair-share.  At this stage
+        it is not known if it is physically possible to actually fulfill these 
allocations.
+
+      \item[User] Next, for each resource class, all the users owning jobs in 
that class are
+        gathered.  The same weighted FAIR\_SHARE code is executed against 
users, counting only jobs
+        in the current class, but using the hard-coded weight of ``1'' (one).  
This results in an
+        equitable distribution of the weighted FAIR\_SHARE allocations from 
the current ResourceClass among
+        the users of that class.
+
+        At the end of this phase, every {\em User} contains a table {\em 
given\_by\_order} which is the total
+        shares allocated to this user, for all jobs in this class.
+
+      \item[Job] After allocating jobs among users for each resource class, 
each {\em User} with
+        jobs in the class has the shares apportioned by the previous steps 
divided equally among all
+        their jobs in that class, again using the same weighted FAIR\_SHARE 
routine with hard-coded
+        weight of ``1'' (one).
+
+        At the end of this phase as before, all affected {\em RmJob}s have a 
table {\em
+          given\_by\_order} which contains the number of {\em nshares} 
assigned to that
+        job.
+    \end{description}
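+
+    The cascade over the three entity types can be summarized as below.  This
+    is a simplification with illustrative signatures; the real code works with
+    the per-order arrays described above and respects nodepool boundaries.
+\begin{verbatim}
+    // Level 1: divide the resources among classes, by class weight.
+    apportionQShares(resourceClasses, totalResources);
+
+    for ( ResourceClass rc : resourceClasses ) {
+        // Level 2: divide the class's allotment among its users, weight = 1.
+        apportionQShares(usersWithJobsIn(rc), rc.given_by_order);
+
+        for ( User u : usersWithJobsIn(rc) ) {
+            // Level 3: divide the user's allotment among that user's jobs
+            // in this class, weight = 1.
+            apportionQShares(jobsOf(u, rc), u.given_by_order);
+        }
+    }
+\end{verbatim}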
+
+\subsubsection{apportionQShares()}
+    The method {\em apportionQShares()} is the method that performs the 
FAIR\_SHARE allocation for the
+    ``How Much'' phase.
+
+    The {\em apportionQshares()} method is much more involved than simply 
performing a weighted apportionment
+    of some abstract number of qshares among the various entities 
(ResourceClass, User, RmJob).  Because
+    every resource may be of different {\em share order}, and the set of jobs 
being scheduled to a ResourceClass
+    are generally of  different {\em share order}, this method must perform an 
equitable distribution of {\em qshares} but it
+    must assign them as {\em nshares} that can also be physically allocated.  
+    We must perform weighted fair-share
+    against the ``demand'' in terms of basic scheduling unit {\em qshares}, 
but we must produce a tentative schedule in terms of
+    {\em nshares} which can be mapped to real, known, physical machines.
+
+    Stated simply, it is useless to allocate shares on a machine of order $n$ 
to a job of order $>n$: the
+    job won't ``fit''.
+
+    In {\em apportionQShares()} we perform a series of iterations by 
decreasing {\em share\_order},
+    each iteration performing ``fair share'' allocation of resources among 
resources of that order,
+    but using the TOTAL demand in {\em qshares} of the entity, ignoring for 
the moment whether it
+    will ``fit''.
+
+    At the end of each iteration, entities which have their ``demand'' 
satisfied at the current
+    order are removed, and the iteration is repeated with the next smaller 
order, until either all
+    ``demand'' is satisfied or all resources are exhausted.
+
+    This produces an imperfect schedule that is ``pretty close'' and is 
computationally simple to
+    produce.  The defragmentation step at the end of scheduling provides 
additional correction.
+
+    The general mechanism is as follows:
+    \begin{itemize}
+      \item Initialize the {\em wanted\_by\_order} structure (the number of 
{\em nshares} of each {\em share order}
+        wanted by all entities in the current pass).
+      \item Starting at the largest share order, called ``current order'',
+        \begin{itemize}
+          \item Calculate weighted FAIR\_SHARE for only entities of ``current 
order'' against all resources of
+            current order or greater, using total unfulfilled {\em demand} for 
the entity.
+          \item Assign new {\em nshares} to the entities, incrementing their 
{\em given\_by\_order} for the current order.
+          \item Decrement entities' {\em wanted\_by\_order} (i.e., their {\em 
demand}.)
+          \item Remove all entities whose total allocation has been satisfied.
+          \item Decrement the ``current order'' and repeat.
+        \end{itemize}
+      \item If any {\em wanted\_by\_order} has non-zero entries, repeat this 
entire procedure until either all of {\em wanted\_by\_order}
+        becomes zero, or until no more resources are assigned (meaning they 
have been exhausted).
+      \end{itemize}
+
+      After this phase has been executed for every entity, every {\em RmJob} 
has a table of
+      ``counts'' which indicates the number of processes to be allocated to it.
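+
+      The order-by-order framing of {\em apportionQShares()} can be sketched as
+      follows.  The weighted fair-share arithmetic itself is elided and the
+      helper names are illustrative.
+\begin{verbatim}
+    // "entities" are ResourceClasses, Users, or RmJobs (all IEntity).
+    while ( unsatisfiedDemandRemains() && resourcesRemain() ) {
+        for ( int order = maxOrder; order > 0; order-- ) {
+            List<IEntity> atThisOrder = entitiesOfOrder(order);
+            if ( atThisOrder.isEmpty() ) continue;
+
+            // Weighted fair-share over each entity's TOTAL qshare demand,
+            // against all resources of this order or larger.
+            assignNShares(atThisOrder, resourcesOfOrderOrLarger(order));
+
+            // Bookkeeping: given_by_order[order]++, wanted_by_order[order]--,
+            // then drop entities whose demand is now satisfied.
+            removeSatisfiedEntities(atThisOrder);
+        }
+    }
+\end{verbatim}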
+
+\subsubsection{What Of}
+
+  The {\em What Of} phase attempts to find physical resources to match the 
``counts'' from ``How Much''.  Note
+  that we no longer deal with Users.  We use ResourceClasses in this phase 
ONLY to find the correct
+  NodePool.  The RmJob is the focal point of ``What Of''.
+  
+  The general mechanism is the same for all types of allocations at this 
point: collect all jobs
+  for each resource class, and ask the NodePool assigned to that class to find 
``open'' spots
+  of the right size for every counted {\em nshare}.
+
+  If the job is already fully allocated (its ``counts'' are less than or 
equal to the number of
+  processes it owns), this phase is done for that job.  If not, the NodePool 
begins a search
+  among its resources for machines with sufficient space for the job.
+
+  Note that pending preemptions are treated the same as non-preempted 
allocations.  Until the Orchestrator
+  has confirmed that a process is completed, the RM assumes the space is still 
occupied.
+
+  The nodepool search may be a recursive search, starting at the nodepool that 
is directly assigned to the
+  current job.  If the job is non-preemptable, there is no recursion: the 
search occurs ONLY in the job's assigned nodepool.  Otherwise
+  the search proceeds as follows:
+  
+  \begin{itemize}
+    \item For each job, set the ``current nodepool'' to the nodepool of the 
+      job's declared class, then:
+        \begin{itemize}
+        \item Collect all machines with sufficient capacity for the current job.
+        \item If a process for the current job can be allocated, do so.
+        \item If no process can be allocated and there are ``child'' nodepools, 
+          set the ``current nodepool'' to the next ``child''.
+        \item Repeat this iteration, descending through the child nodepools, 
+          until a process is allocated or all descendants are exhausted.
+     \end{itemize}
+  \end{itemize}
+  
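+  A sketch of this descent, with illustrative names, might look like the
+  following; the real NodePool search also deals with bin-packing and nodepool
+  compatibility.
+\begin{verbatim}
+    // Try to place one nshare of the given order for job j, starting at the
+    // nodepool of j's class and descending into child nodepools if permitted.
+    Machine findHost(NodePool np, IRmJob j, int order)
+    {
+        List<Machine> candidates = np.machinesWithFreeSpace(order);
+        if ( !candidates.isEmpty() ) {
+            return candidates.get(0);            // space found in this pool
+        }
+        if ( j.isPreemptable() ) {               // only fair-share work descends
+            for ( NodePool child : np.getChildren() ) {
+                Machine m = findHost(child, j, order);
+                if ( m != null ) return m;
+            }
+        }
+        return null;                             // nothing available
+    }
+\end{verbatim}
+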
+\subsubsection{Interlude: Preemption}
+   After {\em What Of}, we must initiate preemptions.  This is relatively 
straightforward and performed
+   BEFORE {\em defragmentation}.  It is performed by the method {\em 
doEvictions()}.
+
+   The {\em NodePoolScheduler} iterates all FAIR\_SHARE jobs and checks their 
{\em given\_by\_order}
+   array against the number of processes actually allocated to the job.  If 
they do not match it is
+   because
+   \begin{enumerate}
+     \item The job is expanding but the {\em What Of} phase could not find 
resources.
+     \item The job should shrink because the {\em How Much} phase reduced its 
fair-share
+       to make space for other work.
+   \end{enumerate}
+   
+   If the job must shrink the RmJob's {\em shrinkBy()} method is called with 
the number of {\em nshares} it
+   must shed.  The {\em RmJob} sorts its shares using investment and preempts 
the requisite number
+   of processes. 
+
+   The investment sort is actually a ``multiple'' sort, comparing data 
provided by the Orchestrator about
+   the {\em processes} assigned to the job.  The 
+   shares (representing physical {\em processes} here)
+   are sorted by {\em least investment} first as follows:
+   \begin{enumerate}
+     \item A share that has not completed initialization is ``less than'' any 
share that has completed 
+       initialization.
+     \item If the two shares have not completed initialization, the share with 
least initialization time is ``less than''
+       the other share.
+     \item If both shares have completed initialization, the share with lowest 
investment is ``less than'' the other share.
+     \item If both shares have the same investment, the share in the ``most 
deeply nested nodepool'' is ``less than'' the other
+       share.
+     \item Otherwise, the share with the lowest ID is ``less than'' the other 
share (the newest share has the lower ID).
+   \end{enumerate}
+
+   NOTE: This is a significant simplification over the original eviction code. 
 The original code is still
+   in the source files under {\em shrinkByInvestment()}, for reference, but it 
is no longer used.
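+
+   A comparator embodying these rules might look like the sketch below; the
+   accessor names are hypothetical and the actual implementation differs in
+   detail.
+\begin{verbatim}
+    // Sorts shares "least investment" first; the head of the sorted list is
+    // the first candidate for preemption.
+    int compare(Share s1, Share s2)
+    {
+        if ( s1.isInitialized() != s2.isInitialized() ) {
+            return s1.isInitialized() ? 1 : -1;       // uninitialized evicts first
+        }
+        if ( !s1.isInitialized() ) {                  // both still initializing
+            return Long.compare(s1.getInitTime(), s2.getInitTime());
+        }
+        if ( s1.getInvestment() != s2.getInvestment() ) {
+            return Long.compare(s1.getInvestment(), s2.getInvestment());
+        }
+        if ( s1.getNodepoolDepth() != s2.getNodepoolDepth() ) {
+            return -Integer.compare(s1.getNodepoolDepth(), s2.getNodepoolDepth());
+        }
+        return Long.compare(s1.getId(), s2.getId());  // newest (lowest id) first
+    }
+\end{verbatim}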
+
+   All preempted shares remain attached to their jobs and are NOT deallocated 
until the Orchestrator 
+   confirms their exit.  They are marked ``pending removal'' however, so the 
existing bookkeeping is able
+   to account for them during future preemption stages and defragmentation.
+
+   NOTE: Once a share is published as ``preempted'' to the Orchestrator, it 
cannot be retrieved.  Thus, if
+   the preemption takes a long time to complete, and the state of the system 
changes so the job can
+   re-expand, the preemption is NOT canceled.  This can be observed 
occasionally in the logs as
+   jobs that are both shrinking and expanding simultaneously.
+
+\subsubsection{Defragmentation}
+
+  Once preemption is complete the {\em defragmentation} phase begins.
+
+  Because the ``counts'' from ``How Much'' are {\em abstract} counts, derived 
from an idealized set
+  of resources representing real, physical machines as presented by the 
NodePools, the ``What Of''
+  phase will ALWAYS succeed in finding allocations, IF no preemptions are 
required and if there is no
+  fragmentation in the system.  The ``What Of'' phase always attempts to 
minimize fragmentation by
+  using a simple bin-packing scheme that packs the largest allocations first 
and the smaller
+  allocations in the ``holes''.
+
+  Here is a very simple example of fragmentation.
+
+  Figure~\ref{fig:rm-fragmentation-1} shows an ideal allocation of two jobs of 
+  different sizes.  Job A has been assigned 5 {\em qshares} for 5 order-1 
+  {\em nshares} (processes).  Job B is assigned 4 {\em qshares} for 2 order-2 
+  {\em nshares}.  Both jobs are fully allocated and ``What Of'' will generally 
+  be successful in accomplishing this.
+
+    \begin{figure}[H]
+      \centering
+      \includegraphics[width=5.5in]{images/ducc-internals/rm-structure-1.png}
+      \caption{Unfragmented Layout}
+      \label{fig:rm-fragmentation-1}
+    \end{figure}
+
+  However, as time proceeds and jobs come and go, it is possible that job A 
would get
+  allocated as in Figure~\ref{fig:rm-fragmentation-2}.  Now job B can only 
+  get 1 process: exactly HALF its
+  ``deserved'' allocation.  It would be necessary to preempt one of job A's 
processes to make space,
+  even though job A is not above its fair-share allocation.
+
+    \begin{figure}[H]
+      \centering
+      \includegraphics[width=5.5in]{images/ducc-internals/rm-structure-2.png}
+      \caption{Fragmented Layout}
+      \label{fig:rm-fragmentation-2}
+    \end{figure}
+
+  Of course this is a simplistic example.  In general the situation is 
significantly more complex.
+
+  The goal of {\em defragmentation} is to reconfigure job A as in Figure 
~\ref{fig:rm-fragmentation-1} so that
+  job B can get its full allocation.
+
+  The general procedure for defragmentation is as follows:
+
+  \paragraph{Detection} This is performed in the method {\em 
detectFragmentation()}.
+  After ``What Of'', all jobs are iterated.  Two numbers  are derived for each 
job:
+  \begin{enumerate}
+    \item The number of ``deserved'' shares.  During the ``How Much'' phase, 
+      we perform a weighted fair-share assignment of resources. Often a job
+      cannot use its full ``fair share'' allotment; for example, it may be a 
+      new job and only need two initial processes.  The extra resources are
+      apportioned to other jobs which end up with MORE than their proper
+      weighted fair-share allotment.
+
+      The ``deserved'' shares is a user's TRUE fair-share allotment, 
+      calculated BEFORE bonus shares are allocated to it.  This number is
+      calculated during the ``How Much'' phase and stored in each RmJob as
+      the {\em pure\_fair\_share}.
+
+    \item The number of allocated shares.  This number is calculated (in {\em 
RmJob}) as
+\begin{verbatim}
+   shares_allocated + pending_expansions - pending_preemptions
+\end{verbatim}
+    \end{enumerate}
+
+    If the number of ``deserved'' shares is greater than the number of 
allocated
+    shares (accounting for expansion and preemption), the job is considered
+    ``potentially needy''.
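+
+    In other words, the test is roughly the following; the field names follow
+    the text above.
+\begin{verbatim}
+    int allocated = shares_allocated + pending_expansions - pending_preemptions;
+    boolean potentiallyNeedy = pure_fair_share > allocated;
+\end{verbatim}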
+
+    If there are no ``potentially needy'' jobs, {\em defragmentation} is done 
and
+    we can proceed to broadcast the schedule.
+
+    The second goal of defragmentation is to minimize ``churn'' in the system. 
 We
+    do NOT attempt to achieve a perfect layout.  Instead, there is a threshold
+    minimum number of processes we try to guarantee every job.  This number is 
configured
+    in {\em ducc.properties} as the {\em ducc.rm.fragmentation.threshold}.
+
+    A pass is now made over every ``potentially needy'' job.  Every such job 
with an
+    allocation that is greater than the {\em fragmentation threshold} is 
removed from the
+    ``needy'' list.  All remaining jobs are considered ``actually needy''.
+
+    If there are no ``actually needy'' jobs, {\em defragmentation} is done and 
we can
+    proceed to broadcast the schedule.
+
+    Otherwise, the method {\em doFinalEvictions()} is called to try to make 
space for
+    ``actually needy'' jobs.  We perform a ``take from the rich and give to 
the poor''
+    procedure to insure that jobs whose allocations are below both their 
``deserved fair share''
+    and the ``fragmentation threshold'' are assigned additional resources.
+
+    NOTE: This procedure works for non-preemptable allocations as well.  For 
non-preemptable
+    allocations, the ``deserved'' value is exactly 1 {\em nshare} and any such 
job
+    with no allocations is considered ``actually needy''.
+
+    We iterate all users and add up the total {\em qshares} occupied by all 
their jobs,
+    ordering the users by this value, known as their ``wealth''.
+
+    We iterate the ``actually needy'' jobs.  For each such job we iterate the 
``wealthy'' users,
+    starting from the ``wealthiest'', inspecting their jobs to see if any of 
the processes are
+    allocated over resources that can be allocated to the needy job.  Note that 
+    removal of a share must NOT result in an otherwise non-needy job becoming 
+    ``needy''.  If such a process is found, the user's wealth is decremented and 
+    one of two things occurs:
+    \begin{enumerate}
+      \item If the selected process is a ``pending expansion'' that has not 
been published,
+        it is immediately reassigned to the needy job.  (Note that this is an 
optimization and
+        the one exception to the rule that once an allocation is finalized in 
+        {\em RmJob}
+        it cannot be changed.) If the job is no longer needy it
+        is removed from the needy list.
+      \item Otherwise, the selected process is preempted and the needy job is 
placed on
+        a global ``needyJobs'' list.  Jobs on this list get priority 
allocation BEFORE
+        any new allocations are made in all subsequent scheduling cycles, 
until they
+        are no longer needy.
+    \end{enumerate}
+    
+    Note the conditions which must be met by a process before it can be 
donated to a needy job
+    (verified in method {\em takeFromTheRich()}):
+    \begin{itemize}
+      \item The machine containing the share must be of sufficient {\em share 
order}.
+      \item The share must be preemptable.
+      \item The machine must be in a compatible nodepool.
+      \item If this share is evicted, the owning job must not become ``needy''.
+      \item If this share is evicted, it must leave sufficient space on the 
machine for the new
+        share.  That is, if it is impossible to clear enough space on the machine 
for the needy job,
+        there is no point evicting this share. We iterate all shares on the 
machine at this point
+        and try to evict sufficient shares (which of course must belong to 
``wealthy'' users) to
+        make space for the needy share.
+    \end{itemize}
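+
+    As a rough sketch, the eligibility test for a candidate share resembles the
+    following; the predicate names are illustrative and not the actual
+    {\em takeFromTheRich()} code.
+\begin{verbatim}
+    boolean canDonate(Share candidate, IRmJob needyJob)
+    {
+        Machine m = candidate.getMachine();
+        if ( m.getShareOrder() < needyJob.getShareOrder() ) return false;
+        if ( !candidate.isPreemptable() )                   return false;
+        if ( !compatibleNodepool(m, needyJob) )             return false;
+        if ( wouldBecomeNeedy(candidate.getJob()) )         return false;
+        // Finally: evicting this share (plus, possibly, other shares owned by
+        // "wealthy" users on the same machine) must clear enough space on the
+        // machine for one nshare of the needy job.
+        return canClearSpaceFor(m, needyJob);
+    }
+\end{verbatim}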
+             
+\subsection{NodePool}
+
+    The {\em NodePool} object manages physical nodes, represented in the RM by 
an
+    object called {\em Machine}.  The collections of NodePools form a tree 
structure
+    with each ``nested'' nodepool managed as a ``child'' of its ``parent'' 
nodepool.
+    
+    There are many more methods in NodePool than are documented here.  In this 
section
+    we only review the most important, or the most complicated methods.
+
+    The RM supports multiple disjoint NodePools, known as ``top-level'' 
nodepools.  The collection
+    of ``top-level'' nodepools partitions the entire nodespace into 
independently scheduled
+    resource collections.  The motivation is to permit multiple, disparate 
collections of nodes to
+    be managed under a single DUCC system (rather than run multiple 
independent DUCC systems).
+
+    Most of the NodePool algorithms are recursive.  Both the {\em Scheduler} 
and
+    {\em NodePoolScheduler} object generally interact with the top NodePool of 
each
+    tree, which coordinates, through recursion, the direction of requests to 
the 
+    correct, possibly nested target NodePool.
+
+    For example, to count the machines in a nodepool, one generally wants the 
count of
+    machines in the pool PLUS the machines in its children:
+\begin{verbatim}
+    /**
+     * How many do I have, including recursing down the children?
+     */
+    int countMachines()
+    {
+        int count = allMachines.size();
+        for ( NodePool np : children.values() ) {
+            count += np.countMachines();
+        }
+        return count;
+    }
+\end{verbatim}
+
+    In the cases where recursion is to be inhibited, most of the methods are 
modified
+    with the name ``Local'':
+\begin{verbatim}
+    /**
+     * Non-recursive machine count.
+     */
+    int countLocalMachines()
+    {
+        return allMachines.size();
+    }
+\end{verbatim}
+
+    Most of the methods in {\em NodePool} are short and easily understood, 
like the two above. There
+    are a few subtleties in NodePool which will be expanded upon below.
+
+\subsubsection{NodePool Reset}
+    All scheduling phases must be aware of which physical resources exist, 
+    which are in use, and which are available for scheduling.  As we proceed 
+    with scheduling we need 
to maintain scratch-space
+    that represents the current ``potential'' schedule, but without perturbing 
the existing allocations.
+    
+    The NodePool provides exactly this scratch space.  Before the two main 
scheduling phases, ``How
+    Much'' and ``What Of'', the NodePool is instructed to reset().  The 
NodePool (and recursively,
+    the entire set of nested NodePools), drops all of its structures other 
than the most basic Machine
+    structures and then rebuilds them from the machine structures.  The 
scheduling phases then create
+    ``trial'' schedules, resetting the NodePool as often as necessary.
+
+    This also has the side-effect that errors do not tend to accumulate in the 
system; we essentially
+    reboot the schedule on every pass.
+
+    Finalizing the schedule is done in the Machine objects, with some help 
from the RmJob and
+    Share objects.  
+
+    (NOTE: as an optimization, RM does NOT generally rebuild Machine and RmJob 
from scratch on each
+    Orchestrator publication.  They ARE rebuilt whenever RM starts, and during 
dynamic RM Reconfiguration).
+
+    The next section, {\em Virtual Machines} provides a concrete example of 
the use of NodePool for
+    scratch space during scheduling.
+
+\subsubsection{Virtual Machines}
+    Probably the most important part of the scheduler is encapsulated in the 
NodePool method,
+    {\em rearrangeVirtual}.  This method treats the collection of all ``real'' 
machines as
+    a collection of ``virtual'' machines which is the resource set that is 
scheduled against.
+
+    In the RM's view, a ``virtual machine'' is any PROPER subset of a ``real 
machine''.  (Recall
+    the mathematical definition of a PROPER subset is any subset of some set 
that is not equal
+    to that set.)
+
+    As soon as an allocation of a single {\em nshare} is made against a 
machine, that machine's
+    capacity for further allocations is diminished until the allocation is 
released by the Orchestrator.  For example, an
+    order-3 allocation against a ``real'' order-5 machine results in the 
diminution of the order-5 machine
+    to a ``virtual'' order-2 machine.  To put it differently, making a 
3-quantum allocation against a 5-quantum
+    ``real machine'' results in a 2-quantum ``virtual machine''.
+    
+
+    To understand what {\em rearrangeVirtual()} does it is important to 
understand three tables.
+    These three tables are indexed by {\em share order} and are the key 
structures for both ``How
+    Much'' and ``What Of''.  These tables are:
+    \begin{description}
+      \item[nMachinesByOrder] This table contains the number of full, free 
``real machines'' with no allocations,
+        indexed by {\em share order} 1, 2, ... {\em maxorder}.
+      \item[vMachinesByOrder] This table contains the number of ``virtual 
machines'' indexed by {\em share order.}
+      \item[nSharesByOrder]  This table contains the number of {\em nshares} 
of every order which can be
+        currently allocated.
+    \end{description}
+    
+    There is no overlap between ``nMachinesByOrder'' and ``vMachinesByOrder''. 
Therefore, the number
+    of schedulable ``machines'' of any kind for some specific order {\em O} is
+\begin{verbatim}
+    nMachinesByOrder[O] + vMachinesByOrder[O]
+\end{verbatim}
+
+\paragraph{nSharesByOrder} is derived from the two machine tables and the 
meaning of its values
+    is subtly different.  The numbers in the machine tables are independent of 
each other.  For example, if
+    there is a single order-5 ``real machine'', this does NOT imply that there 
is also an order-3
+    ``virtual machine'' and an order-2 ``virtual machine''.  This breakdown 
can only happen after
+    allocation.
+
+    {\em nSharesByOrder} however, gives the number of {\em nshares} of an 
order that might be
+    allocated from any possible machine, real or virtual, allowing that a 
larger share may need to
+    be split.  Each value in the table is dependent on the values of higher 
order in the table.  For
+    example, if there is 1 order-5 ``real machine'', nSharesByOrder will 
indicate there is 1
+    order-5 share available, or 1 order-4 share, or 1 order-3 share, or 2 
order-2 shares, or 5
+    order-1 shares.  Here is an example of what these tables might look like 
at some point during scheduling:
+\begin{verbatim}
+    Order                  0  1  2  3  4
+    -------------------  --------------------
+    nMachinesByOrder[]:  [ 0  2  0  1  4 ] - physical machines
+    vMachinesByOrder[]:  [ 0  1  2  0  0 ] - virtual machines
+    nSharesByOrder[]:    [ 0 26 11  5  4 ] - collective nshares for each order
+\end{verbatim}
+    
+\subsubsection{rearrangeVirtual(Machine M, order O)}
+    We can now explain this method.  This is called when we wish to allocate a 
single
+    {\em nshare} of order {\em O} from machine {\em M}.  The accounting works 
as follows:
+    if the machine has no allocations, decrement {\em nMachinesByOrder} at the 
+    machine's real order by one; else decrement {\em vMachinesByOrder} at the 
+    machine's current free (virtual) order by one.  If the allocation would 
+    cause the 
+    free space to be split, calculate the order of the free space after 
allocation and 
+    increment the correct value in {\em vMachinesByOrder} like this:
+
+    \paragraph{First Step}: Update the two machine tables.
+\begin{verbatim}
+    int v_order = M.getVirtualShareOrder();    // How much free space in the machine?
+    int r_order = M.getShareOrder();           // How much total space in the machine?
+
+    if ( v_order == r_order ) {                // Free == total?
+        nMachinesByOrder[r_order]--;           // Yes, full machine allocation
+    } else {
+        vMachinesByOrder[v_order]--;           // No, virt machine allocation
+    }
+
+    v_order -= O;                              // Does it cause a split?
+    if ( v_order > 0 ) {                       // Yes
+        vMachinesByOrder[v_order]++;           // Add a "new", smaller virt machine
+    }
+\end{verbatim}
+    There are, of course, additional details, which can be seen by inspecting 
the
+    full source listing.
+
+    \paragraph{Second Step} Update the share table.  We initialize the table 
with the total of real
+    and virtual machines by order.  Then in a double iteration, look 
``forward'' to count the number
+    of shares that might be acquired from higher order allocations by 
splitting the space.  The full
+    method is included here for the curious.  Everyone else can simply trust 
that it is correct.
+\begin{verbatim}
+    protected void calcNSharesByOrder()
+    {
+        int len = nMachinesByOrder.length;
+
+        // init nSharesByorder to the sum of 'n and 'v MachinesByOrder
+        System.arraycopy(nMachinesByOrder, 0, nSharesByOrder, 0, len);
+        for ( int i = 0; i < getMaxOrder() + 1; i++ ) {
+            nSharesByOrder[i] += vMachinesByOrder[i];
+        }
+
+        for ( int o = 1; o < len; o++ ) {             // counting by share order
+            for ( int p = o+1; p < len; p++ ) {
+                if ( nSharesByOrder[p] != 0 ) {
+                    nSharesByOrder[o] += (p / o) * nSharesByOrder[p];
+                }
+            }
+        }
+    }
+\end{verbatim}
+    
+\subsubsection{connectShare(Share s, Machine m, IRmJob j, int order)}
+    This helper method is responsible for updating all the records in order to
+    allocate a specific share on a specific machine for a specific job.  Its
+    action is irreversible: once this method is called, the share is 
irrevocably
+    assigned to the given job on the given machine (except sometimes, during
+    defragmentation, as described above).
+
+    {\em rearrangeVirtual()} is called at the end to update the internal 
``counts''.
+
+\subsubsection{compatibleNodepool(Policy p,  ResourceClass rc)}
+    This method determines if the current nodepool is compatible with the 
indicated
+    scheduling policy and resource class.  If the policy is FAIR\_SHARE, 
recursion
+    through the child nodes is performed.
+
+\subsubsection{nodeArrives}
+    This straightforward method adds a node to the list of schedulable nodes.  
It updates the
+    database, deals with unresponsive nodes becoming responsive again, and does
+    simple bookkeeping.
+
+\section{RmJob}
+
+    The RmJob is mostly an accounting object.  While its implementation has 
many details,
+    there are two important methods: {\em calcJobCaps()} and {\em shrinkBy(int 
count)}, both
+    of which were briefly mentioned above.
+
+\subsection{calcJobCaps()}
+    If the {\em rearrangeVirtual()} code described above has a rival for 
``most important method'',
+    it would be the RmJob's {\em calcJobCaps()}.  This method is called many 
times throughout
+    scheduling and is required to return {\em exactly} the number of shares 
the job could make
+    use of at the current moment, if there were unbounded resources.
+
+    Note that this is the method to modify if you wish to change the rate of 
expansion or
+    contraction of a job.
+
+    Because it is called so often, the scheduler iterates all jobs at the 
start of each 
+    scheduling cycle and calls {\em initJobCap()} to calculate the cap based 
on current job
+    state.  This caches the actual cap, which is returned in subsequent calls 
to
+    {\em calcJobCaps()}.
+
+    The design point is this:  Estimate the cap as the largest value that is 
meaningful.  Then
+    whittle it down to the minimum by applying the architected constraints 
such as
+    the ``initialization cap'' and prediction of when we expect the job to 
complete.  We want
+    everything we can get but no more than we can use.
+
+    This code can be tricky to understand so we'll present it here.  The 
returned ``actual\_cap'' is
+    the value used by NodePoolScheduler's ``How Much'' phase for all {\em 
entities} to determine share allocations.
+
+    The following steps are taken by {\em initJobCap()}:
+    \begin{enumerate}
+      \item If the job is unschedulable (refused), set cap to 0 and return. 
(No shares will be allocated.)
+      \item If the job is completed but not yet deallocated, set the cap to the total shares
+        it already has allocated and return.  (No additional shares will be allocated.)
+      \item Set the tentative cap to the number of remaining {\em work items} divided by the declared
+        threads per process.  This is the upper bound on the cap:
+\begin{verbatim}
+        c = n_remaining_questions / nthreads;
+\end{verbatim}
+
+      \item Adjust the tentative cap to the maximum of ``c'' and the number of shares already
+        allocated.  This accounts for jobs ``winding down'': when work items start to vacate
+        processes we have more processes than are needed for the remaining work, but we
+        want to ensure that ``How Much'' does not cause premature shrinkage.
+\begin{verbatim}
+        int currentResources = countNShares();
+        c = Math.max(c, currentResources); 
+\end{verbatim}
+
+      \item Adjust the tentative cap to the minimum of ``c'' and the declared {\em process\_deployments\_max}.
+        Call this the ``base cap''.  It is the job cap before accounting for prediction and is
+        used if we cannot find a better estimate.
+\begin{verbatim}
+        int base_cap = Math.min(getMaxShares(), c);
+\end{verbatim}
+          
+      \item Predict the number of shares this job could use on an unbounded system,
+        based on the average initialization time of its processes and the rate of completion
+        of the work items so far.  Call this the ``projected\_cap''.
+\begin{verbatim}
+        int projected_cap = getProjectedCap();      
+        if ( projected_cap == 0 ) {                   // we know nothing, this is best guess
+               projected_cap = base_cap;
+        }
+\end{verbatim}
+
+      \item All else being equal, the potential cap for the job is now the maximum of the actual
+        resources we have allocated and the projected cap.  It is the largest number of
+        resources we believe the job can ever use.
+\begin{verbatim}
+        potential_cap = Math.max(projected_cap, currentResources);
+\end{verbatim}
+
+      \item If we're still initializing, and we have configured {\em ducc.rm.initialization.cap}
+        in {\em ducc.properties}, revise the cap down and return the {\em actual\_cap}.
+\begin{verbatim}
+        actual_cap = Math.min(potential_cap, resource_class.getInitializationCap());
+\end{verbatim}
+
+      \item If we're still initializing and we do NOT have an initialization cap configured,
+        set the {\em actual\_cap} to the {\em potential\_cap} and return.
+\begin{verbatim}
+        actual_cap = potential_cap;
+\end{verbatim}
+              
+      \item If we've completed at least one initialization, and we have configured
+        {\em ducc.rm.expand.by.doubling}, return the smaller of the {\em potential\_cap}
+        and TWICE the currently allocated resources:
+\begin{verbatim}
+       actual_cap = Math.min(potential_cap, currentResources * 2);
+\end{verbatim}
+
+      \item If we've completed at least one initialization, and we do NOT use
+        expand-by-doubling, return the {\em potential\_cap}:
+\begin{verbatim}
+        actual_cap = potential_cap;
+\end{verbatim}
+
+      \item There is one last corner case.  It is possible the job has
+        shrunk to 0 resources (pushed out by fair-share, for example).  If
+        this has happened we have to restart the doubling, and we need to
+        ask for at least the initialization cap.  But we don't want to go
+        over the ``base\_cap'', which has accounted for the fact that the job might
+        be running down and we can't use the full initialization cap.
+\begin{verbatim}
+        if ( currentResources == 0 ) {
+            actual_cap = Math.max(1, resource_class.getInitializationCap());
+            actual_cap = Math.min(base_cap, actual_cap);
+        }
+\end{verbatim}
+
+
+        \end{enumerate}
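+
+    The consolidated sketch below stitches the steps together; accessor names such as
+    {\em isRefused()}, {\em isInitialized()}, {\em hasInitializationCap()}, and
+    {\em useExpandByDoubling()} are assumptions chosen for readability and do not
+    necessarily match the RmJob source.
+\begin{verbatim}
+    void initJobCap()
+    {
+        if ( isRefused() )   { actual_cap = 0;              return; }  // step 1
+        if ( isCompleted() ) { actual_cap = countNShares(); return; }  // step 2
+
+        int currentResources = countNShares();
+        int c = n_remaining_questions / nthreads;                      // step 3: upper bound
+        c = Math.max(c, currentResources);                             // step 4: no premature shrink
+        int base_cap = Math.min(getMaxShares(), c);                    // step 5
+
+        int projected_cap = getProjectedCap();                         // step 6
+        if ( projected_cap == 0 ) projected_cap = base_cap;            //   no history: best guess
+
+        int potential_cap = Math.max(projected_cap, currentResources); // step 7
+
+        if ( !isInitialized() ) {                                      // steps 8 and 9
+            actual_cap = hasInitializationCap()
+                ? Math.min(potential_cap, resource_class.getInitializationCap())
+                : potential_cap;
+        } else if ( useExpandByDoubling() ) {                          // step 10
+            actual_cap = Math.min(potential_cap, currentResources * 2);
+        } else {                                                       // step 11
+            actual_cap = potential_cap;
+        }
+
+        if ( currentResources == 0 ) {                                 // step 12: restart doubling
+            actual_cap = Math.max(1, resource_class.getInitializationCap());
+            actual_cap = Math.min(base_cap, actual_cap);
+        }
+    }
+\end{verbatim}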
+    
+\subsection{shrinkBy(int count)}
+    This is a rather trivial method, used to implement ``shrink by investment''.  Originally
+    this was a much more involved process, which gradually became refined to its current
+    incarnation.
+
+    All this method does is sort the RmJob shares as described in the interlude above, ``Preemptions'',
+    and delete the indicated number of shares from the front of the sorted list.
+
+    The original {\em shrinkByInvestment()} code has been left in place for reference.
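+
+    A sketch of the current logic, with assumed collection and comparator names:
+\begin{verbatim}
+    void shrinkBy(int count)
+    {
+        List<Share> sorted = new ArrayList<Share>(assignedShares.values());
+        Collections.sort(sorted, new ShareByInvestmentSorter());   // candidates for eviction
+                                                                   // sort to the front, per the
+                                                                   // "Preemptions" interlude
+        for ( int i = 0; i < count && i < sorted.size(); i++ ) {
+            pendingRemoves.add(sorted.get(i));                     // mark for preemption
+        }
+    }
+\end{verbatim}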
+
+\section{Supporting Classes}
+   There are a number of supporting classes mostly used for bookkeeping, mentioned here for completeness.
+
+\subsection{Machine}
+This represents a Node.  A Machine object is created whenever a Node's state arrives.  The Machine
+is entered into an appropriate nodepool.  Machine objects are NEVER destroyed (except during dynamic
+reconfiguration) as it is usually expected that an unresponsive machine will become responsive
+again.  This significantly reduces the complexity of bookkeeping.
+
+\subsubsection{Blacklists and Whitelists}
+The Machine maintains a list of {\em Share}s allocated to it.  It is possible, after changing the
+{\em ducc.classes} configuration and starting RM, that it is no longer legal for these shares to be
+allocated on this machine, or perhaps to be allocated at all.  For example, the machine may have been
+moved to a different class than the class of the work allocated on it, or the class may have been
+deleted entirely.
+
+   If this happens the shares are essentially in ``limbo''.  They cannot (in general) be associated
+   with any resource class and therefore cannot participate in allocations (recall, allocations are
+   done by resource class).  The space must nonetheless be accounted for to avoid double-booking the nodes.
+
+   To cope with this, the RM considers both the {\em Shares} and the {\em Machine} they reside on
+   to be ``blacklisted''.  When a machine is ``blacklisted'',
+   \begin{itemize}
+     \item All work that can be evicted from it is evicted.  This includes any kind of UIMA-AS
+       job (including jobs submitted to non-preemptable classes), and Services.
+     \item No additional allocations can be made to the machine until ALL blacklisted work
+       has been confirmed by the Orchestrator to have left the system.
+   \end{itemize}
+   
+    Once all blacklisted work on a machine has left the system, the machine is ``white-listed'' and
+    allocations on it are resumed.
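+
+    The essence of the blacklist bookkeeping can be sketched as follows; the field and method
+    names are illustrative only:
+\begin{verbatim}
+    Map<DuccId, Share> blacklistedShares = new HashMap<DuccId, Share>();
+
+    boolean isSchedulable()              // no new allocations while blacklisted work remains
+    {
+        return blacklistedShares.isEmpty();
+    }
+
+    void blacklist(Share s)              // the share can no longer be mapped to a class
+    {
+        blacklistedShares.put(s.getId(), s);
+    }
+
+    void confirmedGone(DuccId id)        // Orchestrator confirms the work has left the system
+    {
+        blacklistedShares.remove(id);    // machine is white-listed once this map empties
+    }
+\end{verbatim}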
+
+\subsubsection{Database and Persistence}
+   When any machine arrives in the system, a new record is entered in the database containing its
+   essential data.
+
+   All subsequent state changes for the machine are entered into the database, including the number
+   of consecutive missed Agent heartbeats.
+
+   When a share is assigned to a machine, or leaves a machine, it is the responsibility of the
+   Machine object to record the share and its details in the database.
+
+\subsection{Share}
+    The Share object represents one full allocation.  Internally it is an {\em nshare} and thus
+    has a share order, where the {\em share order} is the number of {\em qshares} it represents.  A
+    share is logically exposed outside of RM as a Process.
+
+    The Share's main purpose is bookkeeping: a place to store investment and initialization time, and
+    to represent the space occupied by a resource allocation.
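+
+    As a small illustration of the order arithmetic (the quantum value here is assumed for the
+    example, not a recommendation): with a share quantum of 15GB, a process declaring 48GB of
+    memory occupies an nshare of order 4, i.e. four qshares:
+\begin{verbatim}
+    int quantum  = 15;                                  // GB per qshare (example value)
+    int declared = 48;                                  // GB declared by the request
+    int order    = (declared + quantum - 1) / quantum;  // round up: ceil(48/15) = 4
+\end{verbatim}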
+
+\subsection{ResourceClass}
+
+    The ResourceClass represents the {\em class} concept as configured in {\em ducc.classes}.  It
+    holds the configured class rules (expand\_by\_doubling, initialization\_cap, etc.).
+
+    Its primary purpose is bookkeeping: a place to organize jobs by class and jobs by user by class,
+    to maintain the set of users authorized for the class, etc.  It also tracks non-preemptable
+    share {\em allotment}.
+
+    The {\em ResourceClass} is a schedulable {\em IEntity}, as described above in the description
+    of the FAIR\_SHARE algorithm.
+
+\subsection{User}
+
+    The User represents a single user.  Its primary purpose is bookkeeping: a place to organize
+    jobs owned by the user.
+
+    The User is a schedulable {\em IEntity}, as described above in the description of the
+    FAIR\_SHARE algorithm.
+
+\subsection{JobManagerUpdate}
+
+    This is a ``transfer object'' used to transfer the current schedule to the publication
+    mechanism and ultimately to the Orchestrator.  It consists of maps of all shares, organized
+    into shares ``expanded'' and ``shrunken'' (preempted).  The RM's publication mechanism
+    translates this into the appropriate format, which is then published to the Orchestrator.
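+
+    Structurally it amounts to little more than the following (the field names here are
+    assumptions used for illustration, not the actual class definition):
+\begin{verbatim}
+    class JobManagerUpdate
+    {
+        Map<DuccId, IRmJob>            allJobs;         // every job known to the scheduler
+        Map<DuccId, Map<Share, Share>> expandedShares;  // newly allocated shares, keyed by job
+        Map<DuccId, Map<Share, Share>> shrunkenShares;  // preempted shares, keyed by job
+    }
+\end{verbatim}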
+    

