Author: challngr
Date: Wed Jun 10 19:04:25 2015
New Revision: 1684738

URL: http://svn.apache.org/r1684738
Log:
UIMA-4109 Updates for 2.0.0.

Modified:
    uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/admin-commands.tex
    uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/ducc-properties.tex
    uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/system-logs.tex

Modified: uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/admin-commands.tex
URL: http://svn.apache.org/viewvc/uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/admin-commands.tex?rev=1684738&r1=1684737&r2=1684738&view=diff
==============================================================================
--- uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/admin-commands.tex (original)
+++ uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/admin-commands.tex Wed Jun 10 19:04:25 2015
@@ -436,7 +436,7 @@ Nodepool power
              
        
 \subsection{rm\_qoccupancy}
-\label{subsec:admin.rm-qoccupancey}
+\label{subsec:admin.rm-qoccupancy}
 
     \subsubsection{{\em Description:}}
    Rm\_qoccupancy provides a list of all known hosts to the RM, and for each host, the following information:

Modified: uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/ducc-properties.tex
URL: http://svn.apache.org/viewvc/uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/ducc-properties.tex?rev=1684738&r1=1684737&r2=1684738&view=diff
==============================================================================
--- uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/ducc-properties.tex (original)
+++ uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/admin/ducc-properties.tex Wed Jun 10 19:04:25 2015
@@ -820,6 +820,8 @@
         \end{description} 
         
       \item[ducc.orchestrator.state.publish.rate] \hfill \\
+          \phantomsection\label{itm:props-or.state.publish.rate}
+
         The interval in milliseconds between Orchestrator publications of its non-abbreviated  
         state. 
         \begin{description}
@@ -1076,6 +1078,8 @@
           
 
         \item[ducc.rm.prediction.fudge] \hfill \\
+          \phantomsection\label{itm:props-rm.prediction.fudge}
+
           When ducc.rm.prediction is enabled, the known initialization time of a job's processes plus 
           some "fudge" factor is used to predict the number of future resources needed. The "fudge" 
           is specified in milliseconds. 
@@ -1093,6 +1097,8 @@
                     
 
         \item[ducc.rm.defragmentation.threshold] \hfill \\
+          \phantomsection\label{itm:props-rm.defragmentation.threshold}
+
           If {\em ducc.rm.defragmentation} is enabled, limited defragmentation of resources is
           performed by the Resource Manager to create sufficient space to schedule work 
           that has insufficient resources (new jobs, for example).  The term

Modified: uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/system-logs.tex
URL: http://svn.apache.org/viewvc/uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/system-logs.tex?rev=1684738&r1=1684737&r2=1684738&view=diff
==============================================================================
--- uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/system-logs.tex (original)
+++ uima/sandbox/uima-ducc/trunk/uima-ducc-duccdocs/src/site/tex/duccbook/part4/system-logs.tex Wed Jun 10 19:04:25 2015
@@ -20,19 +20,24 @@
  \section{Overview}
 
     This chapter provides an overview of the DUCC process logs and how to interpret the
-    entries therin.
+    entries therein.
 
     Each of the DUCC ``head node'' processes writes a detailed log of its operation to
     the directory \ducchome/logs.  The logs are managed by Apache log4j.  All logs are
     managed by a single log4j configuration file
 \begin{verbatim}
-        DUCC_HOME/resources/log4j.xml
+        $DUCC_HOME/resources/log4j.xml
 \end{verbatim}
 
     The DUCC logger is configured to check for updates to the log4j.xml
     configuration file and automatically update without the need to restart any of
     the DUCC processes.  The update may take up to 60 seconds to take effect.
 
+    The DUCC logger is loaded and configured through the log4j API such that other
+    log4j configuration files that might be in the classpath are ignored.  This also
+    means that log4j configuration files in the user's classpath will not interfere
+    with DUCC's logger.
+
     The logs are set to roll after reaching a given size, and the number of generations
     is limited to prevent overrunning disk space.  In general the log level is set to
     provide sufficient diagnostic output to resolve most issues.
@@ -52,6 +57,8 @@
       \hline
           Process Manager & pm.log \\
       \hline
+          Web Server & ws.log \\
+      \hline
           Agent & {\em [hostname].agent.log } \\
       \hline
     \end{tabular}
@@ -59,9 +66,40 @@
     Because there may be many agents, the agent log is prefixed with the name of the host for
     each running agent.
 
-\section{Common Log Format}
-     
-    Timestamp   LOGLEVEL  COMPONENT.sourceFileName method-name Jobid-or-NA text
+    The log4j file may be customized for each installation to change the format or content of the
+    log files, according to the rules defined by log4j itself.  This section defines the
+    default configuration.
+
+    The general format of a log message is as follows:
+\begin{verbatim}
+    Timestamp   LOGLEVEL  COMPONENT.sourceFileName method-name J[Jobid-or-NA] T[TID] text
+\end{verbatim}
+    where
+    \begin{description}
+      \item[Timestamp] is the time of the occurrence.  By default, the timestamp uses millisecond granularity.
+      \item[LOGLEVEL] This is one of the log4j log levels: INFO, ERROR, DEBUG, etc.
+      \item[COMPONENT] This identifies the DUCC component emitting the message.  The components include
+        \begin{description}
+          \item[SM] Service Manager
+          \item[RM] Resource Manager
+          \item[PM] Process Manager
+          \item[OR] Orchestrator
+          \item[WS] Web Server
+          \item[Agent] Agent            
+          \item[JD] Job Driver
+          \item[JobProcessComponent] Job process, also known as JP
+        \end{description}
+      \item[sourceFileName] This is the name of the Java source file from which the message is emitted.
+      \item[method-name] This is the name of the method in {\em sourceFileName} which emitted the message.
+      \item[{J[Jobid-or-NA]}] This is the DUCC-assigned id of the work being processed, when relevant.  If the
+        message is not associated with work, this field shows ``N/A''.  Some logs (such as JP and JD logs)
+        pertain ONLY to a specific job and do not contain this field.
+      \item[{T[TID]}] This is the ID of the thread emitting the message.  Some logs (such as RM) do not use
+        this field, so it is omitted.
+      \item[text] This is the component-specific text content of the message.  Key messages are described
+        in detail in subsequent sections.
+    \end{description}
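+
+    As an illustration, a line in the default format can be broken apart with a simple
+    regular expression.  This is a sketch only, not part of DUCC: the exact timestamp
+    layout is determined by the log4j pattern in use (a two-token ``date time'' form is
+    assumed here), and, as noted above, the J[] and T[] fields may be absent.
+\begin{verbatim}
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+public class DuccLogLineExample {
+    // timestamp, LOGLEVEL, COMPONENT.sourceFileName, method-name,
+    // optional J[...] and T[...], then the free-form text.
+    static final Pattern LINE = Pattern.compile(
+        "^(\\S+\\s+\\S+)\\s+(\\S+)\\s+(\\w+)\\.(\\S+)\\s+(\\S+)" +
+        "(?:\\s+J\\[([^\\]]*)\\])?(?:\\s+T\\[(\\d+)\\])?\\s+(.*)$");
+
+    public static void main(String[] args) {
+        String line = "2015-06-10 19:04:25,123 INFO RM.NodePool connectShare"
+                    + " J[7483] T[22] share bluej290-12.461 assigned";
+        Matcher m = LINE.matcher(line);
+        if (m.matches()) {
+            System.out.println("component = " + m.group(3));   // RM
+            System.out.println("method    = " + m.group(5));   // connectShare
+            System.out.println("job       = " + m.group(6));   // 7483
+        }
+    }
+}
+\end{verbatim}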
 
 \section{Resource Manager Log (rm.log)}
 
@@ -70,7 +108,7 @@
     more detail below:
     \begin{itemize}
       \item Bootstrap configuration
-      \item Node arrival and missesd heartbeats
+      \item Node arrival and missed heartbeats
       \item Node occupancy
       \item Job arrival and status updates
       \item Calculation of job caps
@@ -84,7 +122,7 @@
 \subsection{Bootstrap Configuration}
   The RM summarizes its entire configuration when it starts up and prints it to the log to
   provide context for subsequent data and as verification that the RM is configured in the
-   way it was though to be.  All the following are fond in the bootstrap section and are mostly
+   way it was thought to be.  All the following are found in the bootstrap section and are mostly
    self-explanatory:
 
    \begin{itemize}
@@ -93,13 +131,68 @@
     \item A summary of all classes, one per line.  This is a more concise display and is similar to the
       DUCC Classes page in the web server.
     \item A listing of all RM configuration parameters and the environment including things such as the
-       version of JAVA, the operating system, etc.
+       version of Java, the operating system, etc.
     \item Nodepool occupancy.  As host names are parsed from the {\em ducc.nodes} files, the RM log
        shows exactly which nodepool each node is added to.
    \end{itemize}
    
   The RM logs can wrap quickly under high load, in which case this information is lost.
 
+   The following are key RM log lines to search for when examining or verifying RM
+   initialization.  (Part of the leaders on these messages are removed here to shorten the
+   lines for publication.)
+
+    \paragraph{Initial RM start}
+    The first logged line of any RM start will contain the string {\em Starting Component:  resourceManager}:
+\begin{verbatim}
+RM.ResourceManagerComponent- N/A boot  ... Starting Component:  resourceManager
+\end{verbatim}
+
+    \paragraph{RM Node and Class Configuration}
+    The first configuration lines show the reading and validation of the node and class configuration.  Look
+    for the string {\em printNodepool} to find these lines:
+\begin{verbatim}
+RM.Config- N/A printNodepool     Nodepool --default--
+RM.Config- N/A printNodepool        Search Order: 100
+RM.Config- N/A printNodepool        Node File: None
+RM.Config- N/A printNodepool                   <None>
+RM.Config- N/A printNodepool        Classes: background low normal high normal-all nightly-test reserve
+RM.Config- N/A printNodepool        Subpools: jobdriver power intel
+         ...
+\end{verbatim}
+
+   \paragraph{RM Scheduling Configuration}
+   Next the RM reads and configures its scheduling parameters and emits the information.  It also emits information
+   about its environment: the ActiveMQ broker, JVM information, OS information, DUCC version, etc.  To find
+   this, search for the string {\em init  Scheduler}.
+\begin{verbatim}
+ init  Scheduler running with share quantum           :  15  GB
+ init                         reserved DRAM           :  0  GB
+ init                         DRAM override           :  0  GB
+ init                         scheduler               :  org.apache.uima.ducc.rm.scheduler.NodepoolScheduler
+
+ init                         DUCC home               :  /home/challngr/ducc_runtime
+ init                         ActiveMQ URL            :  tcp://bluej537:61617?jms.useCompression=true
+ init                         JVM                     :  Oracle Corporation 1.7.0_45
+ init                         JAVA_HOME               :  /users1/challngr/jdk1.7.0_45/jre
+ init                         JVM Path                :  /users/challngr/jdk1.7.0_45/bin/java
+ init                         JMX URL                 :  service:jmx:rmi:///jndi/rmi://bluej537:2099/jmxrmi
+ init                         OS Architecture         :  amd64
+ init                         OS Name                 :  Linux
+ init                         DUCC Version            :  2.0.0-beta
+ init                         RM Version              :  2.0.0
+\end{verbatim}
+
+   \paragraph{RM Begins to Schedule}
+   The next lines will show the nodes checking in and which nodepools they are assigned to.  When the scheduler is
+   ready to accept Orchestrator requests you will see assignment of the JobDriver reservation.  At this point
+   RM is fully operational.  The confirmation of JobDriver assignment is similar to this:
+\begin{verbatim}
+Reserved:
+         ID    JobName       User      Class Shares Order QShares NTh Memory nQuest Ques Rem InitWait Max P/Nst
+R______7434 Job_Driver     System  JobDriver      1     1       1   0      2      0        0        0         1
+\end{verbatim}
+
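+    For example, assuming the default log location described in the Overview, the
+    bootstrap milestones above can be located with searches like these (an
+    illustrative session, not DUCC tooling):
+\begin{verbatim}
+    grep "Starting Component" $DUCC_HOME/logs/rm.log
+    grep "printNodepool"      $DUCC_HOME/logs/rm.log
+    grep "init  Scheduler"    $DUCC_HOME/logs/rm.log
+\end{verbatim}
+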
 \subsection{Node Arrival and Missed Heartbeats}
 \subsubsection{Node Arrival}
     As each node ``checks in'' with the RM a line is printed with details about the node.  Some fields
@@ -108,10 +201,9 @@
 
     A node arrival entry is of the form:
 \begin{verbatim}
-    [LOGHEADER] Nodepool: power Host added: power :  bluej290-18   shares  3 total    9:          
-                              bluej290-18     3             0             3  48128 <none>
+LOGHEADER Nodepool: power Host added: power :  bluej290-18   shares  3 total    9:   48128 <none>
 \end{verbatim}
-    where the fields mean (f the field isn't described here, the value is not relevent to node arrival):
+    where the fields mean (if a field isn't described here, its value is not relevant to node arrival):
     \begin{description}
       \item[LOGHEADER] is the log entry header as described above.
       \item[Nodepool:power] The node is added to the ``power'' nodepool
@@ -122,9 +214,9 @@
     \end{description}
 
 \subsubsection{Missed Heartbeats}
-    The DUCC Agents send out regular ``heartbeat'' meessages with current node statistics. These
+    The DUCC Agents send out regular ``heartbeat'' messages with current node statistics. These
     messages are used by RM to determine if a node has failed.  If a heartbeat does not arrive
-    at the specified time this is noted in the log as a {\em missing} heartbeat. If a specific (configurable) number
+    at the specified time this is noted in the log as a {\em missing heartbeat}. If a specific (configurable) number
     of consecutive heartbeats is missed, the RM marks the node offline and instructs the
     DUCC Orchestrator to purge the shares so they can be rescheduled.
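+
+    The policy can be pictured with a small sketch (ours, not Agent or RM source; the
+    threshold stands in for the configurable consecutive-miss limit in ducc.properties):
+\begin{verbatim}
+public class HeartbeatWatchSketch {
+    final int maxMissed;          // configured consecutive-miss limit
+    int consecutiveMissed = 0;
+
+    HeartbeatWatchSketch(int maxMissed) { this.maxMissed = maxMissed; }
+
+    // Called once per heartbeat interval; returns true when the node
+    // should be marked offline and its shares purged.
+    boolean tick(boolean heartbeatArrived) {
+        if (heartbeatArrived) {
+            consecutiveMissed = 0;              // any arrival resets the count
+            return false;
+        }
+        consecutiveMissed++;                    // one "missing heartbeat" logged
+        return consecutiveMissed >= maxMissed;
+    }
+
+    public static void main(String[] args) {
+        HeartbeatWatchSketch w = new HeartbeatWatchSketch(3);
+        boolean[] arrivals = { true, false, false, false };
+        for (boolean a : arrivals)
+            System.out.println("offline? " + w.tick(a));  // false false false true
+    }
+}
+\end{verbatim}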
 
@@ -159,7 +251,13 @@
     [LOGHEADER] Machine occupancy before schedule
 \end{verbatim}
 
-    Sample node occupancy follows.  The header is included in the log.
+    NOTE: The current node occupancy can be queried interactively with the 
+    \hyperref[subsec:admin.rm-qoccupancy]{rm\_qoccupancy} command:
+\begin{verbatim}
+    DUCC_HOME/admin/rm_qoccupancy
+\end{verbatim}
+
+    Sample node occupancy as displayed in the log follows.  The header is included in the log.
 \begin{verbatim}
            Name Order Active Shares Unused Shares Memory (MB) Jobs
 --------------- ----- ------------- ------------- ----------- ------ ...
@@ -182,30 +280,428 @@ f6n10.bluej.net    16             3
     The meaning of each column is:
     \begin{description}
       \item[Name] The host name.
-      \item[Order] This is the share order of the node.  The number represents the number of shares
+      \item[Order] This is the share order of the node.  The number represents the number of quantum shares
         that can be scheduled on this node. (Recall that an actual process may and usually does
-        occupy multiple shares.)
-      \item[Active Shares] This is the number of the shares on the node which are scheduled
+        occupy multiple quantum shares.)
+      \item[Active Shares] This is the number of quantum shares on the node which are scheduled
         for work.
-      \item[Unused Shares] This is the number of shares available for new work.
+      \item[Unused Shares] This is the number of quantum shares available for new work.
       \item[Memory] This is the real memory capacity of the node, as reported by the node's
         Agent process.
       \item[Jobs] Each entry here is the DUCC-assigned id of a job with a process assigned to
         this node.  Each entry corresponds to one process.  If an ID appears more than 
         once the job has more than one process assigned to the node; see for example, the
-        node {\bf f6n10.bluej.net} with multiple entries for job {\em 206693}.
+        node {\bf f1n3.bluej.net} with multiple entries for job {\em 206693}.
 
         When no work is assigned to the node, the string {\bf $<$none$>$} is displayed.  
         
         When there is a number in brackets, e.g. {\bf [13]} for node {\bf f7n1.bluej.net}, the
-        number represents the number of shares available to be shcheduled on the node.
+        number represents the number of quantum shares available to be scheduled on the node.
     \end{description}
-    
-    The current node occupancy can be queried interactively with the command:
+
+\subsection{Job Arrival and Status Updates}
+ 
+   \paragraph{Orchestrator State Arrival}
+
+     On a regular basis the Orchestrator publishes the full state of
+     work which may require resources.  This is the prime input to the
+     RM's scheduler and must arrive regularly.  Every arrival
+     of an Orchestrator publication is flagged in the log as follows.
+     If these aren't observed every
+     \hyperref[itm:props-or.state.publish.rate]{Orchestrator publish
+       interval}, something is wrong; most likely the Orchestrator or
+     the ActiveMQ broker has a problem.
+
 \begin{verbatim}
-    DUCC_HOME/admin/rm_qoccupancy --console
+RM.ResourceManagerComponent- N/A onJobManagerStateUpdate  -------> OR state arrives
 \end{verbatim}
-    
+        
+    \paragraph{Job State}
+    Immediately after the OR state arrival is logged, the state of all work needing scheduling
+    is logged.  This is tracked by the {\em JobManagerConverter} module in the
+    RM and is logged similar to the following.  It shows the state of each bit of work
+    of interest and, if that state has changed since the last publication, what the new state is.
+\begin{verbatim}
+   ...
+RM.JobManagerConverter- 7433 eventArrives  Received non-schedulable job, state =  Completed
+RM.JobManagerConverter- 7434 eventArrives  [SPR] State:  WaitingForResources -> Assigned
+   ...
+\end{verbatim}
+
+\subsection{Calculation Of Job Caps}
+   Prior to every schedule, and immediately after receipt of the Orchestrator state,
+   the RM examines every piece of work and calculates the maximum level of resources the
+   job can physically use at the moment.  This handles the {\em expand-by-doubling}
+   function, the {\em prediction} function, and accounts for the amount of work left
+   relative to the resources the work already possesses.
+
+   The curious or intrepid can see the code that implements this in {\em RmJob.java} method
+   {\em initJobCap()}.
+
+   The calculation is done in two steps:
+   \begin{enumerate}
+     \item Calculate the projected cap.  This uses the prediction logic and the amount of
+       work remaining to calculate the {\em largest} number of resources the job could use
+       if the system had unlimited resources.  This is an upper bound on the actual
+       resources assigned to the job.
+     \item Adjust the cap down using expand-by-doubling and the initialization state of
+       the work.  The result of this step is always a number {\em smaller than or equal to}
+       the projected cap.
+   \end{enumerate}
+   The goal of this step is to calculate the largest number of resources the job can
+   actually use at the moment.  The FAIR\_SHARE calculations may further revise this
+   down, but will never revise it up.
+
+   If there is no data yet on the initialization state of work, the projected cap cannot
+   be calculated and a line such as the following is emitted:
+\begin{verbatim}
+RM.RmJob - 7483 getPrjCap  Hilaria Cannot predict cap: init_wait true || time_per_item 0.0
+\end{verbatim}
+
+   If the job has completed initialization, the projected cap is calculated based on
+   the average initialization time of all the job processes and the current rate of
+   work-item completion.  A line such as this is emitted:
+\begin{verbatim}
+RM.RmJob- 7483 Hilaria O 2 T 58626 NTh 28 TI 18626 TR 12469.0 R 2.2456e-03 QR 1868 P 132 F 1736 \
+       ST 1433260775524 return 434
+\end{verbatim}
+   In this particular line:
+      \begin{description}
+        \item[7483] is the job id.
+        \item[Hilaria] is the job's owner (userid).
+        \item[O 2] indicates an {\em order 2} job: each process will occupy two quantum shares.
+        \item[T 58626] is the smallest number of milliseconds until a new process for this job
+          can be made runnable, based on the average initialization time for processes in
+          this job, the Orchestrator publish rate, and the
+          \hyperref[itm:props-rm.prediction.fudge]{\em RM prediction fudge}.
+        \item[NTh] This is the number of threads currently executing for this job.  It is
+          calculated as the (number of currently allocated processes) * (the number of threads
+          per process).
+        \item[TI] This is the average initialization time in milliseconds for processes in this job.
+        \item[TR] This is the average execution time in milliseconds for work items in this job.
+        \item[R] This is the current rate at which the job is completing work items,
+          calculated as (NTh / TR).
+        \item[QR] This is the number of work items (questions) remaining to be executed.
+        \item[P] This is the projected number of questions that can be completed
+          in the time from ``now'' until a new process can be started and initialized
+          (in this case 58626 milliseconds from now, see above), with the currently
+          allocated resources, calculated as (T * R).
+        \item[F] This is the number of questions that will remain unanswered at the
+          end of the target (T) period, calculated as (QR - P).
+        \item[ST] This is the time the job was submitted.
+        \item[return] This is the projected cap, the largest number of processes this
+          job can physically use, calculated as (F / threads-per-process).
+          
+          If the returned projected cap is 0, it is adjusted up to the number of
+          processes currently allocated.
+      \end{description}
+          
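+   To make the arithmetic concrete, the following small sketch (ours, not DUCC code;
+   the threads-per-process value of 4 is inferred from NTh 28 with 7 allocated
+   processes) reproduces the numbers in the sample line above:
+\begin{verbatim}
+public class ProjectedCapExample {
+    public static void main(String[] args) {
+        int    nTh = 28;        // NTh: currently executing threads
+        double tr  = 12469.0;   // TR: average work-item execution time, ms
+        long   t   = 58626;     // T: ms until a new process could be runnable
+        int    qr  = 1868;      // QR: questions (work items) remaining
+        int    threadsPerProcess = 4;
+
+        double r   = nTh / tr;                 // R = 2.2456e-03 items/ms
+        long   p   = Math.round(t * r);        // P = 132 questions done by T
+        long   f   = qr - p;                   // F = 1736 questions left at T
+        long   cap = f / threadsPerProcess;    // return = 434 processes
+
+        System.out.printf("R=%.4e P=%d F=%d cap=%d%n", r, p, f, cap);
+    }
+}
+\end{verbatim}
+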
+   Once the projected cap is calculated, a final check is made to avoid several problems:
+   \begin{itemize}
+     \item Preemption of processes that contain active work but are not using all their
+       threads.  This occurs when a job is ``winding down'' and may have more
+       processes than it technically needs, but all processes still are performing work.
+     \item The job may have declared a maximum number of processes to allocate, which is
+       less than the number it could otherwise be awarded.
+     \item If prediction is being used, revise the estimate down to the smaller
+       of the projected cap and the resources currently allocated.
+     \item If initialization caps are being applied and no process in the job has
+       successfully initialized, revise the estimate down to the initialization cap.
+     \item If expand-by-doubling is being used, potentially revise the estimate down
+       to no more than double the currently allocated processes.
+   \end{itemize}
+
+   The final cap is emitted in a line such as:
+\begin{verbatim}
+RM.RmJob- 7483 initJobCap Hilaria O 2 Base cap: 7 Expected future cap: 434 potential cap 7 actual cap 7
+\end{verbatim}
+    In this line:
+    \begin{description}
+      \item[7483] is the job id.
+      \item[Hilaria] is the job's user name.
+      \item[O 2] indicates this job uses two quantum shares per process.
+      \item[Base cap:] This is an upper bound on the number of processes
+        that can be used in a perfect world.  It is calculated by 
+        dividing the number of questions by the number of threads per
+        process.  It is then revised down by the declared max-processes
+        in the job.  In the example above, the job declared
+        max-processes of 7.
+      \item[Expected future cap] This is the projected cap, described above.
+      \item[Potential cap] This is the base cap, possibly revised downward
+        by the future cap, if it is projected that fewer processes
+        would be useful.
+      \item[Actual cap] This is the assigned maximum processes to be
+        scheduled for this job, possibly adjusted based on the
+        initialization status of the job and the expand-by-doubling policy.
+    \end{description}
+
+    The {\em actual cap} is the one used to calculate the job's FAIR\_SHARE and
+    is always the largest number of processes usable in a perfect world.  Note
+    that the FAIR\_SHARE calculation may result in further reduction of this
+    number.
+
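+    The same numbers can be traced through the final-cap line (again an illustrative
+    sketch, not DUCC code; the integer-division details are assumptions):
+\begin{verbatim}
+public class FinalCapExample {
+    public static void main(String[] args) {
+        int questions         = 1868;   // QR from the projected-cap line
+        int threadsPerProcess = 4;      // as in the previous sketch
+        int declaredMaxProcs  = 7;      // job-declared max-processes
+        int futureCap         = 434;    // projected cap from the previous step
+
+        // Base cap: questions / threads-per-process, revised down by max-processes.
+        int perfectWorld = (questions + threadsPerProcess - 1) / threadsPerProcess;
+        int baseCap      = Math.min(perfectWorld, declaredMaxProcs);  // min(467, 7) = 7
+
+        // Potential cap: base cap, possibly revised down by the future cap.
+        int potentialCap = Math.min(baseCap, futureCap);              // min(7, 434) = 7
+
+        System.out.println("base=" + baseCap + " potential=" + potentialCap);
+    }
+}
+\end{verbatim}
+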
+\subsection{The ``how much'' calculations}
+   The RM log includes a section that details the fair-share calculations.  The details
+   of this are rather involved and out of scope for this section.  Interested parties
+   are welcome to read the scheduler source, in the file {\em NodepoolScheduler.java},
+   methods {\em countClassShares, countJobShares, countUserShares}, and {\em apportion\_qshares}.
+
+   The logs reveal the inputs to each of the methods above.  The overall logic is as follows
+   and can be followed in the logs.
+
+   \begin{itemize}
+     \item All job classes of equal priority are bundled together and handed to the
+       {\em countClassShares} method.  This method assigns some number of shares
+       to each class based on the weighted fair-share logic, using the configured
+       class weights.  The start of this can be seen under log lines similar to this:
+\begin{verbatim}
+INFO RM.NodepoolScheduler- N/A apportion_qshares  countClassShares RmCounter Start
+\end{verbatim}
+
+     \item All users for each class are passed to the {\em countUserShares} method
+       and then assigned some number of shares from the
+       pool of shares assigned to the class, again
+       using the fair-share computations, but with equal weights.
+       The start of this can be seen under log lines similar to this:
+\begin{verbatim}
+INFO RM.NodepoolScheduler- N/A apportion_qshares  countUserShares RmCounter Start
+\end{verbatim}
+
+     \item All jobs for each user are passed to the {\em countJobShares} method
+       and assigned some number of shares from the pool assigned to the user, using
+       the fair-share calculator with equal weights.
+       The start of this can be seen under log lines similar to this:
+\begin{verbatim}
+INFO RM.NodepoolScheduler- N/A apportion_qshares  countJobShares RmCounter Start
+\end{verbatim}
+
+     \item The method {\em apportion\_qshares} is the common fair-share
+       calculator, used by the three routines above.
+   \end{itemize}
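+
+   The following sketch (ours, not the {\em apportion\_qshares} implementation, which
+   also honors caps and nodepool constraints) shows the flavor of a weighted
+   apportionment using the largest-remainder method:
+\begin{verbatim}
+import java.util.LinkedHashMap;
+import java.util.Map;
+
+public class FairShareSketch {
+    // Apportion 'total' quantum shares among claimants in proportion
+    // to their weights, using the largest-remainder method.
+    static Map<String, Integer> apportion(Map<String, Double> weight, int total) {
+        double sum = weight.values().stream().mapToDouble(Double::doubleValue).sum();
+        Map<String, Integer> out = new LinkedHashMap<>();
+        Map<String, Double> rem = new LinkedHashMap<>();
+        int given = 0;
+        for (Map.Entry<String, Double> e : weight.entrySet()) {
+            double exact = total * e.getValue() / sum;
+            int whole = (int) exact;            // integral entitlement
+            out.put(e.getKey(), whole);
+            rem.put(e.getKey(), exact - whole); // fractional remainder
+            given += whole;
+        }
+        // Hand leftover shares to the largest fractional remainders.
+        for (int i = given; i < total; i++) {
+            String best = rem.entrySet().stream()
+                .max(Map.Entry.comparingByValue()).get().getKey();
+            out.merge(best, 1, Integer::sum);
+            rem.put(best, -1.0);                // each claimant wins at most once
+        }
+        return out;
+    }
+
+    public static void main(String[] args) {
+        Map<String, Double> w = new LinkedHashMap<>();
+        w.put("high", 3.0); w.put("normal", 2.0); w.put("low", 1.0);
+        System.out.println(apportion(w, 100));  // {high=50, normal=33, low=17}
+    }
+}
+\end{verbatim}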
+
+\subsection{The ``what of'' calculations}
+    These calculations are also too involved to discuss in detail in this section.
+
+    Interested parties may look in {\em NodepoolScheduler.java}, method
+    {\em whatOfFairShare}, and {\em NodePool.java} method {\em traverseNodepoolsForExpansion}
+    to see details.
+
+    The logs track the general flow through the methods above and generally contain
+    enough information to diagnose problems should they arise.
+
+    The key log message here, other than those sketching logic flow, shows the
+    assignment of specific processes to jobs, as seen below.
+
+\begin{verbatim}
+RM.NodePool- 7483 connectShare  share bluej290-12.461 order 2 machine \
+                  bluej290-12        false     2             0             2       31744 <none>
+\end{verbatim}
+    This shows job {\em 7483} being assigned a process on host {\em bluej290-12} as
+    RM share id {\em 461}, which consists of {\em 2 quantum shares} (order 2).  Host
+    bluej290-12 is a 32GB machine with {\em 31744} MB of usable, schedulable memory.
+
+\subsection{Defragmentation}
+    The RM considers the system's memory pool to be fragmented if the counted
+    resources from the ``how much'' phase of scheduling cannot be fully
+    mapped to real physical resources in the ``what of'' phase.  In short, the
+    ``how much'' phase assumes an ideal, unfragmented virtual cluster.  The ``what of''
+    phase may be unable to make the necessary physical assignments without excessive
+    preemption of jobs that are legitimately at or below their fair share allocations.
+
+    Intuitively, the ``how much'' phase guarantees that if you could do unlimited
+    shuffling around of the allocated resources, everything would ``fit''.  The
+    system is considered fragmented if such shuffling is actually needed.  The
+    defragmentation process attempts that shuffling, under the constraint of
+    interrupting the smallest amount of productive work possible.
+
+    One scheduling goal, however, is to attempt to guarantee every job gets
+    at least some minimal number of its fairly-counted processes.  This minimal number
+    is called the
+    \hyperref[itm:props-rm.defragmentation.threshold]{defragmentation threshold} and
+    is configured in ducc.properties.  This threshold is used to rigorously define the
+    ``smallest amount of productive work'' as used in the previous paragraph.
+    The defragmentation threshold is used in
+    two ways:
+
+    \begin{enumerate}
+      \item Attempt to get every work request allocated resources at least up
+        to the level of the defragmentation threshold.
+      \item Never steal resources beyond the defragmentation threshold during
+        the ``take from the rich'' phase of defragmentation, described below.
+    \end{enumerate}
+    To accomplish this, a final stage, ``defragmentation'', is
+    performed before publishing the new schedule to the Orchestrator
+    for deployment.
+
+    Defragmentation consists of several steps.  The details are again involved,
+    but an understanding of the logic will make following the log relatively
+    straightforward.
+    \begin{itemize}
+      \item Examine every job and determine whether it was assigned
+        all the processes from the ``how much'' phase.  If not, it is marked
+        as POTENTIALLY NEEDY.
+        
+        This step is logged with the tag {\em detectFragmentation}.
+
+      \item Examine every POTENTIALLY NEEDY job to determine if there are
+        sufficient preemptions pending such that the ``how much'' phase will be able
+        to complete as soon as the preemptions complete.  If not, the
+        job is marked ACTUALLY NEEDY.
+
+        This step is also logged with the tag {\em detectFragmentation}.
+
+      \item For every job marked ACTUALLY NEEDY, examine all jobs in the
+        system already assigned shares to determine which ones can
+        donate some resources to the ACTUALLY NEEDY jobs.  These are typically
+        jobs with more processes than their FAIR SHARE, but which, in a
+        perfect, unfragmented layout, would be allocated more resources.  These
+        jobs are called {\em rich} jobs.
+        
+        This step is logged with the tags {\em insureFullEviction} and
+        {\em doFinalEvictions}.
+
+      \item Attempt to match allocations from ``rich'' jobs with jobs that
+        are ACTUALLY NEEDY.  If the ACTUALLY NEEDY job is able to use
+        one of the ``rich job'' allocations, the allocation is scheduled for
+        preemption.  (Note there are many reasons that a rich job may not
+        have appropriate resources to donate: mismatched nodepool, physical
+        host too small, not preemptable, etc.).
+        
+        This step is logged with the tag {\em takeFromTheRich}. If this
+        step has any successes, the log will also show lines with the
+        tags {\em clearShare} and {\em shrinkByOne} as the resources
+        are scheduled for reuse.
+
+        \item The needy job is placed in a list of jobs which are given the
+          highest priority for assignment of new processes, at the start of each
+          subsequent scheduling cycle, until such time
+          as they are no longer needy.
+
+          This step is logged with the tag {\em Expand needy}.
+    \end{itemize}
+
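+    The first two detection steps above can be sketched as follows (illustrative only,
+    not DUCC source; all names here are invented):
+\begin{verbatim}
+import java.util.List;
+
+public class DetectFragmentationSketch {
+    enum Need { NONE, POTENTIALLY_NEEDY, ACTUALLY_NEEDY }
+
+    record Job(String id, int counted, int mapped, int pendingPreemptions) {}
+
+    static Need classify(Job j) {
+        if (j.mapped() >= j.counted()) return Need.NONE;
+        if (j.mapped() + j.pendingPreemptions() >= j.counted())
+            return Need.POTENTIALLY_NEEDY; // gap closes when preemptions finish
+        return Need.ACTUALLY_NEEDY;        // defragmentation must find donors
+    }
+
+    public static void main(String[] args) {
+        List<Job> jobs = List.of(
+            new Job("7485", 7, 7, 0),      // fully mapped
+            new Job("7486", 93, 80, 13),   // pending preemptions cover the gap
+            new Job("7487", 5, 1, 0));     // needs donors
+        jobs.forEach(j -> System.out.println(j.id() + " -> " + classify(j)));
+    }
+}
+\end{verbatim}
+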
+    Those who wish to see the details of defragmentation can find them in
+    {\em NodepoolScheduler.java}, starting with the method {\em detectFragmentation}
+    and tracing the flows from there.
+
+\subsection{Published Schedule}
+
+   The schedule gets printed to the log twice on every scheduling cycle.  The first
+   form is a pretty-printed summary of all known jobs, showing which ones are
+   getting more resources, {\em expanding}, those which are losing resources,
+   {\em shrinking}, and those which are not changing, {\em stable}.
+
+   The second form is a {\em toString()} of the structure sent to the Orchestrator,
+   showing the exact resources currently assigned, added, or lost this cycle.
+
+   \paragraph{The pretty-printed schedule}
+      This entry is divided into five sections.  Each section contains one line for
+      each relevant job, with largely self-explanatory headers.  An example follows
+      (wrapped here so it fits within a printed page):
+\begin{verbatim}
+         ID                        JobName       User      Class   Shares Order QShares NTh Memory  \
+J______7485 mega-15-min/jobs/mega-2.job[DD   Tanaquil nightly-test      7     2      14   4     24  \
+J______7486 mega-15-min/jobs/mega-3.job[DD    Rodrigo normal-all       93     2     186   4     28  \
+
+              nQuest Ques Rem InitWait Max P/Nst
+               11510    11495    false         7
+               14768    14764    false        93
+
+\end{verbatim}
+     Here,
+
+     \begin{description}
+       \item[ID] is the unique DUCC ID of the work, prefixed with an indication of what kind of
+         work it is: Job (J), a Service (S), a Reservation (R), or Managed Reservation (M).
+       \item[JobName] is the user-supplied name / description of the job.
+       \item[User] is the owner of the work.
+       \item[Class] is the scheduling class used to schedule the work.
+       \item[Shares] is the number of allocations awarded, which might be processes, or simply reserved space.  It
+         is a human-readable convenience, calculated as (QShares / Order).
+       \item[Order] is the number of share quanta per allocation.
+       \item[QShares] is the total quantum shares awarded to the work.
+       \item[NTh] is the declared number of threads per process.
+       \item[Memory] is the amount of memory in GB for each allocation.
+       \item[nQuest] is the number of work items (questions) for the job, where relevant.
+       \item[Ques Rem] is the number of work items not yet completed.
+       \item[InitWait] is either {\em true} or {\em false}, indicating whether at least one process
+         has successfully completed initialization.
+       \item[Max P/Nst] is the job-declared maximum processes / instances for the job.
+     \end{description}
+
+     The five subsections of this log section are:
+     \begin{description}
+       \item[Expanded] This is the list of all work that is receiving more resources this cycle.
+       \item[Shrunken] This is the list of work that is losing resources this cycle.
+       \item[Stable] This is the list of work whose assigned resources do not change this cycle.
+       \item[Dormant] This is the list of work that is unable to receive any resources this cycle.
+       \item[Reserved] This is the list of reservations.
+     \end{description}
+
+   \paragraph{The Orchestrator Structure}
+      This is a list containing up to four lines per scheduled work.  
+      
+      The specific resources shown here are formatted thus:
+\begin{verbatim}
+    hostname.RM share id^Initialization time
+\end{verbatim}
+      The {\em hostname} is the name of the host where the resource is assigned.  The {\em RM share id}
+      is the unique (to RM only) id of the share assigned to this resource.  The {\em Initialization time}
+      is the amount of time spent by the process residing within this resource in its initialization phase.
+      
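+      A resource token can be unpacked mechanically.  A small sketch (ours; the token
+      layout is taken from the format line above):
+\begin{verbatim}
+public class ResourceTokenExample {
+    public static void main(String[] args) {
+        String token = "bluej290-11.465^19430";        // host.shareId^initMs
+        int caret = token.indexOf('^');
+        long initMs = Long.parseLong(token.substring(caret + 1));        // 19430
+        String hostShare = token.substring(0, caret);
+        int dot = hostShare.lastIndexOf('.');          // last '.' splits share id
+        String host = hostShare.substring(0, dot);                // bluej290-11
+        int shareId = Integer.parseInt(hostShare.substring(dot + 1));    // 465
+        System.out.println(host + " share " + shareId + " init " + initMs + "ms");
+    }
+}
+\end{verbatim}
+      Note the {\em lastIndexOf} on the dot: host names themselves may contain dots
+      (e.g. f6n10.bluej.net), so the share id is the token after the final dot.
+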
+      The lines are:
+      \begin{enumerate}
+        \item The type of work and its DUCC ID, for example: 
+\begin{verbatim}
+ Reservation 7438
+\end{verbatim}
+        \item The complete set of all resources currently assigned to the work, for example:
+\begin{verbatim}
+Existing[1]: bluej537-7-73.1^0
+\end{verbatim}
+          The resources here include all resources the RM tracks as being owned by the job, including
+          older resources, newly assigned resources, and resources scheduled for eviction.  The specific
+          resources which are being added or removed are shown in the next lines.
+          
+        \item The complete set of resources the RM has scheduled for eviction, but which are not
+          yet confirmed freed.  For example, we see 7 resources which have been evicted:
+\begin{verbatim}
+ Removals[7]: bluej290-11.465^19430 bluej290-12.461^11802 bluej290-4.460^12672 bluej290-5.464^23004 
+              bluej290-2.467^22909 bluej290-7.463^20636 bluej290-6.466^19931 
+\end{verbatim}
+
+        \item The complete set of resources which are being added to the work in this cycle.  For
+          example:
+
+\begin{verbatim}
+ Additions[4]: bluej291-43.560^0 bluej291-42.543^0 bluej290-23.544^0 bluej291-44.559^0 
+\end{verbatim}
+      \end{enumerate}
+         
+      In most cases, if resources cannot be awarded, this section also shows the reason
+      string which is published for the benefit of the web server and the Orchestrator's job monitor:
+\begin{verbatim}
+ Job         7487 Waiting for defragmentation.
+       Existing[0]: 
+       Additions[0]: 
+       Removals[0]: 
+\end{verbatim}
+
+     In some cases, it is possible that a job will show BOTH Additions and Removals.  This usually
+     occurs as a result of the defragmentation step.  The job will have been found in need of
+     new resources during the initial fair-share computation, but later, during defragmentation,
+     it is also found to be a ``rich'' job which must donate resources to under-allocated work.
+     Not all the processes belonging to the ``rich'' job may be appropriate for the poor job,
+     in which case the job is allowed to expand even as it donates some processes to the
+     under-allocated work.
+
+     This can also occur if resources were previously preempted and, for some reason, the
+     preemption is taking a long time.  Since then other resources have become free and
+     the job can now re-expand.  It is not possible to reverse a preemption (because the actual
+     state of the preemption is not knowable) so both expansion and shrinkage can be
+     in progress for the same job.
+
 \section{Service  Manager Log (sm.log)}
     To be filled in.
 

