[Pig Wiki] Update of "PigExecutionModel" by AlanGates

Apache Wiki Mon, 11 Feb 2008 10:35:23 -0800


The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigExecutionModel

------------------------------------------------------------------------------
     * Stream Programming Model / MIT Stream-It
        * official page for stream-it: http://www.cag.csail.mit.edu/streamit/ 
(Articles on the compiler might be useful)
  
- == Physical Plan Structure ==
+ == Logical and Physical Plans ==
  
   1. A criterion we are adopting in the redesign of the logical and physical 
layer in Pig is to promote what used to be EvalSpec's and Cond's to 
operators.
   1. Such an approach provides: 1.) a clearer definition of the language; 2.) 
better identification of opportunities for various forms of optimization
@@ -154, +154 @@

   1. These are the following exceptions:
      1. Logical Cogroup to be translated into a Physical LocalRearrange and 
GlobalRearrange.
      2. Chris mentioned that even Algebraic Functions are exceptions to this.
+ 
+ === Logical Plan ===
+ 
+ The logical plan will consist of a directed acyclic graph (DAG) with logical
+ operators as the nodes, and data flow between the operators as the edges.
+ 
+ The focus of the logical operators is post parse stage checking (such as type
+ checking), optimization, and translation to a physical plan.  The logical
+ operators will contain information that facilitates these objectives.  
+ 
+ The classes used to model the logical plan are listed below, each with a
+ description.  Interfaces are given for the major classes; where a set of
+ classes is similar, one representative interface is shown as an example.
+ 
+ ==== LogicalPlan ====
+ 
+ The class !LogicalPlan will contain the collection of logical operators.  It
+ will not contain the edges between the operators.  To see why, consider the
+ following simple Pig script:
+ 
+ {{{
+ a = load 'myfile';
+ b = filter a by $0 > 5;
+ store b into 'myfilteredfile';
+ }}}
+ 
+ This will generate a logical plan that looks something like this:
+ 
+ attachment:SimpleLP.jpg
+ 
+ Notice that the graph edges represent data flow between relational operators
+ (LOAD, FILTER, STORE, PROJECT) and between expression operators (greater than,
+ CONSTANT), and between the two types.  This means that edges in the graph have
+ meaning beyond simply input and output for a node.  For example, the filter
+ node has two inputs, the load node and its condition (in this case, the
+ greater than node).  These inputs however have different semantics, as tuples
+ coming from the load input are evaluated based on the boolean result coming
+ from the conditional input and then possibly passed on to the store node.
+ Given the differing semantics of different inputs, it seems better to encode
+ the edges of the graph in the logical operators themselves, rather than in a
+ generic graph container object.
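
The distinction between data flow inputs and contextual inputs can be sketched in Java as below. This is only an illustration: the `Sketch*` classes are hypothetical stand-ins invented for the example, not the real Pig classes, and the filter's condition is reduced to a single operator reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the real Pig classes.
abstract class SketchOp {
    // Data-flow inputs: operators whose output tuples feed this one.
    final List<SketchOp> inputs = new ArrayList<>();

    abstract String name();
}

class SketchLoad extends SketchOp {
    String name() { return "LOAD"; }
}

class SketchGreaterThan extends SketchOp {
    String name() { return "GREATER_THAN"; }
}

class SketchFilter extends SketchOp {
    // Contextual input: yields a boolean per tuple, not a stream of tuples,
    // so it is held by the operator itself rather than in the generic list.
    private final SketchOp condition;

    SketchFilter(SketchOp dataInput, SketchOp condition) {
        inputs.add(dataInput);
        this.condition = condition;
    }

    SketchOp getCondition() { return condition; }

    String name() { return "FILTER"; }
}

class EdgeSketchDemo {
    public static void main(String[] args) {
        SketchLoad load = new SketchLoad();
        SketchFilter filter = new SketchFilter(load, new SketchGreaterThan());
        // The generic input list holds only the load operator.
        System.out.println(filter.inputs.get(0).name());
        // The condition is reachable only through the filter-specific accessor.
        System.out.println(filter.getCondition().name());
    }
}
```

Code that only cares about data flow can walk `inputs` without knowing about filters, while code that understands filters can also follow `getCondition()`.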
+ 
+ In addition to containing the collection of logical operators, !LogicalPlan
+ will provide methods for callers to insert logical operators into the graph,
+ and connect the inputs and outputs of operators.  These connections will only
+ be for data flow inputs and outputs, not contextual inputs such as the
+ condition on a filter node.  This strains the model described above somewhat,
+ but it allows for generic manipulation of inputs and outputs of the operators
+ without every visitor to the tree needing to understand all the different
+ operator types.
+ 
+ The interface for !LogicalPlan is:
+ {{{
+ public class LogicalPlan {
+       private static final long serialVersionUID = 2L;
+ 
+       protected PigContext mContext = null;
+ 
+     protected Map<LogicalOperator, OperatorKey> mOps;
+     protected Map<OperatorKey, LogicalOperator> mKeys;
+ 
+     private List<LogicalOperator> mRoots;
+ 
+       
+       public LogicalPlan(PigContext pigContext) {
+               ...
+       }
+ 
+     public List<LogicalOperator> getRoots() {
+         ...
+     }
+ 
+       public PigContext getPigContext() {
+               return mContext;
+       }
+ 
+     public byte getOutputType() {
+         // Assumes the plan has a single root operator.
+         return mRoots.get(0).getType();
+     }
+ 
+     /**
+      * Given an operator, find its OperatorKey.
+      * @param op Logical operator.
+      * @return associated OperatorKey
+      */
+     public OperatorKey getOperatorKey(LogicalOperator op) {
+         return mOps.get(op);
+     }
+ 
+     /**
+      * Given an OperatorKey, find the associated operator.
+      * @param opKey OperatorKey
+      * @return associated operator.
+      */
+     public LogicalOperator getOperator(OperatorKey opKey) {
+         return mKeys.get(opKey);
+     }
+ 
+     /**
+      * Insert an operator into the plan.  This only inserts it as a node in
+      * the graph, it does not connect it to any other operators.  That should
+      * be done as a separate step using makeInput or addInput.
+      * @param op Logical Operator to add to the plan.
+      */
+     public void add(LogicalOperator op) {
+               ...
+     }
+ 
+     /**
+      * Make one operator the <b>sole</b> input of another.  If inputOf
+      * already has an input, that existing input will become the input of
+      * op.  For example, if the plan currently contains three nodes a, b,
+      * and c, where a is c's input, and makeInput(b, c) is called, then a
+      * will become b's input and b will become c's input.
+      * @param op Operator to make input of another operator.
+      * @param inputOf Operator to make op an input of.
+      * @throws IOException if op or inputOf are not in the plan.
+      */
+     public void makeInput(LogicalOperator op,
+                           LogicalOperator inputOf) throws IOException {
+         ...
+     }
+ 
+     /**
+      * Make one operator an <b>additional</b> input of another.  This can only
+      * legally be called on operators that can have multiple inputs, such as
+      * Cogroup, Generate, or BinaryExpression.
+      * @param op Operator to make input of another operator.
+      * @param inputOf Operator to make op an input of.
+      * @throws IOException if op or inputOf are not in the plan.
+      */
+     public void addInput(LogicalOperator op,
+                          LogicalOperator inputOf) throws IOException {
+         ...
+     }
+ 
+     /**
+      * Remove an operator from the plan.  Connections in the graph will be
+      * reconnected after the operator is removed.  So if a is b's input and b
+      * is c's input, and b is removed, then a will become c's input.
+      * @param op Operator to remove.
+      * @throws IOException if op is not in the plan.
+      */
+     public void remove(LogicalOperator op) throws IOException {
+         ...
+     }
+ }
+ 
+ }}}
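
The re-linking behavior described in the comments for makeInput, addInput, and remove can be sketched as follows. This is a minimal illustration under stated assumptions, not the !LogicalPlan implementation: `Node` and `MiniPlan` are hypothetical names, nodes are plain named objects, and `IllegalArgumentException` is used in place of the `IOException` in the interface above to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical node type for illustration; not Pig's LogicalOperator.
class Node {
    final String name;
    final List<Node> inputs = new ArrayList<>();
    final List<Node> outputs = new ArrayList<>();

    Node(String name) { this.name = name; }
}

// Hypothetical plan container sketching the documented semantics.
class MiniPlan {
    private final List<Node> nodes = new ArrayList<>();

    void add(Node n) { nodes.add(n); }

    private void requireInPlan(Node n) {
        if (!nodes.contains(n)) {
            throw new IllegalArgumentException(n.name + " is not in the plan");
        }
    }

    // Make op an additional input of inputOf.
    void addInput(Node op, Node inputOf) {
        requireInPlan(op);
        requireInPlan(inputOf);
        inputOf.inputs.add(op);
        op.outputs.add(inputOf);
    }

    // Make op the sole input of inputOf; existing inputs move under op.
    void makeInput(Node op, Node inputOf) {
        requireInPlan(op);
        requireInPlan(inputOf);
        for (Node old : inputOf.inputs) {
            old.outputs.remove(inputOf);
            old.outputs.add(op);
            op.inputs.add(old);
        }
        inputOf.inputs.clear();
        inputOf.inputs.add(op);
        op.outputs.add(inputOf);
    }

    // Remove op and connect its inputs directly to its outputs.
    void remove(Node op) {
        requireInPlan(op);
        for (Node in : op.inputs) {
            in.outputs.remove(op);
            in.outputs.addAll(op.outputs);
        }
        for (Node out : op.outputs) {
            out.inputs.remove(op);
            out.inputs.addAll(op.inputs);
        }
        nodes.remove(op);
        op.inputs.clear();
        op.outputs.clear();
    }
}

class MiniPlanDemo {
    public static void main(String[] args) {
        MiniPlan plan = new MiniPlan();
        Node a = new Node("a"), b = new Node("b"), c = new Node("c");
        plan.add(a);
        plan.add(b);
        plan.add(c);
        plan.addInput(a, c);   // a is c's input
        plan.makeInput(b, c);  // now a feeds b, and b feeds c
        System.out.println(c.inputs.get(0).name + " feeds c");
        plan.remove(b);        // a feeds c again
        System.out.println(c.inputs.get(0).name + " feeds c");
    }
}
```

One simplification to note: when an operator is removed, its inputs are appended to the end of each output's input list, so input ordering is not preserved. A real implementation for order-sensitive operators would likely need to handle that.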
+ 
+ ==== LogicalOperator ====
+ All logical operators will be a subclass of !LogicalOperator.
+ !LogicalOperator itself will contain lists of the inputs and outputs of the
+ operator, the schema for the operator, and the data type of the operator.
+ 
+ {{{
+ abstract public class LogicalOperator {
+     private static final long serialVersionUID = 2L;
+ 
+     /**
+      * Schema associated with this logical operator.
+      */
+     protected Schema mSchema;
+ 
+     /**
+      * OperatorKey associated with this operator.  This key is used to find
+      * the operator in the LogicalPlan.
+      */
+     protected OperatorKey mKey;
+ 
+     /**
+      * Data type of the output of this operator.  Operators start out with
+      * data type set to UNKNOWN, and have it set for them by the type checker.
+      */
+     protected byte mType = DataType.UNKNOWN;
+ 
+     /**
+      * Requested level of parallelism for this operation.
+      */
+     protected int mRequestedParallelism;
+ 
+     /**
+      * References to an operator's inputs.
+      */
+     protected List<LogicalOperator> mInputs;
+ 
+     /**
+      * Back pointers so that the logical plan can be navigated in either
+      * direction.
+      */
+     protected List<LogicalOperator> mOutputs;
+ 
+     /**
+      * Equivalent to LogicalOperator(k, 0).
+      * @param k Operator key to assign to this node.
+      */
+     public LogicalOperator(OperatorKey k) {
+         this(k, 0);
+     }
+ 
+     /**
+      * @param k Operator key to assign to this node.
+      * @param rp degree of requested parallelism with which to execute this
+      * node.
+      */
+     public LogicalOperator(OperatorKey k, int rp) {
+               ...
+     }
+     
+     /**
+      * Get the operator key for this operator.
+      */
+     public OperatorKey getOperatorKey() {
+         return mKey;
+     }
+ 
+     /**
+      * Set the schema for this operator.
+      * @param schema Schema to set.
+      */
+     public void setSchema(Schema schema) {
+         mSchema = schema;
+     }
+ 
+     /**
+      * Get the schema for the output of this operator.
+      */
+     public Schema getSchema() {
+         return mSchema;
+     }
+ 
+     /**
+      * Set the type of this operator.  This should only be called by the type
+      * checking routines.
+      * @param t Type to set this operator to.
+      */
+     final public void setType(byte t) {
+         mType = t;
+     }
+ 
+     /**
+      * Get the type of this operator.
+      */
+     public byte getType() {
+         return mType;
+     }
+     
+     /**
+      * Get a list of all inputs to the operator.
+      */
+     public List<LogicalOperator> getInputs() {
+         return mInputs;
+     }
+ 
+     /**
+      * Get a list of all outputs of the operator.
+      */
+     public List<LogicalOperator> getOutputs() {
+         return mOutputs;
+     }
+ 
+     public abstract void visit(LOVisitor v) throws ParseException;
+     
+     public abstract String name();
+     
+     @Override
+     public String toString() {
+               ...
+     }
+ }
+ 
+ }}}
+ 
+ Each of the relational operators will be modeled as a logical operator.
+ There will be a class !ExpressionOperator that extends !LogicalOperator and
+ represents all types of expression operators.  The class hierarchy will look
+ like:
+ 
+ Extenders of !LogicalOperator:
+   * LOLoad
+   * LOStore
+   * LO!ForEach
+   * LOGenerate
+   * LOFilter
+   * LO!CoGroup
+   * LOSort
+   * LODistinct
+   * LOProject
+   * LO!MapLookup
+   * LOStream
+   * LOSplit
+   * LOUnion
+   * !ExpressionOperator (abstract, represents all expression types)
+ 
+ Extenders of !ExpressionOperator:
+   * !BinaryExpressionOperator (abstract, represents all binary expressions)
+   * !UnaryExpressionOperator (abstract, represents all unary expressions)
+   * LO!BinCond
+   * LOConst (constant values)
+   * LO!UserFunc (invocation of user defined function)
+   * LOParend
+   * LOCast
+   
+ Extenders of !BinaryExpressionOperator:
+   * LOAnd
+   * LOOr
+   * LO!GreaterThan
+   * LO!GreaterThanEqual
+   * LOEqual
+   * LO!LesserThan
+   * LO!LesserThanEqual
+   * LO!NotEqual
+   * LOAdd
+   * LOSubtract
+   * LOMultiply
+   * LODivide
+   * LOMod
+ 
+ Extenders of !UnaryExpressionOperator:
+   * LONot
+   * LONegative
+ 
+ ==== Logical Plan Visitors ====
+ The method for accessing the logical plan will be a visitor class, !LOVisitor.
+ This class will contain the logic for traversing logical plans.  Any class
+ that needs to operate on the plan should extend this class.  The extending class
+ need not provide logic to navigate the logical plan (unless it needs to
+ navigate it in some non-standard way).  It just needs to provide logic for the
+ specific operations it wants to do on the tree.
+ 
+ LOVisitor:
+ {{{
+ abstract public class LOVisitor {
+ 
+     /**
+      * Extenders of this class should implement this to call either
+      * depthFirst() or dependencyOrder().  If order is not important,
+      * depthFirst() should be called, as it's faster.  If it is important
+      * that your nodes only be visited after all the nodes they depend on
+      * have been visited, then you should call dependencyOrder() instead.
+      */
+     public abstract void visit();
+ 
+      /**
+      * Only LOFilter.visit() and subclass implementations of this function
+      * should ever call this method.
+      */
+     void visitFilter(LOFilter f) throws ParseException {
+         f.getCondition().visit(this);
+         f.getInput().visit(this);
+     }
+ 
+       // And so on with other LO operators
+ 
+     /**
+      * Visit the graph in a depth first traversal.
+      */
+     protected void depthFirst() {
+         // TODO
+     }
+ 
+     /**
+      * Visit the graph in a way that guarantees that no node is visited before
+      * all the nodes it depends on (that is, all those giving it input) have
+      * already been visited.
+      */
+     protected void dependencyOrder() {
+         // TODO
+     }
+ }
+ 
+ }}}
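
The dependencyOrder() traversal is left as a TODO above. One possible way to get the stated guarantee (no node visited before all of its inputs) is a topological sort over the data flow edges. The sketch below uses Kahn's algorithm, which is an assumption of this example and not a design taken from the page. `PlanNode`, `TraversalSketch`, and `connect` are hypothetical names, and the "visit" is reduced to recording the node's name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical node type for illustration; not Pig's LogicalOperator.
class PlanNode {
    final String name;
    final List<PlanNode> inputs = new ArrayList<>();
    final List<PlanNode> outputs = new ArrayList<>();

    PlanNode(String name) { this.name = name; }

    // Record a data-flow edge from 'from' to 'to' in both directions.
    static void connect(PlanNode from, PlanNode to) {
        from.outputs.add(to);
        to.inputs.add(from);
    }
}

class TraversalSketch {
    // Returns node names so that every node appears after all of its inputs.
    static List<String> dependencyOrder(List<PlanNode> nodes) {
        Map<PlanNode, Integer> pending = new HashMap<>();
        Deque<PlanNode> ready = new ArrayDeque<>();
        for (PlanNode n : nodes) {
            pending.put(n, n.inputs.size());
            if (n.inputs.isEmpty()) {
                ready.add(n);  // nodes with no inputs can be visited at once
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            PlanNode n = ready.remove();
            order.add(n.name);
            for (PlanNode out : n.outputs) {
                // An output becomes ready once its last input is visited.
                if (pending.merge(out, -1, Integer::sum) == 0) {
                    ready.add(out);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        PlanNode load1 = new PlanNode("load1");
        PlanNode load2 = new PlanNode("load2");
        PlanNode cogroup = new PlanNode("cogroup");
        PlanNode store = new PlanNode("store");
        PlanNode.connect(load1, cogroup);
        PlanNode.connect(load2, cogroup);
        PlanNode.connect(cogroup, store);
        List<PlanNode> nodes = new ArrayList<>();
        nodes.add(store);
        nodes.add(cogroup);
        nodes.add(load2);
        nodes.add(load1);
        System.out.println(dependencyOrder(nodes));
    }
}
```

In a visitor built along these lines, the `order.add(n.name)` step would be replaced by the operator's `visit(this)` call, so that a type checker, for example, sees both loads before the cogroup that consumes them.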
+ 
+ 
+ === Physical Plan Structure ===
+ 
+ TBD
  
  === Logical to Physical Translation Scheme ===
