[Pig Wiki] Update of "PigExecutionModel" by ChrisOlston

2008-01-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by ChrisOlston:
http://wiki.apache.org/pig/PigExecutionModel

--
- = Pig Execution Model =
- 
- == Introduction ==
+ = Introduction =
  
  The goal is to decide how to structure operator evaluation pipelines in Pig. 
The major issues include whether data flow follows a push model or a pull 
model, whether an operator evaluation tree is multi-threaded or 
single-threaded, and what the API for user-defined functions (UDFs) looks like.
  
+ The new model needs to support arbitrary operator DAGs (which may arise in a 
single Pig program, or when jointly executing groups of interrelated programs).
+ 
- == Alternative Execution Models ==
+ = Alternative Execution Models =
  
  Some possible execution models are:
  
- === Model 1 ===
+ == Model 1 ==
  
 * One thread.
 * Top-level scheduler round-robins through "leaf" operators.
@@ -18, +18 @@

 * Special "split" operator buffers data that gets fed to multiple 
operators; at most one tuple gets buffered at each split point.
 * UDF API: declare zero or one input bag as "streaming"; init() hands it 
all data except the streaming bag; next() hands one tuple from streaming bag
  
- === Model 2 ===
+ == Model 2 ==
  
 * One thread per "leaf" operator.
 * Scheduling done by OS.
@@ -26, +26 @@

 * Split operator may buffer up to K tuples (or B bytes); if an operator 
tries to read too far ahead it gets blocked until other operators reading from 
the same buffer catch up.
 * Deadlock can arise; need to detect it and release it by relaxing the K/B 
constraint on one or more of the split buffers.
  
- === Discussion ===
+ == Discussion ==
  
-  Model 1 Drawbacks 
+ === Model 1 Drawbacks ===
  
 * underutilize multi-core systems? (depends on whether the policy is to 
assign several map/reduce tasks to a machine)
 * difficult (or impossible?) to support operations that require streaming 
access to multiple inputs (e.g., merge join, merge-based set difference, etc., 
which operate over pre-sorted input streams)
 * UDF APIs more complex?
  
-  Model 2 Drawbacks 
+ === Model 2 Drawbacks ===
  
 * thread synchronization overhead
 * complexity of multi-threaded implementation
  
-  Related Reading 
+ === Related Reading ===
  
 * Fjords paper 
    * paper: http://db.lcs.mit.edu/madden/html/madden_fjords.pdf


[Pig Wiki] Update of "PigExecutionModel" by ChrisOlston

2008-01-04 Thread Apache Wiki

The following page has been changed by ChrisOlston:
http://wiki.apache.org/pig/PigExecutionModel

--
- ---+ Pig Execution Model
+ = Pig Execution Model =
  
- ---++ Introduction
+ == Introduction ==
  
  The goal is to decide how to structure operator evaluation pipelines in Pig. 
The major issues include whether data flow follows a push model or a pull 
model, whether an operator evaluation tree is multi-threaded or 
single-threaded, and what the API for user-defined functions (UDFs) looks like.
  
- ---++ Alternative Execution Models
+ == Alternative Execution Models ==
  
  Some possible execution models are:
  
- ---+++ Model 1
+ === Model 1 ===
  
 * One thread.
 * Top-level scheduler round-robins through "leaf" operators.
@@ -18, +18 @@

 * Special "split" operator buffers data that gets fed to multiple 
operators; at most one tuple gets buffered at each split point.
 * UDF API: declare zero or one input bag as "streaming"; init() hands it 
all data except the streaming bag; next() hands one tuple from streaming bag
  
- ---+++ Model 2
+ === Model 2 ===
  
 * One thread per "leaf" operator.
 * Scheduling done by OS.
@@ -26, +26 @@

 * Split operator may buffer up to K tuples (or B bytes); if an operator 
tries to read too far ahead it gets blocked until other operators reading from 
the same buffer catch up.
 * Deadlock can arise; need to detect it and release it by relaxing the K/B 
constraint on one or more of the split buffers.
  
- ---+++ Discussion
+ === Discussion ===
  
- --- Model 1 Drawbacks
+  Model 1 Drawbacks 
  
 * underutilize multi-core systems? (depends on whether the policy is to 
assign several map/reduce tasks to a machine)
 * difficult (or impossible?) to support operations that require streaming 
access to multiple inputs (e.g., merge join, merge-based set difference, etc., 
which operate over pre-sorted input streams)
 * UDF APIs more complex?
  
- --- Model 2 Drawbacks
+  Model 2 Drawbacks 
  
 * thread synchronization overhead
 * complexity of multi-threaded implementation
  
- --- Related work
+  Related Reading 
  
-* Fjords paper 
[[http://db.lcs.mit.edu/madden/html/madden_fjords.pdf][paper]][[http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf][slides]]
+* Fjords paper 
+   * paper: http://db.lcs.mit.edu/madden/html/madden_fjords.pdf
+   * slides: 
http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf
-* Stream Programming Model / MIT StreamIt [waiting for pointer]
+* Stream Programming Model / MIT StreamIt 
+   * waiting for pointer
  


[Pig Wiki] Update of "PigExecutionModel" by ChrisOlston

2008-01-04 Thread Apache Wiki

The following page has been changed by ChrisOlston:
http://wiki.apache.org/pig/PigExecutionModel

New page:
---+ Pig Execution Model

---++ Introduction

The goal is to decide how to structure operator evaluation pipelines in Pig. 
The major issues include whether data flow follows a push model or a pull 
model, whether an operator evaluation tree is multi-threaded or 
single-threaded, and what the API for user-defined functions (UDFs) looks like.

---++ Alternative Execution Models

Some possible execution models are:

---+++ Model 1

   * One thread.
   * Top-level scheduler round-robins through "leaf" operators.
   * Each time an operator is invoked, it gets handed exactly one tuple.
   * Special "split" operator buffers data that gets fed to multiple operators; 
at most one tuple gets buffered at each split point.
   * UDF API: declare zero or one input bag as "streaming"; init() hands it all 
data except the streaming bag; next() hands one tuple from the streaming bag 
(a sketch of this contract appears below)

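Below is a rough, illustrative Java sketch of this contract; it is not actual 
Pig code, and every name in it (StreamingUDF, RoundRobinDriver, Tuple, DataBag, 
LeafOperator) is made up for this note. It only shows the single-threaded, 
one-tuple-per-invocation shape of Model 1; the split operator's one-tuple 
buffer is omitted.

<verbatim>
import java.util.List;
import java.util.Map;

// Placeholder types (not the real Pig classes).
interface Tuple {}
interface DataBag extends Iterable<Tuple> {}

// Model 1 UDF contract: at most one input bag is declared "streaming".
// init() receives all non-streaming inputs up front; next() is then
// called once per tuple of the streaming bag.
interface StreamingUDF {
    void init(Map<String, DataBag> nonStreamingInputs);
    void next(Tuple streamingTuple);   // exactly one tuple per call
    void finish();                     // flush any buffered output
}

// Single thread: the driver round-robins over the "leaf" operators and
// lets each one push exactly one tuple per turn.
class RoundRobinDriver {
    interface LeafOperator {
        boolean hasMoreInput();
        void pushOneTuple();           // advance the pipeline by one tuple
    }

    private final List<LeafOperator> leaves;

    RoundRobinDriver(List<LeafOperator> leaves) { this.leaves = leaves; }

    void run() {
        boolean progress = true;
        while (progress) {
            progress = false;
            for (LeafOperator leaf : leaves) {
                if (leaf.hasMoreInput()) {
                    leaf.pushOneTuple();
                    progress = true;
                }
            }
        }
    }
}
</verbatim>

The point of the sketch is only that scheduling and data movement both happen 
on the single driver thread.
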
---+++ Model 2

   * One thread per "leaf" operator.
   * Scheduling done by OS.
   * Operator gets to read as many tuples as it wants; if it reads from 
multiple inputs, can interleave "next()" calls on the inputs in arbitrary 
fashion.
   * Split operator may buffer up to K tuples (or B bytes); if an operator 
tries to read too far ahead it gets blocked until other operators reading from 
the same buffer catch up (a sketch of this buffer appears below).
   * Deadlock can arise; need to detect it and release it by relaxing the K/B 
constraint on one or more of the split buffers.

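A rough Java sketch of the bounded split buffer (again illustrative only, not 
Pig code): each consumer operator runs on its own thread and calls next() with 
its own reader id; a reader that would get more than K tuples ahead of the 
slowest reader blocks until the others catch up. The deadlock detection and 
K/B relaxation described above are left out.

<verbatim>
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical Model 2 split buffer. Several downstream operators, each
// on its own thread, read the same upstream stream via next(reader).
// At most k tuples are kept between the slowest and fastest reader.
class SplitBuffer<T> {
    private final Iterator<T> upstream;           // shared input
    private final int k;                          // max buffered tuples
    private final List<T> window = new ArrayList<>();
    private long base = 0;                        // seq number of window.get(0)
    private final long[] positions;               // next seq each reader wants

    SplitBuffer(Iterator<T> upstream, int k, int readers) {
        this.upstream = upstream;
        this.k = k;
        this.positions = new long[readers];
    }

    /** Next tuple for this reader, or null at end of stream. */
    synchronized T next(int reader) throws InterruptedException {
        long want = positions[reader];
        // If the wanted tuple has not been pulled from upstream yet, pull
        // it, but only if the window has room; otherwise this reader is
        // too far ahead and must wait for the others to catch up.
        while (want >= base + window.size()) {
            if (window.size() >= k) {
                wait();                           // blocked: too far ahead
                continue;
            }
            if (!upstream.hasNext()) return null; // end of stream
            window.add(upstream.next());
        }
        T t = window.get((int) (want - base));
        positions[reader] = want + 1;
        trim();                                   // drop fully-consumed tuples
        notifyAll();                              // wake blocked fast readers
        return t;
    }

    private void trim() {
        long slowest = Long.MAX_VALUE;
        for (long p : positions) slowest = Math.min(slowest, p);
        while (base < slowest && !window.isEmpty()) {
            window.remove(0);
            base++;
        }
    }
}
</verbatim>

A real implementation would also need the deadlock detector mentioned above, 
which would enlarge k (or the byte bound B) for a buffer that sits on a cycle 
of blocked readers.
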
---+++ Discussion

--- Model 1 Drawbacks

   * underutilize multi-core systems? (depends on whether the policy is to assign 
several map/reduce tasks to a machine)
   * difficult (or impossible?) to support operations that require streaming 
access to multiple inputs (e.g., merge join, merge-based set difference, etc., 
which operate over pre-sorted input streams); the merge-join sketch below 
illustrates the access pattern
   * UDF APIs more complex?

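To make the second drawback concrete, here is a toy Java merge join over two 
pre-sorted key streams (illustrative only; duplicate keys are ignored for 
brevity). The operator decides after every comparison which input to advance, 
so it needs pull-style access to both inputs (natural under Model 2, but hard 
under Model 1, where an operator is handed one tuple at a time from a single 
stream).

<verbatim>
import java.util.Iterator;
import java.util.List;

// Toy merge join over two pre-sorted streams of keys (no duplicate-key
// handling). Reads on the two inputs are interleaved in a data-dependent
// order, which is the pattern Model 1's one-tuple push makes hard.
class MergeJoinSketch {
    static void mergeJoin(Iterator<Integer> left, Iterator<Integer> right) {
        Integer l = left.hasNext() ? left.next() : null;
        Integer r = right.hasNext() ? right.next() : null;
        while (l != null && r != null) {
            int cmp = l.compareTo(r);
            if (cmp == 0) {
                System.out.println("join match: " + l);
                l = left.hasNext() ? left.next() : null;    // advance both sides
                r = right.hasNext() ? right.next() : null;
            } else if (cmp < 0) {
                l = left.hasNext() ? left.next() : null;    // advance left only
            } else {
                r = right.hasNext() ? right.next() : null;  // advance right only
            }
        }
    }

    public static void main(String[] args) {
        // Prints matches for 3 and 5.
        mergeJoin(List.of(1, 3, 5).iterator(), List.of(2, 3, 5).iterator());
    }
}
</verbatim>
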
--- Model 2 Drawbacks

   * thread synchronization overhead
   * complexity of multi-threaded implementation

--- Related work

   * Fjords paper 
[[http://db.lcs.mit.edu/madden/html/madden_fjords.pdf][paper]][[http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf][slides]]
   * Stream Programming Model / MIT StreamIt [waiting for pointer]