[Pig Wiki] Update of "PigExecutionModel" by ChrisOlston
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by ChrisOlston: http://wiki.apache.org/pig/PigExecutionModel -- - = Pig Execution Model = - - == Introduction == + = Introduction = The goal is to decide how to structure operator evaluation pipelines in Pig. The major issues include whether data flow follows a push model or a pull model, whether an operator evaluation tree is multi-threaded or single-threaded, and what the API for user-defined functions (UDFs) looks like. + The new model needs to support arbitrary operator DAGs (which may arise in a single pig program, or when jointly executing groups of interrelated programs). + - == Alternative Execution Models == + = Alternative Execution Models = Some possible execution models are: - === Model 1 === + == Model 1 == * One thread. * Top-level scheduler round-robins through "leaf" operators. @@ -18, +18 @@ * Special "split" operator buffers data that gets fed to multiple operators; at most one tuple gets buffered at each split point. * UDF API: declare zero or one input bag as "streaming"; init() hands it all data except the streaming bag; next() hands one tuple from streaming bag - === Model 2 === + == Model 2 == * One thread per "leaf" operator. * Scheduling done by OS. @@ -26, +26 @@ * Split operator may buffer up to K tuples (or B bytes); if an operator tries to read too far ahead it gets blocked until other operators reading from the same buffer catch up. * Deadlock can arise; need to detect it and release it by relaxing the K/B constraint on one or more of the split buffers. - === Discussion === + == Discussion == - Model 1 Drawbacks + === Model 1 Drawbacks === * underutilize multi-core systems? (depends whether the policy is to assign several map/reduce tasks to a machine) * difficult (or impossible?) to support operations that require streaming access to multiple inputs (e.g., merge join, merge-based set difference, etc. which operate over pre-sorted input streams) * UDF APIs more complex? - Model 2 Drawbacks + === Model 2 Drawbacks === * thread synchronization overhead * complexity of multi-threaded implementation - Related Reading + === Related Reading === * Fjords paper * paper: http://db.lcs.mit.edu/madden/html/madden_fjords.pdf
[Pig Wiki] Update of "PigExecutionModel" by ChrisOlston
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by ChrisOlston: http://wiki.apache.org/pig/PigExecutionModel -- - ---+ Pig Execution Model + = Pig Execution Model = - ---++ Introduction + == Introduction == The goal is to decide how to structure operator evaluation pipelines in Pig. The major issues include whether data flow follows a push model or a pull model, whether an operator evaluation tree is multi-threaded or single-threaded, and what the API for user-defined functions (UDFs) looks like. - ---++ Alternative Execution Models + == Alternative Execution Models == Some possible execution models are: - ---+++ Model 1 + === Model 1 === * One thread. * Top-level scheduler round-robins through "leaf" operators. @@ -18, +18 @@ * Special "split" operator buffers data that gets fed to multiple operators; at most one tuple gets buffered at each split point. * UDF API: declare zero or one input bag as "streaming"; init() hands it all data except the streaming bag; next() hands one tuple from streaming bag - ---+++ Model 2 + === Model 2 === * One thread per "leaf" operator. * Scheduling done by OS. @@ -26, +26 @@ * Split operator may buffer up to K tuples (or B bytes); if an operator tries to read too far ahead it gets blocked until other operators reading from the same buffer catch up. * Deadlock can arise; need to detect it and release it by relaxing the K/B constraint on one or more of the split buffers. - ---+++ Discussion + === Discussion === - --- Model 1 Drawbacks + Model 1 Drawbacks * underutilize multi-core systems? (depends whether the policy is to assign several map/reduce tasks to a machine) * difficult (or impossible?) to support operations that require streaming access to multiple inputs (e.g., merge join, merge-based set difference, etc. which operate over pre-sorted input streams) * UDF APIs more complex? - --- Model 2 Drawbacks + Model 2 Drawbacks * thread synchronization overhead * complexity of multi-threaded implementation - --- Related work + Related Reading -* Fjords paper [[http://db.lcs.mit.edu/madden/html/madden_fjords.pdf][paper]][[http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf][slides]] +* Fjords paper + * paper: http://db.lcs.mit.edu/madden/html/madden_fjords.pdf + * slides: http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf -* Stream Programming Model / MIT StreamIt [waiting for pointer] +* Stream Programming Model / MIT StreamIt + * waiting for pointer
[Pig Wiki] Update of "PigExecutionModel" by ChrisOlston
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by ChrisOlston: http://wiki.apache.org/pig/PigExecutionModel New page: ---+ Pig Execution Model ---++ Introduction The goal is to decide how to structure operator evaluation pipelines in Pig. The major issues include whether data flow follows a push model or a pull model, whether an operator evaluation tree is multi-threaded or single-threaded, and what the API for user-defined functions (UDFs) looks like. ---++ Alternative Execution Models Some possible execution models are: ---+++ Model 1 * One thread. * Top-level scheduler round-robins through "leaf" operators. * Each time an operator is invoked, it gets handed exactly one tuple. * Special "split" operator buffers data that gets fed to multiple operators; at most one tuple gets buffered at each split point. * UDF API: declare zero or one input bag as "streaming"; init() hands it all data except the streaming bag; next() hands one tuple from streaming bag ---+++ Model 2 * One thread per "leaf" operator. * Scheduling done by OS. * Operator gets to read as many tuples as it wants; if it reads from multiple inputs, can interleave "next()" calls on the inputs in arbitrary fashion. * Split operator may buffer up to K tuples (or B bytes); if an operator tries to read too far ahead it gets blocked until other operators reading from the same buffer catch up. * Deadlock can arise; need to detect it and release it by relaxing the K/B constraint on one or more of the split buffers. ---+++ Discussion --- Model 1 Drawbacks * underutilize multi-core systems? (depends whether the policy is to assign several map/reduce tasks to a machine) * difficult (or impossible?) to support operations that require streaming access to multiple inputs (e.g., merge join, merge-based set difference, etc. which operate over pre-sorted input streams) * UDF APIs more complex? --- Model 2 Drawbacks * thread synchronization overhead * complexity of multi-threaded implementation --- Related work * Fjords paper [[http://db.lcs.mit.edu/madden/html/madden_fjords.pdf][paper]][[http://www.cs.umd.edu/class/fall2002/cmsc818s/Lectures/fjords.pdf][slides]] * Stream Programming Model / MIT StreamIt [waiting for pointer]