[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=5&rev2=6 -- } }}} + == Approach 3 == + At the Pig contributor workshop in June 2010 Dmitriy Ryaboy proposed that we go the DSL route in Java. Thus the example given above becomes something like: + + {{{ + + public class Main { + + public static void main(String[] args) { + float error = 100.0; + String infile = "original.data"; + PigBuilder pig = new PigBuilder(); + while (error > 1.0) { + PigRelation A = pig.load(infile, "piggybank.MyLoader"); + PigRelation B = A.group(pig.ALL); + // It's not entirely clear to me how nested foreach works in this scenario + PigRelation C = B.foreach(new MyFunc("A")); + + PigIterator pi = pig.openIterator(C, "outfile"); + Tuple t = pi.next(); + error = t.get(1); + if (error >= 1.0) { + pig.fs.mv('outfile', 'infile'); + } + } + } + } + }}} + + This would be accomplished by creating a public interface for Pig operators (here called !PigBuilder, but I'm not proposing that as the actual name) that would + construct a logical plan and execute it when openIterator is called, much as !PigServer does today. Another way to look at this is !PigServer could be changed to + expose Pig operators instead of just strings as it does today. + + The beauty of doing this in Java is it facilitates it being used in scripting languages as well. Since Java packages can be directly imported into Jython, JRuby, + Groovy, and other languages this immediately provides a scripting interface in the language of the users choice. + + This does violate requirement 10 above (that Pig Latin should appear the same in embedded and non-embedded form), but the cross language functionality may be worth + it. +
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=4&rev2=5 -- } }}} + === Other Thoughts === + Whichever way we do it, we need to consider what built in variables we need in the system. For example, it would be really nice to have a + status variable so that you could do something like: + + {{{ + ... + store X into 'foo'; + if ($status == 0) { -- or "success" or whatever + ... + } else { + ... + } + }}} +
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=3&rev2=4 -- Object outfile = new String("result.data"); while (error != null && (Double)error > 1.0) { PigServer ps = new PigServer(); - ps.registerQuery("A = load infile;"); + ps.registerQuery("A = load " + infile + ";"); ps.registerQuery("B = group A all;"); ps.registerQuery("C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);"); ps.registerQuery("error = foreach C generate error;");
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=2&rev2=3 -- Thoughts? Preferences for one of the options I did not like? Comments welcome. + == Approach 2 == + And now for something completely different. + + After thinking on the above for a week or so it occurs to me that in dismissing making Pig Latin itself Turing complete I am conflating two tasks + that could be decoupled. The first is defining a grammar for the language and extending the parser. The second is building an execution engine to execute + Pig Latin scripts. It is the second that I am concerned is too much work. Defining the grammar and building the parser is relatively easy (as + we say in the Pig team at Yahoo, "parsers are easy"). + + So what if we did extend Pig Latin itself to be Turing complete, but the first pass over the language was to compile it down to Java code that made + use of the existing !PigServer class to execute the code? This meets all ten requirements given above (some extra work will need to be done to meet + requirement 8 on up front semantic checking, but it is possible). It deals with my initial concern that supporting Turing completeness in Pig Latin + is too much work. It also has the exceedingly nice feature that we do not have to pick any one scripting language. The more I talked to people the + more I discovered some wanted Python, some Ruby, some Perl, some Groovy, etc. This avoids that problem. And the extensions to Pig Latin themselves + will be simple enough that it should not be onerous for people to learn it. It also means that at some future time if we decide that we want more + control over how the language is executed we can make changes without people needing to switch from whatever scripting language we embed it in. + + A significant downside to this proposal is now users have to have a Java compiler along to run their Pig Latin scripts. + + The other concerns I gave above about making Pig Latin Turing complete are somewhat addressed, but not totally. It would be possible, though + painful, to use a Java debugger on the generated Java code. Syntax highlighting and completion files could be created for Vim, Emacs, Eclipse, and + whatever other favorite editors people have. + + === Specifics === + The grammar of the language should be kept as simple as possible. The goal is not to create a general purpose programming language. + Tasks requiring these features should still be written in UDFs in Java or a scripting language. + + Each Pig Latin file would be considered as a module. All functions would have global scope within that module and would be visible once the module is + imported. + + The type system would be existing Pig Latin types (we may need to add a list type). Types would be bound at run time (this is necessary to support + existing PL grammar where A = load ... is a declaration of A). + + The grammar would look something like: + + {{{ + program: + import + | register + | define + | func_definition + | block + + import: + IMPORT _modulename_ namespace_clause + + namespace_clause: + (empty) + | AS _namespacename_ + + register: + ... // as now + + define: + ... // as now + + func_definition: + DEF _functionname_ ( arg_list ) { block } + // not sure about this, having DEF and DEFINE different keywords. + // May want to reuse DEFINE here or DEFINE FUNCTION + + arg_list: + expr + | arg_list , expr + + block: + statement + | block statement + + statement: + ; + | assignment + | if + | while + | for + | return // only valid inside functions + | CONTINUE ; // only valid inside loops + | BREAK ; // only valid inside loops + | split + | store + | dump + | fs + + assignment: + _var_ = expr ; + | _var_ = LOAD _inputsrc_ ; + ... // GROUP, FILTER, etc. as now + + statement_or_block: + statement + | { block } + + if: + IF ( expr ) statement_or_block else + + else: + (empty) + | ELSE statement_or_block + + while: + WHILE ( expr ) statement_or_block + + for: + FOR ( assignment ; expr ; expr ) statement_or_block + + return: + RETURN ; + | RETURN expr ; + + // split, dump, store, fs as now + }}} + + So the example given initially would look like: + {{{ + error = 100.0; + infile = 'original.data'; + outfile = 'result.data'; + while (error > 1.0) { + A = load infile; + B = group A all; + C = foreach B generate flatten(doSomeCalculation(A)) as (result, error); + error = foreach C generate error; + store C into outfile; + if (error > 1.0) fs mv outfile
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=1&rev2=2 -- = Making Pig Latin Turing Complete = == Introduction == As more users adopt Pig and begin writing their data processing in Pig Latin and as they use Pig to process more and more complex - tasks, a consistent request from these users is to add branches, loops, and functions to Pig Latin. This will enable Pig Latin to + tasks, a consistent request from these users has been to add branches, loops, and functions to Pig Latin. This will enable Pig Latin to process a whole new class of problems. Consider, for example, an algorithm that needs to iterate until an error estimate is less than a given threshold. This might look like (this just suggests logic, not syntax): @@ -22, +22 @@ == Requirements == The following should be provided by this Turing complete Pig Latin: - 1. Branching. This will be satisfied by a standard `if` `else if` `else` functionality + 1. Branching. This will be satisfied by a standard `if / else if / else` functionality 1. Looping. This should include standard `while` and some form of `for`. for could be C style or Python style (foreach). Care needs to be taken to select syntax that does not cause confusion with the existing `foreach` operator in Pig Latin. 1. Functions. 1. Modules. @@ -49, +49 @@ * Which scripting language to choose? Perl, Python, and Ruby all have significant adoption and could make a claim to be the best choice. * Syntactic and semantic checking is usually delayed until an embedded bit of code is reached in the outer control flow. Given that Pig jobs can run for hours this can mean spending hours to discover a simple typo. - Consider for example if built a python class that wrapped !PigServer and then translated the above code snippet. + Consider for example if Pig provided a Jython class that wrapped !PigServer and then we translated the above code snippet. {{{ error = 100.0 @@ -68, +68 @@ grunt.exec("fs mv 'outfile' 'infile'") }}} - All of these references to `pig` and `grunt` as objects with command strings is undesirable. + All of these references to `pig` and `grunt` as objects with command strings are undesirable. So while I believe that embedding is a much better approach due to the lower work load and the plethora of tools available for other languages, I do not believe the above is an acceptable way to do it. Thus I would like to place three additional requirements on embedded Pig Latin beyond those given above for Turing complete Pig Latin: @@ -79, +79 @@ This overcomes two of the three drawbacks noted above. It does not provide for a way to do certain optimizations such as loop unrolling, but I think that is acceptable. + Having rejected the quote style of programming we could choose the Domain Specific Language (DSL) option, where we define Pig operators in the + target language. Again using Python as an example: + + {{{ +error = 100.0 +infile = 'original.data' +pig = PigServer() +grunt = Grunt() +while error > 1.0: +A = pig.load(infile, { 'loader' => 'piggybank.MyLoader'}); +B = A.group(pig.ALL); +C = B.foreach { + innerBag = doSomeCalculation(:A); + generate innerBag.flatten().as(:result, :error) +} + +PigIterator pi = pig.openIterator(C, 'outfile'); +output = grunt.fs.cat('outfile'"); +bla = output.partition("\t"); +error = bla(2) +if error >= 1.0: +grunt.fs.mv('outfile', 'infile'); + }}} + + This meets requirements 7 and 9 above. It can partially but not fully meet 8. It can check that we use the right operators and pass + them the right types. It cannot check the semantics of the operators, for example that `infile` exists and is readable. This might be ok, + because it might turn out that things that cannot be checked at script compile time should not be checked up front anyway. As an example, it should not + check for `infile` up front because the script may not have created it yet. + + This approach has the advantage that it will integrate very nicely with tools from the target language. Debuggers, IDE, etc. will all now + view some form of Pig Latin as native to their language. + + It does however have drawback, which is that what we would be creating a new dialect of Pig Latin. There would be a Pig Latin dialect used when writing it + directly, and a different dialect for embedding. This leads to confusion and duplication of effort. So I would like to suggest another + requirement: + + 1.#10 Pig Latin should appear the same in the embedded form as in the non-embedded form. +
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig -- New page: = Making Pig Latin Turing Complete = == Introduction == As more users adopt Pig and begin writing their data processing in Pig Latin and as they use Pig to process more and more complex tasks, a consistent request from these users is to add branches, loops, and functions to Pig Latin. This will enable Pig Latin to process a whole new class of problems. Consider, for example, an algorithm that needs to iterate until an error estimate is less than a given threshold. This might look like (this just suggests logic, not syntax): {{{ error = 100.0; infile = 'original.data'; while (error > 1.0) { A = load 'infile'; B = group A all; C = foreach B generate flatten(doSomeCalculation(A)) as (result, error); error = foreach C generate error; store C into 'outfile'; if (error > 1.0) mv 'outfile' 'infile'; } }}} == Requirements == The following should be provided by this Turing complete Pig Latin: 1. Branching. This will be satisfied by a standard `if` `else if` `else` functionality 1. Looping. This should include standard `while` and some form of `for`. for could be C style or Python style (foreach). Care needs to be taken to select syntax that does not cause confusion with the existing `foreach` operator in Pig Latin. 1. Functions. 1. Modules. 1. The ability to use local in memory variables in the Pig Latin script. For example, in the snippet given above the way `infile` is defined above the `while` and then used in the `load`. 1. The ability to "store" results into local in memory variables. For example, in the snippet given above the way the error calculation from the data processing is stored into `error` in the line `error = foreach C generate error;`. == Approach == There are two possible approaches to this. One is to add all of these features to Pig Latin itself. This has several advantages: * All Pig Latin operations will be first class objects in the language. There will not be a need to do quoted programming, like what happens when JDBC is used to write SQL inside a Java program. * There will be opportunities to do optimizations that are not available in embedded programming, such as loop unrolling, etc. However, the cost of this approach is incredible. It means turning Pig Latin into a full scripting language. And it means all kinds of tools like debuggers, etc. will never be available for Pig Latin users because the Pig team will not have the resources or expertise to develop and maintain such tools. And finally, does the world need another scripting language that starts with P? The second possible approach to this is to embed Pig Latin into an existing scripting language, such as Perl, Python, Ruby, etc. The advantages of this are: * Most of the requirements noted above (branching, looping, functions, and modules) are present in these languages. * For any of these languages whole hosts of tools such as debuggers, IDEs, etc. exist and could be used. * Users do not have to learn a new language. There are a few significant drawbacks to this approach: * It leads to a quoted programming style which is unnatural and irritating for developers. * Which scripting language to choose? Perl, Python, and Ruby all have significant adoption and could make a claim to be the best choice. * Syntactic and semantic checking is usually delayed until an embedded bit of code is reached in the outer control flow. Given that Pig jobs can run for hours this can mean spending hours to discover a simple typo. Consider for example if built a python class that wrapped !PigServer and then translated the above code snippet. {{{ error = 100.0 infile = 'original.data' pig = PigServer() grunt = Grunt() while error > 1.0: pig.registerQuery("A = load 'infile'; \ B = group A all; \ C = foreach B generate flatten(doSomeCalculation(A)) as (result, error); \ PigIterator pi = pig.openIterator("C", 'outfile'); output = grunt.exec("fs cat 'outfile'"); bla = output.partition("\t"); error = bla(2) if error >= 1.0: grunt.exec("fs mv 'outfile' 'infile'") }}} All of these references to `pig` and `grunt` as objects with command strings is undesirable. So while I believe that embedding is a much better approach due to the lower work load and the plethora of tools available for other languages, I do not believe the above is an acceptable way to do it. Thus I would like to place three additional requirements on embedded Pig Latin beyond those given above for Turing complete Pig Latin: 1.#7 Pig Latin should appear as