[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-07-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=5&rev2=6

--

  }
  }}}
  
+ == Approach 3 ==
+ At the Pig contributor workshop in June 2010 Dmitriy Ryaboy proposed that we 
go the DSL route in Java.  Thus the example given above becomes something like:
+ 
+ {{{
+ 
+ public class Main {
+ 
+ public static void main(String[] args) {
+ float error = 100.0;
+ String infile = "original.data";
+ PigBuilder pig = new PigBuilder();
+ while (error > 1.0) {
+ PigRelation A = pig.load(infile, "piggybank.MyLoader");
+ PigRelation B = A.group(pig.ALL);
+ // It's not entirely clear to me how nested foreach works in this 
scenario
+ PigRelation C = B.foreach(new MyFunc("A"));
+
+ PigIterator pi = pig.openIterator(C, "outfile");
+ Tuple t = pi.next();
+ error = t.get(1);
+ if (error >= 1.0) {
+ pig.fs.mv('outfile', 'infile');
+ }
+ }
+ }
+ }
+ }}}
+ 
+ This would be accomplished by creating a public interface for Pig operators 
(here called !PigBuilder, but I'm not proposing that as the actual name) that 
would
+ construct a logical plan and execute it when openIterator is called, much as 
!PigServer does today.  Another way to look at this is !PigServer could be 
changed to
+ expose Pig operators instead of just strings as it does today.  
+ 
+ The beauty of doing this in Java is it facilitates it being used in scripting 
languages as well.  Since Java packages can be directly imported into Jython, 
JRuby,
+ Groovy, and other languages this immediately provides a scripting interface 
in the language of the users choice.
+ 
+ This does violate requirement 10 above (that Pig Latin should appear the same 
in embedded and non-embedded form), but the cross language functionality may be 
worth
+ it.
+ 


[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=4&rev2=5

--

  }
  }}}
  
+ === Other Thoughts ===
+ Whichever way we do it, we need to consider what built in variables we need 
in the system.  For example, it would be really nice to have a
+ status variable so that you could do something like:
+ 
+ {{{
+ ...
+ store X into 'foo';
+ if ($status == 0) { -- or "success" or whatever
+ ...
+ } else {
+ ...
+ }
+ }}}
+ 


[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=3&rev2=4

--

  Object outfile = new String("result.data");
  while (error != null && (Double)error > 1.0) {
  PigServer ps = new PigServer();
- ps.registerQuery("A = load infile;");
+ ps.registerQuery("A = load " + infile + ";");
  ps.registerQuery("B = group A all;");
  ps.registerQuery("C = foreach B generate 
flatten(doSomeCalculation(A)) as (result, error);");
  ps.registerQuery("error = foreach C generate error;");


[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=2&rev2=3

--

  
  Thoughts?  Preferences for one of the options I did not like?  Comments 
welcome.
  
+ == Approach 2 ==
+ And now for something completely different.
+ 
+ After thinking on the above for a week or so it occurs to me that in 
dismissing making Pig Latin itself Turing complete I am conflating two tasks
+ that could be decoupled.  The first is defining a grammar for the language 
and extending the parser.  The second is building an execution engine to execute
+ Pig Latin scripts.  It is the second that I am concerned is too much work.  
Defining the grammar and building the parser is relatively easy (as
+ we say in the Pig team at Yahoo, "parsers are easy").
+ 
+ So what if we did extend Pig Latin itself to be Turing complete, but the 
first pass over the language was to compile it down to Java code that made
+ use of the existing !PigServer class to execute the code?  This meets all ten 
requirements given above (some extra work will need to be done to meet
+ requirement 8 on up front semantic checking, but it is possible).  It deals 
with my initial concern that supporting Turing completeness in Pig Latin
+ is too much work.  It also has the exceedingly nice feature that we do not 
have to pick any one scripting language.  The more I talked to people the
+ more I discovered some wanted Python, some Ruby, some Perl, some Groovy, etc. 
 This avoids that problem.  And the extensions to Pig Latin themselves
+ will be simple enough that it should not be onerous for people to learn it.  
It also means that at some future time if we decide that we want more
+ control over how the language is executed we can make changes without people 
needing to switch from whatever scripting language we embed it in.
+ 
+ A significant downside to this proposal is now users have to have a Java 
compiler along to run their Pig Latin scripts.
+ 
+ The other concerns I gave above about making Pig Latin Turing complete are 
somewhat addressed, but not totally.  It would be possible, though
+ painful, to use a Java debugger on the generated Java code.  Syntax 
highlighting and completion files could be created for Vim, Emacs, Eclipse, and
+ whatever other favorite editors people have.
+ 
+ === Specifics ===
+ The grammar of the language should be kept as simple as possible.  The goal 
is not to create a general purpose programming language.
+ Tasks requiring these features should still be written in UDFs in Java or a 
scripting language.
+ 
+ Each Pig Latin file would be considered as a module.  All functions would 
have global scope within that module and would be visible once the module is
+ imported.
+ 
+ The type system would be existing Pig Latin types (we may need to add a list 
type).  Types would be bound at run time (this is necessary to support
+ existing PL grammar where A = load ... is a declaration of A).
+ 
+ The grammar would look something like:
+ 
+ {{{
+ program:
+   import
+ | register
+ | define
+ | func_definition
+ | block
+ 
+ import:
+   IMPORT _modulename_ namespace_clause
+ 
+ namespace_clause:
+   (empty)
+ | AS _namespacename_
+ 
+ register:
+   ... // as now
+ 
+ define:
+   ... // as now
+ 
+ func_definition:
+   DEF _functionname_ ( arg_list ) { block }
+   // not sure about this, having DEF and DEFINE different keywords.
+   // May want to reuse DEFINE here or DEFINE FUNCTION
+ 
+ arg_list:
+   expr
+ | arg_list , expr
+ 
+ block:
+   statement
+ | block statement
+ 
+ statement:
+   ;
+ | assignment
+ | if
+ | while
+ | for
+ | return // only valid inside functions
+ | CONTINUE ; // only valid inside loops
+ | BREAK ; // only valid inside loops
+ | split
+ | store
+ | dump
+ | fs
+ 
+ assignment:
+   _var_ = expr ;
+ | _var_ = LOAD _inputsrc_ ;
+ ... // GROUP, FILTER, etc. as now
+ 
+ statement_or_block:
+   statement
+ | { block }
+ 
+ if:
+   IF ( expr ) statement_or_block else
+ 
+ else:
+   (empty)
+ | ELSE statement_or_block
+ 
+ while:
+   WHILE ( expr ) statement_or_block
+ 
+ for:
+   FOR ( assignment ; expr ; expr ) statement_or_block
+ 
+ return:
+   RETURN ;
+ | RETURN expr ;
+ 
+ // split, dump, store, fs as now
+ }}}
+ 
+ So the example given initially would look like:
+ {{{
+ error = 100.0;
+ infile = 'original.data';
+ outfile = 'result.data';
+ while (error > 1.0) {
+ A = load infile;
+ B = group A all;
+ C = foreach B generate flatten(doSomeCalculation(A)) as (result, 
error);
+ error = foreach C generate error;
+ store C into outfile;
+ if (error > 1.0) fs mv outfile 

[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=1&rev2=2

--

  = Making Pig Latin Turing Complete =
  == Introduction ==
  As more users adopt Pig and begin writing their data processing in Pig Latin 
and as they use Pig to process more and more complex
- tasks, a consistent request from these users is to add branches, loops, and 
functions to Pig Latin.  This will enable Pig Latin to
+ tasks, a consistent request from these users has been to add branches, loops, 
and functions to Pig Latin.  This will enable Pig Latin to
  process a whole new class of problems.  Consider, for example, an algorithm 
that needs to iterate until an error estimate is less
  than a given threshold.  This might look like (this just suggests logic, not 
syntax):
  
@@ -22, +22 @@

  
  == Requirements ==
  The following should be provided by this Turing complete Pig Latin:
-  1. Branching.  This will be satisfied by a standard `if` `else if` `else` 
functionality
+  1. Branching.  This will be satisfied by a standard `if / else if / else` 
functionality
   1. Looping.  This should include standard `while` and some form of `for`.  
for could be C style or Python style (foreach).  Care needs to be taken to 
select syntax that does not cause confusion with the existing `foreach` 
operator in Pig Latin.
   1. Functions.  
   1. Modules.
@@ -49, +49 @@

   * Which scripting language to choose?  Perl, Python, and Ruby all have 
significant adoption and could make a claim to be the best choice.
   * Syntactic and semantic checking is usually delayed until an embedded bit 
of code is reached in the outer control flow.  Given that Pig jobs can run for 
hours this can mean spending hours to discover a simple typo.
  
- Consider for example if built a python class that wrapped !PigServer and then 
translated the above code snippet.
+ Consider for example if Pig provided a Jython class that wrapped !PigServer 
and then we translated the above code snippet.
  
  {{{
  error = 100.0
@@ -68, +68 @@

  grunt.exec("fs mv 'outfile' 'infile'")
  }}}
  
- All of these references to `pig` and `grunt` as objects with command strings 
is undesirable.
+ All of these references to `pig` and `grunt` as objects with command strings 
are undesirable.
  So while I believe that embedding is a much better approach due to the lower 
work load and the plethora of tools available for other
  languages, I do not believe the above is an acceptable way to do it.  Thus I 
would like to place three additional requirements on
  embedded Pig Latin beyond those given above for Turing complete Pig Latin:
@@ -79, +79 @@

  This overcomes two of the three drawbacks noted above.  It does not provide 
for a way to do certain optimizations such as loop
  unrolling, but I think that is acceptable.
  
+ Having rejected the quote style of programming we could choose the Domain 
Specific Language (DSL) option, where we define Pig operators in the
+ target language.  Again using Python as an example:
+ 
+ {{{
+error = 100.0
+infile = 'original.data'
+pig = PigServer()
+grunt = Grunt()
+while error > 1.0:
+A = pig.load(infile, { 'loader' => 'piggybank.MyLoader'});
+B = A.group(pig.ALL);
+C = B.foreach { 
+   innerBag = doSomeCalculation(:A);
+   generate innerBag.flatten().as(:result,  :error)
+}
+
+PigIterator pi = pig.openIterator(C, 'outfile');
+output = grunt.fs.cat('outfile'");
+bla = output.partition("\t");
+error = bla(2)
+if error >= 1.0:
+grunt.fs.mv('outfile', 'infile');
+ }}}
+ 
+ This meets requirements 7 and 9 above.  It can partially but not fully meet 
8.  It can check that we use the right operators and pass
+ them the right types.  It cannot check the semantics of the operators, for 
example that `infile` exists and is readable.  This might be ok,
+ because it might turn out that things that cannot be checked at script 
compile time should not be checked up front anyway.  As an example, it should 
not 
+ check for `infile` up front because the script may not have created it yet.
+ 
+ This approach has the advantage that it will integrate very nicely with tools 
from the target language.  Debuggers, IDE, etc. will all now
+ view some form of Pig Latin as native to their language.
+ 
+ It does however have drawback, which is that what we would be creating a new 
dialect of Pig Latin.  There would be a Pig Latin dialect used when writing it
+ directly, and a different dialect for embedding.  This leads to confusion and 
duplication of effort.  So I would like to suggest another
+ requirement:
+ 
+   1.#10 Pig Latin should appear the same in the embedded form as in the 
non-embedded form.
+ 

[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig

--

New page:
= Making Pig Latin Turing Complete =
== Introduction ==
As more users adopt Pig and begin writing their data processing in Pig Latin 
and as they use Pig to process more and more complex
tasks, a consistent request from these users is to add branches, loops, and 
functions to Pig Latin.  This will enable Pig Latin to
process a whole new class of problems.  Consider, for example, an algorithm 
that needs to iterate until an error estimate is less
than a given threshold.  This might look like (this just suggests logic, not 
syntax):

{{{
error = 100.0;
infile = 'original.data';
while (error > 1.0) {
A = load 'infile';
B = group A all;
C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
error = foreach C generate error;
store C into 'outfile';
if (error > 1.0) mv 'outfile' 'infile';
}
}}}

== Requirements ==
The following should be provided by this Turing complete Pig Latin:
 1. Branching.  This will be satisfied by a standard `if` `else if` `else` 
functionality
 1. Looping.  This should include standard `while` and some form of `for`.  for 
could be C style or Python style (foreach).  Care needs to be taken to select 
syntax that does not cause confusion with the existing `foreach` operator in 
Pig Latin.
 1. Functions.  
 1. Modules.
 1. The ability to use local in memory variables in the Pig Latin script.  For 
example, in the snippet given above the way `infile` is defined above the 
`while` and then used in the `load`.
 1. The ability to "store" results into local in memory variables.  For 
example, in the snippet given above the way the error calculation from the data 
processing is stored into `error` in the line `error = foreach C generate 
error;`.

== Approach ==
There are two possible approaches to this.  One is to add all of these features 
to Pig Latin itself.  This has several advantages:
 * All Pig Latin operations will be first class objects in the language.  There 
will not be a need to do quoted programming, like what happens when JDBC is 
used to write SQL inside a Java program.
 * There will be opportunities to do optimizations that are not available in 
embedded programming, such as loop unrolling, etc.

However, the cost of this approach is incredible.  It means turning Pig Latin 
into a full scripting language.  And it means
all kinds of tools like debuggers, etc. will never be available for Pig Latin 
users because the Pig team will not have the resources
or expertise to develop and maintain such tools.  And finally, does the world 
need another scripting language that starts with P?

The second possible approach to this is to embed Pig Latin into an existing 
scripting language, such as Perl, Python, Ruby, etc.  The
advantages of this are:
 * Most of the requirements noted above (branching, looping, functions, and 
modules) are present in these languages.
 * For any of these languages whole hosts of tools such as debuggers, IDEs, 
etc. exist and could be used.
 * Users do not have to learn a new language.

There are a few significant drawbacks to this approach:
 * It leads to a quoted programming style which is unnatural and irritating for 
developers.
 * Which scripting language to choose?  Perl, Python, and Ruby all have 
significant adoption and could make a claim to be the best choice.
 * Syntactic and semantic checking is usually delayed until an embedded bit of 
code is reached in the outer control flow.  Given that Pig jobs can run for 
hours this can mean spending hours to discover a simple typo.

Consider for example if built a python class that wrapped !PigServer and then 
translated the above code snippet.

{{{
error = 100.0
infile = 'original.data'
pig = PigServer()
grunt = Grunt()
while error > 1.0:
pig.registerQuery("A = load 'infile'; \
   B = group A all; \
   C = foreach B generate flatten(doSomeCalculation(A)) 
as (result, error); \
PigIterator pi = pig.openIterator("C", 'outfile');
output = grunt.exec("fs cat 'outfile'");
bla = output.partition("\t");
error = bla(2)
if error >= 1.0:
grunt.exec("fs mv 'outfile' 'infile'")
}}}

All of these references to `pig` and `grunt` as objects with command strings is 
undesirable.
So while I believe that embedding is a much better approach due to the lower 
work load and the plethora of tools available for other
languages, I do not believe the above is an acceptable way to do it.  Thus I 
would like to place three additional requirements on
embedded Pig Latin beyond those given above for Turing complete Pig Latin:
 1.#7 Pig Latin should appear as