RE: Revisit Pig Philosophy?

2009-09-21 Thread Santhosh Srinivasan
Hey Milind,

Varaha is a boar and not a pig :) I agree with you on the point that Pig
and Pig Latin have not been clearly defined and most times they are used
interchangeably.

Santhosh 

-----Original Message-----
From: Milind A Bhandarkar [mailto:mili...@yahoo-inc.com] 
Sent: Friday, September 18, 2009 8:02 PM
To: pig-dev@hadoop.apache.org
Cc: pig-dev@hadoop.apache.org
Subject: Re: Revisit Pig Philosophy?

It's Friday evening, so I have some time to discuss philosophy ;-)

Before we discuss any question about revisiting pig philosophy, the
first question that needs to be answered is what is pig ? (this
corresponds to the Hindu philosophy's basic argument, that any deep
personal philosophical investigations need to start with a question
koham? (in Sanskrit, it means 'who am I?'))

So, coming back to approx 4000 years after the origin of that
philosophy, we need to ask what is pig? (incidentally, pig, or varaaha
in Sanskrit, was the second incarnation of lord Vishnu in hindu
scriptures, but that's not relevant here.)

What we need to decide is, is pig a dataflow language? I think not.
Pig Latin is the language. Pig is referred to in countless slide decks
(aka pig scriptures; btw, I own 50% of these scriptures) as a runtime
system that interprets Pig Latin, kind of like Java and the JVM. (Duality
of nature, called dwaita philosophy in Sanskrit, is applicable here. But I
won't go deeper than that.)

So, pig-Latin-the-language's stance  could still be that it could be
implemented on any runtime. But pig the runtime's philosophy could be
that it is a thin layer on top of hadoop. And all the world could
breathe a sigh of relief. (mostly, by not having to answer these
philosophical questions.)

So, 'koham' is the 4000 year old question this project needs to answer.
That's all.

AUM.. (it's Friday.)

- (swami) Milind ;-)

On Sep 18, 2009, at 19:05, Jeff Hammerbacher ham...@cloudera.com
wrote:

 Hey,

 2. Local mode and other parallel frameworks

 <snip>
 Pigs Live Anywhere

 Pig is intended to be a language for parallel data processing. It is 
 not tied to one particular parallel framework. It has been 
 implemented first on hadoop, but we do not intend that to be only on 
 hadoop.
 </snip>

 Are we still holding onto this? What about local mode? Local mode is 
 not being treated on equal footing with that of Hadoop for practical 
 reasons. However, users expect things that work on local mode to work
 without any hitches on Hadoop.

 Are we still designing the system assuming that Pig will be stacked 
 on top of other parallel frameworks?


 FWIW, I appreciate this philosophical stance from Pig. Allowing 
 locally tested scripts to be migrated to the cluster without breakage 
 is a noble goal, and keeping the option of (one day) developing an 
 alternative execution environment for Pig that runs over HDFS but uses
 a richer physical set of operators than MapReduce would be great.

 Of course, those of you who are running Pig in production will have a 
 much better sense of the feasibility, rather than desirability, of 
 this philosophical stance.

 Later,
 Jeff


Re: Revisit Pig Philosophy?

2009-09-21 Thread Alan Gates
I agree with Milind that we should move to saying that Pig Latin is a  
data flow language independent of any particular platform, while the  
current implementation of Pig is tied to Hadoop.  I'm not sure how  
thin that implementation will be, but I'm in favor of making it thin  
where possible (such as the recent proposal to shift LoadFunc to  
directly use InputFormat).


I also strongly agree that we need to be more precise in our  
terminology between Pig (the platform) and Pig Latin (the language),  
especially as we're working on making Pig bilingual (with the addition  
of SQL).


I am fine with saying that Pig SQL adheres as much as possible (given  
the underlying systems, etc.) to ANSI SQL semantics.  And where there  
is shared functionality such as UDFs we again adhere to SQL semantics  
when it does not conflict with other Pig goals.  So COUNT and SUM  
should handle nulls the way SQL does, for example.  But we need to  
craft the statement carefully.  To see why, consider Pig's data  
model.  We would like our types to map nicely into SQL types, so that  
if Pig SQL users declare a column to be of type VARCHAR(32) or  
FLOAT(10) we can map those onto some Pig type.  But we don't want to  
use SQL types directly inside Pig, as they aren't a good match for  
much of Pig processing.  So any statement of using SQL semantics needs  
caveats.
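
To make the null-handling point concrete, here is a minimal sketch in Python (not Pig code) of what "handle nulls the way SQL does" would mean for COUNT and SUM. The function names sql_count and sql_sum are hypothetical and exist only for illustration; None stands in for a SQL NULL. The sketch assumes standard SQL aggregate behaviour: COUNT(column) skips nulls, and SUM skips nulls but returns NULL when there are no non-null inputs.

```python
from typing import Iterable, Optional


def sql_count(values: Iterable[Optional[float]]) -> int:
    """Mimic SQL COUNT(column): count only the non-null values."""
    return sum(1 for v in values if v is not None)


def sql_sum(values: Iterable[Optional[float]]) -> Optional[float]:
    """Mimic SQL SUM(column): ignore nulls; NULL if nothing is left."""
    non_null = [v for v in values if v is not None]
    # SQL returns NULL (here: None), not 0, when every input is null
    # or the input is empty.
    return sum(non_null) if non_null else None


if __name__ == "__main__":
    column = [1, None, 3]
    print(sql_count(column))
    print(sql_sum(column))
    print(sql_sum([None, None]))
```

Whether Pig's built-in UDFs match this in every corner case is exactly the kind of caveat the statement would need to spell out.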


I would also vote for modifying our Pigs Live Anywhere dictum to be:

Pig Latin is intended to be a language for parallel data processing.  
It is not tied to one particular parallel framework. The initial  
implementation of Pig is on Hadoop and seeks to leverage the power of  
Hadoop wherever possible.  However, nothing Hadoop-specific should be  
exposed in Pig Latin.


We may also want to add a vocabulary section to the philosophy  
statement to clarify the distinction between Pig and Pig Latin.


Alan.


Re: Revisit Pig Philosophy?

2009-09-21 Thread Amr Awadallah
 Pig Latin is intended to be a language for parallel data processing. 
It is not tied to one particular parallel framework


+1

-- amr

Re: Revisit Pig Philosophy?

2009-09-18 Thread Jeff Hammerbacher
Hey,

 2. Local mode and other parallel frameworks

 <snip>
 Pigs Live Anywhere

 Pig is intended to be a language for parallel data processing. It is not
 tied to one particular parallel framework. It has been implemented first
 on hadoop, but we do not intend that to be only on hadoop.
 </snip>

 Are we still holding onto this? What about local mode? Local mode is not
 being treated on equal footing with that of Hadoop for practical
 reasons. However, users expect things that work on local mode to work
 without any hitches on Hadoop.

 Are we still designing the system assuming that Pig will be stacked on
 top of other parallel frameworks?


FWIW, I appreciate this philosophical stance from Pig. Allowing locally
tested scripts to be migrated to the cluster without breakage is a noble
goal, and keeping the option of (one day) developing an alternative
execution environment for Pig that runs over HDFS but uses a richer physical
set of operators than MapReduce would be great.

Of course, those of you who are running Pig in production will have a much
better sense of the feasibility, rather than desirability, of this
philosophical stance.

Later,
Jeff


Re: Revisit Pig Philosophy?

2009-09-18 Thread Milind A Bhandarkar
It's Friday evening, so I have some time to discuss philosophy ;-)

Before we discuss any question about revisiting pig philosophy, the  
first question that needs to be answered is what is pig ? (this  
corresponds to the Hindu philosophy's basic argument, that any deep  
personal philosophical investigations need to start with a question  
koham? (in Sanskrit, it means 'who am I?'))

So, coming back to approx 4000 years after the origin of that  
philosophy, we need to ask what is pig? (incidentally, pig, or  
varaaha in Sanskrit, was the second incarnation of lord Vishnu in  
hindu scriptures, but that's not relevant here.)

What we need to decide is, is pig a dataflow language? I think  
not. Pig Latin is the language. Pig is referred to in countless  
slide decks (aka pig scriptures; btw, I own 50% of these scriptures)  
as a runtime system that interprets Pig Latin, kind of like Java and  
the JVM. (Duality of nature, called dwaita philosophy in Sanskrit, is  
applicable here. But I won't go deeper than that.)

So, pig-Latin-the-language's stance  could still be that it could be  
implemented on any runtime. But pig the runtime's philosophy could be  
that it is a thin layer on top of hadoop. And all the world could  
breathe a sigh of relief. (mostly, by not having to answer these  
philosophical questions.)

So, 'koham' is the 4000 year old question this project needs to  
answer. That's all.

AUM.. (it's Friday.)

- (swami) Milind ;-)
