RE: Revisit Pig Philosophy?
Hey Milind, Varaha is a boar, not a pig :) I agree with you that Pig and Pig Latin have not been clearly defined, and most of the time they are used interchangeably.

Santhosh

-----Original Message-----
From: Milind A Bhandarkar [mailto:mili...@yahoo-inc.com]
Sent: Friday, September 18, 2009 8:02 PM
To: pig-dev@hadoop.apache.org
Cc: pig-dev@hadoop.apache.org
Subject: Re: Revisit Pig Philosophy?
[...]
Re: Revisit Pig Philosophy?
I agree with Milind that we should move to saying that Pig Latin is a data flow language independent of any particular platform, while the current implementation of Pig is tied to Hadoop. I'm not sure how thin that implementation will be, but I'm in favor of making it thin where possible (such as the recent proposal to shift LoadFunc to directly use InputFormat).

I also strongly agree that we need to be more precise in our terminology, distinguishing Pig (the platform) from Pig Latin (the language), especially as we're working on making Pig bilingual (with the addition of SQL). I am fine with saying that Pig SQL adheres as much as possible (given the underlying systems, etc.) to ANSI SQL semantics, and that where there is shared functionality, such as UDFs, we again adhere to SQL semantics when they do not conflict with other Pig goals. So COUNT and SUM should handle nulls the way SQL does, for example.

But we need to craft the statement carefully. To see why, consider Pig's data model. We would like our types to map nicely onto SQL types, so that if Pig SQL users declare a column to be of type VARCHAR(32) or FLOAT(10), we can map those onto some Pig type. But we don't want to use SQL types directly inside Pig, as they aren't a good match for much of Pig's processing. So any statement about using SQL semantics needs caveats.

I would also vote for modifying our Pigs Live Anywhere dictum to be:

Pig Latin is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. The initial implementation of Pig is on Hadoop and seeks to leverage the power of Hadoop wherever possible. However, nothing Hadoop-specific should be exposed in Pig Latin.

We may also want to add a vocabulary section to the philosophy statement to clarify the distinction between Pig and Pig Latin.

Alan.
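The SQL null-handling behavior proposed above for COUNT and SUM can be made concrete. The sketch below is Python, not Pig source, and the function names are invented for illustration; it simply models the ANSI SQL rules that COUNT(col) skips nulls and that SUM returns null when every input is null.

```python
# A minimal model of SQL aggregate null semantics (illustration only;
# these are not Pig UDFs, just hypothetical reference functions).

def sql_count(values):
    # SQL's COUNT(col) counts only the non-null values.
    return sum(1 for v in values if v is not None)

def sql_sum(values):
    # SQL's SUM ignores nulls, and yields NULL (None here)
    # when there are no non-null inputs at all.
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None

print(sql_count([1, None, 3]))   # 2: the null is not counted
print(sql_sum([1, None, 3]))     # 4: the null is ignored
print(sql_sum([None, None]))     # None: all inputs were null
```

Under these rules, a builtin that naively counted every tuple, nulls included, would disagree with SQL; that is exactly the kind of shared-functionality mismatch being argued should be resolved in SQL's favor when it doesn't conflict with other Pig goals.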
On Sep 18, 2009, at 8:01 PM, Milind A Bhandarkar wrote:
[...]
Re: Revisit Pig Philosophy?
Pig Latin is intended to be a language for parallel data processing. It is not tied to one particular parallel framework.

+1

-- amr

Alan Gates wrote:
[...]
On Sep 18, 2009, at 8:01 PM, Milind A Bhandarkar wrote:
[...]
Re: Revisit Pig Philosophy?
Hey,

2. Local mode and other parallel frameworks

<snip>
Pigs Live Anywhere

Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on Hadoop, but we do not intend that to be only on Hadoop.
</snip>

Are we still holding onto this? What about local mode? Local mode is not treated on an equal footing with Hadoop, for practical reasons; however, users expect things that work in local mode to work without any hitches on Hadoop. Are we still designing the system assuming that Pig will be stacked on top of other parallel frameworks?

FWIW, I appreciate this philosophical stance from Pig. Allowing locally tested scripts to be migrated to the cluster without breakage is a noble goal, and keeping open the option of (one day) developing an alternative execution environment for Pig that runs over HDFS but uses a richer physical set of operators than MapReduce would be great. Of course, those of you who are running Pig in production will have a much better sense of the feasibility, rather than the desirability, of this philosophical stance.

Later,
Jeff
Re: Revisit Pig Philosophy?
It's Friday evening, so I have some time to discuss philosophy ;-)

Before we discuss any question about revisiting the Pig philosophy, the first question that needs to be answered is: what is Pig? (This corresponds to Hindu philosophy's basic argument that any deep personal philosophical investigation needs to start with the question "koham?" (Sanskrit for "who am I?").)

So, coming back approximately 4000 years after the origin of that philosophy, we need to ask: what is Pig? (Incidentally, the pig, or varaaha in Sanskrit, was the second incarnation of Lord Vishnu in Hindu scriptures, but that's not relevant here.)

What we need to decide is: is Pig a dataflow language? I think not. Pig Latin is the language. Pig is referred to in countless slide decks (a.k.a. pig scriptures; btw, I own 50% of those scriptures) as a runtime system that interprets Pig Latin, much like Java and the JVM. (Duality of nature, called dvaita philosophy in Sanskrit, is applicable here, but I won't go deeper than that.)

So Pig-Latin-the-language's stance could still be that it can be implemented on any runtime, while Pig-the-runtime's philosophy could be that it is a thin layer on top of Hadoop. And all the world could breathe a sigh of relief (mostly by not having to answer these philosophical questions).

So, 'koham' is the 4000-year-old question this project needs to answer. That's all. AUM. (It's Friday.)

- (swami) Milind ;-)

On Sep 18, 2009, at 19:05, Jeff Hammerbacher ham...@cloudera.com wrote:
[...]