[Pig Wiki] Update of "ProposedByLaws" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "ProposedByLaws" page has been changed by AlanGates. http://wiki.apache.org/pig/ProposedByLaws?action=diff&rev1=3&rev2=4 -- Voting can also be applied to changes already made to the Pig codebase. These typically take the form of a veto (-1) in reply to the commit message - sent when the commit is made. Note that this should be a rare occurance. + sent when the commit is made. Note that this should be a rare occurrence. All efforts should be made to discuss issues when they are still patches before the code is committed. === Approvals ===
[Pig Wiki] Update of "ProposedByLaws" by AlanGates
The "ProposedByLaws" page has been changed by AlanGates. http://wiki.apache.org/pig/ProposedByLaws?action=diff&rev1=2&rev2=3 -- In general votes should not be called at times when it is known that interested members of the project will be unavailable. - || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' || '''Length''' || + || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' || '''Minimum Length''' || || Code Change || A change made to a codebase of the project and committed by a committer. This includes source code, documentation, website content, etc. || Lazy approval (not counting the vote of the contributor), moving to lazy majority if a -1 is received || Active committers || 1 || || Release Plan || Defines the timetable and actions for a release. The plan also nominates a Release Manager. || Lazy majority || Active committers || 3 || || Product Release || When a release of one of the project's products is ready, a vote is required to accept the release as an official release of the project. || Lazy Majority || Active PMC members || 3 ||
[Pig Wiki] Update of "ProposedByLaws" by AlanGates
The "ProposedByLaws" page has been changed by AlanGates. http://wiki.apache.org/pig/ProposedByLaws?action=diff&rev1=1&rev2=2 -- perception of an action in the wider Pig community. For PMC decisions, only the votes of PMC members are binding. - Voting can also be applied to changes made to the Pig codebase. These + Voting can also be applied to changes already made to the Pig codebase. These typically take the form of a veto (-1) in reply to the commit message - sent when the commit is made. + sent when the commit is made. Note that this should be a rare occurance. + All efforts should be made to discuss issues when they are still patches before the code is committed. === Approvals === These are the types of approvals that can be sought. Different actions @@ -171, +172 @@ === Actions === This section describes the various actions which are undertaken within the project, the corresponding approval required for that action and - those who have binding votes over the action. + those who have binding votes over the action. It also specifies the minimum length of time that a vote must remain open, measured in business days. + In general votes should not be called at times when it is + known that interested members of the project will be unavailable. - || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' || + || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' || '''Length''' || - || Code Change || A change made to a codebase of the project and committed by a committer. This includes source code, documentation, website content, etc. || Lazy approval || Active committers || + || Code Change || A change made to a codebase of the project and committed by a committer. This includes source code, documentation, website content, etc. 
|| Lazy approval (not counting the vote of the contributor), moving to lazy majority if a -1 is received || Active committers || 1 || - || Release Plan || Defines the timetable and actions for a release. The plan also nominates a Release Manager. || Lazy majority || Active committers || + || Release Plan || Defines the timetable and actions for a release. The plan also nominates a Release Manager. || Lazy majority || Active committers || 3 || - || Product Release || When a release of one of the project's products is ready, a vote is required to accept the release as an official release of the project. || Lazy Majority || Active PMC members || + || Product Release || When a release of one of the project's products is ready, a vote is required to accept the release as an official release of the project. || Lazy Majority || Active PMC members || 3 || - || Adoption of New Codebase || When the codebase for an existing, released product is to be replaced with an alternative codebase. If such a vote fails to gain approval, the existing code base will continue. This also covers the creation of new sub-projects within the project. || 2/3 majority || Active PMC members '''NOTE''': Change from Hadoop proposal which had Active committers || + || Adoption of New Codebase || When the codebase for an existing, released product is to be replaced with an alternative codebase. If such a vote fails to gain approval, the existing code base will continue. This also covers the creation of new sub-projects within the project. || 2/3 majority || Active PMC members '''NOTE''': Change from Hadoop proposal which had Active committers || 6 || - || New Committer || When a new committer is proposed for the project. || Lazy consensus || Active PMC members || + || New Committer || When a new committer is proposed for the project. || Lazy consensus || Active PMC members || 3 || - || New PMC Member || When a committer is proposed for the PMC. 
|| Lazy consensus || Active PMC members || + || New PMC Member || When a committer is proposed for the PMC. || Lazy consensus || Active PMC members || 3 || - || Committer Removal || When removal of commit privileges is sought. '''Note:''' Such actions will also be referred to the ASF board by the PMC chair. || Consensus || Active PMC members (excluding the committer in question if a member of the PMC). || + || Committer Removal || When removal of commit privileges is sought. '''Note:''' Such actions will also be referred to the ASF board by the PMC chair. || Consensus || Active PMC members (excluding the committer in question if a member of the PMC). || 6 || - || PMC Member Removal || When removal of a PMC member is sought. '''Note:''' Such actions will also be referred to the ASF board by the PMC chair. || Consensus || Active PMC members (excluding the member in question). || + || PMC Member Removal || When removal of a PMC member is sought. '''Note:''' Such actions will also be referred to the ASF board b
[Pig Wiki] Update of "PigLatin" by jsha
The "PigLatin" page has been changed by jsha. http://wiki.apache.org/pig/PigLatin?action=diff&rev1=36&rev2=37 -- - '''THIS PAGE IS OBSOLETE. Please use documentation at http://hadoop.apache.org/pig/''' + '''THIS PAGE IS OBSOLETE. Please use Pig Latin documentation at http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html''' '''Note:''' For Pig 0.2.0 or later, some content on this page may no longer be applicable.
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by nirajrai
The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=158&rev2=159 -- ||2254 ||Currently merged cogroup is not supported after blocking operators. || ||2255 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It should have 2. || ||2256 ||Cannot remove and reconnect node with multiple inputs/outputs || + ||2257 ||An unexpected exception caused the validation to stop || ||2998 ||Unexpected internal error. || ||2999 ||Unhandled internal error. ||
[Pig Wiki] Update of "ProposedByLaws" by AlanGates
The "ProposedByLaws" page has been changed by AlanGates. http://wiki.apache.org/pig/ProposedByLaws -- New page: The following is a proposal for bylaws for the Apache Pig project. I took this almost verbatim from a proposal made by Owen O'Malley for the Hadoop Project. Places where I modified it I have tagged with '''NOTE'''. = Apache Pig Project Bylaws = This document defines the bylaws under which the Apache Pig project operates. It defines the roles and responsibilities of the project, who may vote, how voting works, how conflicts are resolved, etc. Pig is a project of the [[http://www.apache.org/foundation/|Apache Software Foundation]]. The foundation holds the copyright on Apache code including the code in the Pig codebase. The [[http://www.apache.org/foundation/faq.html|foundation FAQ]] explains the operation and background of the foundation. Pig is typical of Apache projects in that it operates under a set of principles, known collectively as the 'Apache Way'. If you are new to Apache development, please refer to the [[http://incubator.apache.org|Incubator project]] for more information on how Apache projects operate. == Roles and Responsibilities == Apache projects define a set of roles with associated rights and responsibilities. These roles govern what tasks an individual may perform within the project. The roles are defined in the following sections. === Users === The most important participants in the project are people who use our software. The majority of our contributors start out as users and guide their development efforts from the user's perspective. Users contribute to the Apache projects by providing feedback to contributors in the form of bug reports and feature suggestions. As well, users participate in the Apache community by helping other users on mailing lists and user support forums. 
=== Contributors === '''NOTE''': Changed from "Developers" in Hadoop proposal to "Contributors", and throughout. All of the volunteers who are contributing time, code, documentation, or resources to the Pig Project. A contributor that makes sustained, welcome contributions to the project may be invited to become a Committer, though the exact timing of such invitations depends on many factors. === Committers === The project's Committers are responsible for the project's technical management. Committers have access to a specified set of subproject's subversion repositories. Committers on subprojects may cast binding votes on any technical discussion regarding that subproject. Committer access is by invitation only and must be approved by lazy consensus of the active PMC members. A Committer is considered emeritus by their own declaration or by not contributing in any form to the project for over six months. An emeritus committer may request reinstatement of commit access from the PMC which will be sufficient to restore him or her to active committer status. '''NOTE''': Change from Hadoop proposal, added phrase "which will be sufficient..." and removed "Such reinstatement is subject to lazy consensus of active PMC members." Commit access can be revoked by a unanimous vote of all the active PMC members (except the committer in question if they are also a PMC member). All Apache committers are required to have a signed Contributor License Agreement (CLA) on file with the Apache Software Foundation. There is a [[http://www.apache.org/dev/committers.html|Committer FAQ]] which provides more details on the requirements for Committers. A committer who makes a sustained contribution to the project may be invited to become a member of the PMC. The form of contribution is not limited to code. It can also include code review, helping out users on the mailing lists, documentation, etc. 
=== Project Management Committee === The PMC is responsible to the board and the ASF for the management and oversight of the Apache Pig codebase. The responsibilities of the PMC include * Deciding what is distributed as products of the Apache Pig project. In particular all releases must be approved by the PMC. * Maintaining the project's shared resources, including the codebase repository, mailing lists, websites. * Speaking on behalf of the project. * Resolving license disputes regarding products of the project. * Nominating new PMC members and committers. * Maintaining these bylaws and other guidelines of the project. Membership of the PMC is by invitation only and must be approved by a lazy consensus of active PMC members. A PMC member is considered 'emeritus' by their own declaration or by not contributing in any form to the project for over six months. An emeritus member may request reinstatement to the PMC, which will be sufficient to restore him or her to active PMC member. '''NOTE''': Change from Hadoop proposal, added phrase "which will
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=157&rev2=158 -- ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. || ||2254 ||Currently merged cogroup is not supported after blocking operators. || ||2255 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It should have 2. || + ||2256 ||Cannot remove and reconnect node with multiple inputs/outputs || ||2998 ||Unexpected internal error. || ||2999 ||Unhandled internal error. ||
[Pig Wiki] Update of "SemanticsCleanup" by AlanGates
The "SemanticsCleanup" page has been changed by AlanGates. http://wiki.apache.org/pig/SemanticsCleanup?action=diff&rev1=2&rev2=3 -- || [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || Cogroup inner does not match the semantics of inner join. It is also not clear what value the inner keyword has for cogroup. Consider removing it. || || || [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Nested types || Remove two level access || Maybe, if we can find a way to ignore calls to Schema.isTwoLevelAccessRequired(). || || [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || Pick one semantic for schema merges and use it consistently throughout Pig || no || + || [[https://issues.apache.org/jira/browse/PIG-1371|PIG-1371]] || Nested types || unknown || || || [[https://issues.apache.org/jira/browse/PIG-1341|PIG-1341]] || Dynamic type binding || Close as won't fix || yes || || [[https://issues.apache.org/jira/browse/PIG-1281|PIG-1281]] || Dynamic type binding || In situations where a Hadoop shuffle key is assumed to be of type bytearray wrap the value in a tuple so that if the type is actually something else Hadoop can still process it. || yes || || [[https://issues.apache.org/jira/browse/PIG-1277|PIG-1277]] || Nested types || Unknown || || + || [[https://issues.apache.org/jira/browse/PIG-1222|PIG-1222]] || Dynamic type binding || The issue here is that Pig thinks the field is a bytearray while BinStorage actually produces a String. Need a way to handle these issues on the fly. || || || [[https://issues.apache.org/jira/browse/PIG-1188|PIG-1188]] || Schema || Make sure Pig handles missing data in Tuples by returning a null rather than failing. 
|| yes || || [[https://issues.apache.org/jira/browse/PIG-1112|PIG-1112]] || Schema || When user provides AS to flatten of undefined bag or tuple, the contents of that AS are taken to be the schema of the bag or tuple. || yes || || [[https://issues.apache.org/jira/browse/PIG-1065|PIG-1065]] || Dynamic type binding || In situations where a Hadoop shuffle key is assumed to be of type bytearray wrap the value in a tuple so that if the type is actually something else Hadoop can still process it. || yes || || [[https://issues.apache.org/jira/browse/PIG-999|PIG-999]] || Dynamic type binding || In situations where a Hadoop shuffle key is assumed to be of type bytearray wrap the value in a tuple so that if the type is actually something else Hadoop can still process it. || yes || + || [[https://issues.apache.org/jira/browse/PIG-847|PIG-847]] || Nested types || Remove two level access || maybe || + || [[https://issues.apache.org/jira/browse/PIG-828|PIG-828]] || Nested types || According to the rules of Pig Latin, this should produce a bag with one field. Need to make sure that is what Pig is trying to do in this case. || yes || || [[https://issues.apache.org/jira/browse/PIG-767|PIG-767]] || Nested types || Remove two level access; bring DUMP and DESCRIBE output into sync. || no || + || [[https://issues.apache.org/jira/browse/PIG-749|PIG-749]] || Schema || Related to PIG-1112 || yes || || [[https://issues.apache.org/jira/browse/PIG-730|PIG-730]] || Nested types || Make sure schema of union is the same as schema before union (suspect this is a two level access issue) || unclear || || [[https://issues.apache.org/jira/browse/PIG-723|PIG-723]] || Nested types || Suspect this is a two level access issue || unclear || || [[https://issues.apache.org/jira/browse/PIG-696|PIG-696]] || Dynamic type binding || Class cast exceptions such as this should result in a null value and a warning, not a failure. 
|| yes || || [[https://issues.apache.org/jira/browse/PIG-694|PIG-694]] || Nested types || Determine the semantics for merging tuples and bags. || unclear || + || [[https://issues.apache.org/jira/browse/PIG-678|PIG-678]] || Grammar || Decide whether we want to support this extension. || yes || || [[https://issues.apache.org/jira/browse/PIG-621|PIG-621]] || Dynamic type binding || Class cast exceptions such as this should result in a null value and a warning, not a failure. || yes || || [[https://issues.apache.org/jira/browse/PIG-435|PIG-435]] || Schema || Decide definitely on what it means when users declare a schema for a load. || unclear || || [[https://issues.apache.org/jira/browse/PIG-333|PIG-333]] || Dynamic type binding || Since it is specified that MIN and MAX treat unknown types as double, all the actual string data should be converted to NULLs, rather than cause errors. || yes || || [[https://issues.apache.org/jira/browse/PIG-313|PIG-313]] || Grammar || I propose that we continue not supporting this. But we should detect it at compile time rather than at runtime. || yes || + + Bugs I need to
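The PIG-1112 and PIG-749 rows above concern the semantics of an AS clause applied to FLATTEN of a bag with no declared inner schema. A minimal Pig Latin sketch of that pattern (the file name and field names are hypothetical, for illustration only):

```pig
-- b is a bag whose inner schema is left undefined
A = LOAD 'data' AS (id:int, b:bag{});
-- Per the PIG-1112 proposal, the AS clause after FLATTEN is taken
-- to be the schema of the flattened bag's contents
B = FOREACH A GENERATE id, FLATTEN(b) AS (x:int, y:chararray);
```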
[Pig Wiki] Update of "SemanticsCleanup" by AlanGates
The "SemanticsCleanup" page has been changed by AlanGates. http://wiki.apache.org/pig/SemanticsCleanup?action=diff&rev1=1&rev2=2 -- The bugs have been placed into the following categories: * Schema: These are related to schemas that are improperly inferred, etc. * Grammar: Places where the grammar is unclear or produces unexpected results. - * Two Level Access: The concept of two level access was introduced long ago to deal with oddities in bag schemas. Ideally we will remove this. At least we have to improve it. + * Nested Types: Issues dealing with bags, tuples, and maps. + * Dynamic Type Binding: In certain situations Pig assumes a value to be of type byte array when it does not know the actual type, and handles whatever actual type it is at runtime. There are situations where this does not work properly. == Bug Table == - || *JIRA* || *Category* || *Proposed Solution* || + || '''JIRA''' || '''Category''' || '''Proposed Solution''' || '''Backward Compatible''' || - || [[https://issues.apache.org/jira/browse/PIG-1627|PIG-1627]] || Schema || Flattening a bag with an unknown schema should produce a record with an unknown schema || + || [[https://issues.apache.org/jira/browse/PIG-1627|PIG-1627]] || Schema || Flattening a bag with an unknown schema should produce a record with an unknown schema || no || - || [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || Cogroup inner does not match the semantics of inner join. It is also not clear what value the inner keyword has for cogroup. || + || [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || Cogroup inner does not match the semantics of inner join. It is also not clear what value the inner keyword has for cogroup. Consider removing it. 
|| || - || [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Two level access || Remove two level access || + || [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Nested types || Remove two level access || Maybe, if we can find a way to ignore calls to Schema.isTwoLevelAccessRequired(). || - || [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || Pig one semantic for schema merges and use it consistently throughout Pig || + || [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || Pick one semantic for schema merges and use it consistently throughout Pig || no || + || [[https://issues.apache.org/jira/browse/PIG-1341|PIG-1341]] || Dynamic type binding || Close as won't fix || yes || + || [[https://issues.apache.org/jira/browse/PIG-1281|PIG-1281]] || Dynamic type binding || In situations where a Hadoop shuffle key is assumed to be of type bytearray wrap the value in a tuple so that if the type is actually something else Hadoop can still process it. || yes || + || [[https://issues.apache.org/jira/browse/PIG-1277|PIG-1277]] || Nested types || Unknown || || + || [[https://issues.apache.org/jira/browse/PIG-1188|PIG-1188]] || Schema || Make sure Pig handles missing data in Tuples by returning a null rather than failing. || yes || + || [[https://issues.apache.org/jira/browse/PIG-1112|PIG-1112]] || Schema || When user provides AS to flatten of undefined bag or tuple, the contents of that AS are taken to be the schema of the bag or tuple. || yes || + || [[https://issues.apache.org/jira/browse/PIG-1065|PIG-1065]] || Dynamic type binding || In situations where a Hadoop shuffle key is assumed to be of type bytearray wrap the value in a tuple so that if the type is actually something else Hadoop can still process it. 
|| yes || + || [[https://issues.apache.org/jira/browse/PIG-999|PIG-999]] || Dynamic type binding || In situations where a Hadoop shuffle key is assumed to be of type bytearray wrap the value in a tuple so that if the type is actually something else Hadoop can still process it. || yes || + || [[https://issues.apache.org/jira/browse/PIG-767|PIG-767]] || Nested types || Remove two level access; bring DUMP and DESCRIBE output into sync. || no || + || [[https://issues.apache.org/jira/browse/PIG-730|PIG-730]] || Nested types || Make sure schema of union is the same as schema before union (suspect this is a two level access issue) || unclear || + || [[https://issues.apache.org/jira/browse/PIG-723|PIG-723]] || Nested types || Suspect this is a two level access issue || unclear || + || [[https://issues.apache.org/jira/browse/PIG-696|PIG-696]] || Dynamic type binding || Class cast exceptions such as this should result in a null value and a warning, not a failure. || yes || + || [[https://issues.apache.org/jira/browse/PIG-694|PIG-694]] || Nested types || Determine the semantics for merging tuples and bags. || unclear || + || [[https://issues.apache.org/jira/browse/PIG-621|PIG-621]] || Dynamic type binding |
[Pig Wiki] Update of "SemanticsCleanup" by AlanGates
The "SemanticsCleanup" page has been changed by AlanGates. http://wiki.apache.org/pig/SemanticsCleanup -- New page: == Introduction == A number of bugs have been filed against Pig that roughly fall under the area of poorly defined or undefined semantics. In the 0.9 Pig release we would like to take on a number of these issues, clarifying semantics where they are unclear, defining them where they are undefined, and correcting them where they are clearly wrong. This page classifies the existing bugs and indicates what we believe the proper fix is for them. == Categories == The bugs have been placed into the following categories: * Schema: These are related to schemas that are improperly inferred, etc. * Grammar: Places where the grammar is unclear or produces unexpected results. * Two Level Access: The concept of two level access was introduced long ago to deal with oddities in bag schemas. Ideally we will remove this. At least we have to improve it. == Bug Table == || *JIRA* || *Category* || *Proposed Solution* || || [[https://issues.apache.org/jira/browse/PIG-1627|PIG-1627]] || Schema || Flattening a bag with an unknown schema should produce a record with an unknown schema || || [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || Cogroup inner does not match the semantics of inner join. It is also not clear what value the inner keyword has for cogroup. || || [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Two level access || Remove two level access || || [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || Pig one semantic for schema merges and use it consistently throughout Pig ||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by nirajrai
The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=156&rev2=157 -- ||2252 ||Base loader in Cogroup must implement CollectableLoadFunc. || ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. || ||2254 ||Currently merged cogroup is not supported after blocking operators. || - ||2255 ||Base loader in Cogroup must implement CollectableLoadFunc. || - ||2256 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It should have 2. || + ||2255 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It should have 2. || ||2998 ||Unexpected internal error. || ||2999 ||Unhandled internal error. ||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by nirajrai
The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=155&rev2=156 -- ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. || ||2254 ||Currently merged cogroup is not supported after blocking operators. || ||2255 ||Base loader in Cogroup must implement CollectableLoadFunc. || + ||2256 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It should have 2. || ||2998 ||Unexpected internal error. || ||2999 ||Unhandled internal error. ||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by nirajrai
The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=154&rev2=155 -- ||2247 ||Cannot determine skewed join schema || ||2248 ||twoLevelAccessRequired==true is not supported with" +"and isSubNameMatch==true. || ||2249 ||While using 'collected' on group; data must be loaded via loader implementing CollectableLoadFunc. || + ||2250 ||Blocking operators are not allowed before Collected Group. Consider dropping using 'collected'. || + ||2251 ||Merge Cogroup work on two or more relations. To use map-side group-by on single relation, use 'collected' qualifier. || + ||2252 ||Base loader in Cogroup must implement CollectableLoadFunc. || + ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. || + ||2254 ||Currently merged cogroup is not supported after blocking operators. || + ||2255 ||Base loader in Cogroup must implement CollectableLoadFunc. || ||2998 ||Unexpected internal error. || ||2999 ||Unhandled internal error. ||
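Errors 2249-2251 above all concern the 'collected' qualifier, which requests a map-side group-by on a single relation. A minimal Pig Latin sketch of the usage these messages guard (the file name and MyCollectableLoader are hypothetical; any real loader used this way must implement CollectableLoadFunc, or error 2249 is raised):

```pig
-- Map-side group-by on one relation (cf. error 2251); the loader must
-- guarantee all instances of a key land in the same split, which is what
-- CollectableLoadFunc promises (cf. error 2249).
A = LOAD 'sorted_input' USING MyCollectableLoader();
B = GROUP A BY $0 USING 'collected';
DUMP B;
```

No blocking operator (e.g. an ORDER or a regular GROUP) may precede the collected group, which is what error 2250 reports.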
[Pig Wiki] Update of "PigJournal" by AlanGates
The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal?action=diff&rev1=11&rev2=12 -- || Feature || JIRA || Comments || || Boolean Type|| [[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || || || Make Illustrate Work|| [[https://issues.apache.org/jira/browse/PIG-502|PIG-502]], [[https://issues.apache.org/jira/browse/PIG-534|PIG-534]], [[https://issues.apache.org/jira/browse/PIG-903|PIG-903]], [[https://issues.apache.org/jira/browse/PIG-1066|PIG-1066]] || || - || Better Parser and Scanner Technology|| many || || + || Better Parser and Scanner Technology|| [[https://issues.apache.org/jira/browse/PIG-1618|PIG-1618]] || || || Clarify Pig Latin Semantics || many || || || Extending Pig to Include Branching, Looping, and Functions || TuringCompletePig || ||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by nirajrai
The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=153&rev2=154 -- ||2245 ||Cannot get schema from loadFunc || ||2246 ||Error merging schema || ||2247 ||Cannot determine skewed join schema || + ||2248 ||twoLevelAccessRequired==true is not supported with" +"and isSubNameMatch==true. || - ||2248 ||While using 'collected' on group; data must be loaded via loader implementing CollectableLoadFunc. || + ||2249 ||While using 'collected' on group; data must be loaded via loader implementing CollectableLoadFunc. || - ||2998 ||Unexpected internal error. || ||2999 ||Unhandled internal error. ||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by nirajrai
The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=152&rev2=153 -- ||2245 ||Cannot get schema from loadFunc || ||2246 ||Error merging schema || ||2247 ||Cannot determine skewed join schema || + ||2248 ||While using 'collected' on group; data must be loaded via loader implementing CollectableLoadFunc. || ||2998 ||Unexpected internal error. ||
[Pig Wiki] Update of "PigJournal" by AlanGates
The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal?action=diff&rev1=10&rev2=11 -- || Make configuration available to UDFs || 0.6 || || || Load Store Redesign || 0.7 || || || Pig Mix 2.0 || not yet released || || + || Rewrite Logical Optimizer|| not yet released || || + || Cleanup of javadocs || not yet released || || + || UDFs in scripting languages || not yet released || || + || Ability to specify a custom partitioner || not yet released || || + || Pig usage stats collection || not yet released || || + || Make Pig available via Maven || not yet released || || + || Standard UDFs Pig Should Provide || not yet released || || + || Add Scalars To Pig Latin || not yet released || || + || Run Map Reduce Jobs Directly From Pig|| not yet released || || == Work in Progress == This covers work that is currently being done. For each entry the main JIRA for the work is referenced. 
|| Feature || JIRA || Comments || || Boolean Type|| [[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || || + || Make Illustrate Work|| [[https://issues.apache.org/jira/browse/PIG-502|PIG-502]], [[https://issues.apache.org/jira/browse/PIG-534|PIG-534]], [[https://issues.apache.org/jira/browse/PIG-903|PIG-903]], [[https://issues.apache.org/jira/browse/PIG-1066|PIG-1066]] || || + || Better Parser and Scanner Technology|| many || || + || Clarify Pig Latin Semantics || many || || + || Extending Pig to Include Branching, Looping, and Functions || TuringCompletePig || || + - || Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || || - || Cleanup of javadocs || [[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || || - || UDFs in scripting languages || [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]] || || - || Ability to specify a custom partitioner || [[https://issues.apache.org/jira/browse/PIG-282|PIG-282]] || || - || Pig usage stats collection || [[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], [[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], [[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], [[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || || - || Make Pig available via Maven|| [[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || || - || Standard UDFs Pig Should Provide|| [[https://issues.apache.org/jira/browse/PIG-1405|PIG-1405]] || || - || Add Scalars To Pig Latin|| [[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]] || || - || Run Map Reduce Jobs Directly From Pig || [[https://issues.apache.org/jira/browse/PIG-506|PIG-506]] || || == Proposed Future Work == Work that the Pig project proposes to do in the future is further broken into three categories: @@ -74, +79 @@ Within each subsection order is alphabetical and does not imply priority. === Agreed Work, Agreed Approach === - Make Illustrate Work - Illustrate has become Pig's ignored step-child. 
Users find it very useful, but developers have not kept it up to date with new features (e.g. it does not work with merge join). Also, the way it is currently - implemented it has code in many of Pig's physical operators. This means the code is more complex and burdened with branches, making it harder to maintain. It also means that when doing new development it is - easy to forget about illustrate. Illustrate needs to be redesigned in such a way that it does not add complexity to physical operators and that as new operators are developed it is necessary and easy to add - illustrate functionality to them. Tests for illustrate also need to be added to the test suite so that it is not broken unintentionally. - - '''Category:''' Usability - - '''Dependency:''' - - '''References:''' - - '''Estimated Development Effort:''' medium - Combiner Not Used with Limit or Filter Pig Scripts that have a foreach with a nested limit or filter do not use the combiner even when they could. Not all filters can use the combiner, but in some cases they can. I think all limits could at least apply the limit i
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=151&rev2=152 -- ||2242 ||TypeCastInserter invoked with an invalid operator || ||2243 ||Attempt to remove operator that is still connected to other operators || ||2244 ||Hadoop does not return any error message || + ||2245 ||Cannot get schema from loadFunc || + ||2246 ||Error merging schema || + ||2247 ||Cannot determine skewed join schema || ||2998 ||Unexpected internal error. ||
[Pig Wiki] Update of "Howl/HowlCliFuncSpec" by Ashutosh Chauhan
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Howl/HowlCliFuncSpec" page has been changed by AshutoshChauhan. http://wiki.apache.org/pig/Howl/HowlCliFuncSpec -- New page: == Howl CLI Functional Specification == This wiki page outlines what is supported from the Howl CLI. In http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL Hive's DDL spec outlines various allowed operations. This wiki will talk about which of those are allowed and not allowed from the Howl CLI and, among those which are allowed, how they differ from Hive's CLI. CREATE TABLE * STORED AS clause which is currently defined as: [STORED AS file_format] file_format: . : SEQUENCEFILE | TEXTFILE | RCFILE | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname will be modified to support [STORED AS file_format] file_format: . : RCFILE | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname INPUTDRIVER input_driver_classname OUTPUTDRIVER output_driver_classname * CREATE TABLE command must contain a "STORED AS" clause; if it doesn't, it will result in an exception "Operation not supported. Create table doesn't contain STORED AS clause. Please provide one." * If the table is partitioned, then the user provides partition columns. These columns can only be of type String. * CLUSTERED BY clause is not supported. If provided, it will result in an exception "Operation not supported. CLUSTERED BY is not supported." CREATE TABLE AS SELECT * Not Supported. Throws an exception with message "Operation Not Supported". CREATE TABLE LIKE * Allowed only if the existing table was created using Howl. Else, throws an exception "Operation not supported. Table table name should have been created through Howl. Seems like its not." DROP TABLE * Behavior same as of Hive. ALTER TABLE ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ... . partition_spec: . 
: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...) * Allowed only if TABLE table_name was created using Howl. Else, throws an exception "Operation not supported. Partitions can be added only to tables through Howl." Alter Table File Format ALTER TABLE table_name SET FILEFORMAT file_format Here file_format must be the same as the one described above in CREATE TABLE. Else, throws an exception "Operation not supported. Not a valid file format." * CLUSTERED BY clause is not supported. If provided, it will result in an exception "Operation not supported. CLUSTERED BY is not supported." Change Column Name/Type/Position/Comment ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name] * Not supported. Throws an exception with message "Operation Not Supported". Add/Replace Columns ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...) * ADD Columns is allowed. Behavior same as of Hive. * Replace column is not supported. Throws an exception with message "Operation Not Supported". Alter Table Touch ALTER TABLE table_name TOUCH; ALTER TABLE table_name TOUCH PARTITION partition_spec; * Not Supported. Throws an exception with message "Operation Not Supported". = CREATE VIEW = * Not Supported. Throws an exception with message "Operation Not Supported". = DROP VIEW = * Not Supported. Throws an exception with message "Operation Not Supported". = ALTER VIEW = * Not Supported. Throws an exception with message "Operation Not Supported". = SHOW TABLES = * Behavior same as of Hive. = SHOW PARTITIONS = * Behavior same as of Hive. = SHOW FUNCTIONS = * Not Supported. Throws an exception with message "Operation Not Supported". = DESCRIBE = * Behavior same as of Hive. Any other commands apart from the ones listed above will result in an exception with message "Operation Not Supported". 
User Interface for Howl It will support the following four command line options: * -g : Usage is -g mygroup This indicates to Howl that the table to be created must have group "mygroup" * -p : Usage is -p rwxr-xr-x This indicates to Howl that the table to be created must have permissions "rwxr-xr-x" * -f : Usage is -f myscript.howl This indicates to Howl that myscript.howl is a file which contains DDL commands it needs to execute. * -e : Usage is -e 'create table mytable(a int);' This indicates to Howl to treat the following string as a DDL command and execute it. Notes: * -g and -p options are not mandatory. If not supplied and the command contains a CREATE TABLE which is successful, the user will be told with what permissions and in which group her table was created. This will be printed on stdout. Message will read as "Table tablename is created
[Pig Wiki] Update of "HowlSecurity" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowlSecurity" page has been changed by AlanGates. http://wiki.apache.org/pig/HowlSecurity -- New page: This page will outline design of Howl Security. == Related Hive Work == [[https://issues.apache.org/jira/browse/HIVE-78|Jira for authorization support in Hive]] == Authorization == Initially the thought is that Howl will have authorization implemented at some level to provide security. The initial implementation will be based on HDFS directory permissions. This may be enhanced/replaced by a role based model in a later release. === Permissions === The initial idea for authorization in Howl is to use the HDFS permissions to authorize metadata operations. To be able to do this, we would like to extend createTable() to add the ability to record a different group from the user's primary group and to record the complete Unix permissions on the table directory. Also, we would like to have a way for partition directories to inherit permissions and group information based on the table directory. To keep the metastore backward compatible for use with Hive, I propose having conf variables to achieve these objectives: * `table.group.name` : value will indicate the name of the Unix group for the table directory. This will be used by `createTable()` to perform a chgrp to the value provided. This property will provide the user the ability to choose from one of the many Unix groups he is part of to associate with the table. * `table.permissions` : value will be of the form `rwxrwxrwx` to indicate read-write-execute permissions on the table directory. This will be used by `createTable()` to perform a chmod to the value provided. This will let the user decide what permissions he wants on the table. * `partitions.inherit.permissions` : a value of true will indicate that partitions inherit the group name and permissions of the table level directory. 
This will be used by `addPartition()` to perform a chgrp and chmod to the values as on the table directory. Conf properties are preferable over API changes since the complete authorization design for Hive is not finalized yet. These properties can be deprecated/removed when that is in place. These properties would also be useful to some installations of vanilla Hive since at least DFS level authorization can now be achieved by Hive without the user having to manually perform chgrp and chmod operations on DFS. === Reading data (Select)/Writing data (Insert) === This will simply be governed by the dfs permissions at the time of the read and will result in runtime errors if the user does not have permissions. === Create table === Internal/External table without location specified If the user has permissions to the directory pointed to by `hive.metastore.warehouse.dir` then he can create the table. Internal/External table with location specified If the user has permissions to the location specified then he can create the table. === Drop Table === A user can drop a table (internal or external) only if he has write permissions to the table directory. A user could have write permission either by virtue of being the owner of the table or through the group he belongs to. So if the permissions on the table directory allow him to write to it, he can drop the table. === Partition permissions === Partition directories will inherit the permissions/(owner,group) of the table directory. === Alter table === A user can "alter" a table if he has write permissions on the table directory. 
So any of the following alter table commands are allowed only if the user has write permissions on the table directory: * `ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...` * `ALTER TABLE table_name DROP partition_spec, partition_spec,...` * `ALTER TABLE table_name RENAME TO new_table_name` * `ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]` * `ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)` * `ALTER TABLE table_name SET TBLPROPERTIES table_properties` * `ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]` * `ALTER TABLE table_name SET SERDEPROPERTIES serde_properties` * `ALTER TABLE table_name SET FILEFORMAT file_format` * `ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)] INTO num_buckets BUCKETS` * `ALTER TABLE table_name TOUCH;` * `ALTER TABLE table_name TOUCH PARTITION partition_spec;` === Show tables === Since the top level warehouse dir will have read/write permissions for everyone, show tables will show all tables to all users. === Show Table/Partitions Extended === A user can issue "show table/parti
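The three conf variables proposed earlier (`table.group.name`, `table.permissions`, `partitions.inherit.permissions`) might be set like any other Hive configuration property. The fragment below is only a sketch: the property names come from this proposal, while the placement in a site-style config file and the sample values are assumptions.

{{{
<!-- Hypothetical configuration fragment; property names from the proposal above -->
<property>
  <name>table.group.name</name>
  <value>howlusers</value>      <!-- group used for chgrp on the table directory -->
</property>
<property>
  <name>table.permissions</name>
  <value>rwxrwxr-x</value>      <!-- mode used for chmod on the table directory -->
</property>
<property>
  <name>partitions.inherit.permissions</name>
  <value>true</value>           <!-- addPartition() copies table dir group/mode -->
</property>
}}}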
[Pig Wiki] Update of "PigJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal?action=diff&rev1=9&rev2=10 -- '''Estimated Development Effort:''' medium === Agreed Work, Unknown Approach === + Support Append in Pig + Appending to HDFS files is supported in Hadoop 0.21. None of Pig's standard store functions support append. We need to decide if append is added to + the language itself (is there an APPEND modifier to the STORE command?) or if each store function needs to decide how to indicate or allow appending on its own. !PigStorage + should support append as users are likely to want it. + + '''Category:''' New Functionality + + '''Dependency:''' Hadoop 0.21 or later + + '''References:''' + + '''Estimated Development Effort:''' small + + Move Piggybank out of Contrib Currently Pig hosts Piggybank (our repository of user contributed UDFs) as part of our contrib. This is not ideal for a couple of reasons. One, it means those who wish to share their UDFs have to go through the rigor of the patch process. Two, since contrib is tied to releases of the main product, there is no way for users
[Pig Wiki] Update of "HowlJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowlJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/HowlJournal?action=diff&rev1=1&rev2=2 -- '''Authorization'''<> The initial proposal is to use HDFS permissions to determine whether Howl operations can be executed. For example, it would not be possible to drop a table unless the user had write permissions on the directory holding that table. We need to determine how to extend this model to data not stored in HDFS (e.g. Hbase) and objects that do not exist in HDFS (e.g. views). See HowlSecurity for more information. + '''Dynamic Partitioning'''<> Currently Howl can only store data into one partition at a time. It needs to support + spraying to multiple partitions in one write. + '''Non-partition Predicate Pushdown'''<> Since in the future storage formats (such as RCFile) should support predicate pushdown, Howl needs to be able to push predicates into the storage layer when appropriate. '''Notification'''<> Add ability for systems such as work flow to be notified when new data arrives in Howl. This will be designed around a few systems receiving notification, not large numbers of users receiving notifications (i.e. we will not be building a general purpose publish/subscribe system). One solution to this might be an RSS feed or similar simple service.
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by nirajrai
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=149&rev2=150 -- ||2241||UID is not found in the schema || ||2242||TypeCastInserter invoked with an invalid operator|| ||2243||Attempt to remove operator that is still connected to other operators|| + ||2244||Hadoop does not return any error message|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "NativeMapReduce" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by ThejasNair. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=11&rev2=12 -- A = load 'WordcountInput.txt'; B = MAPREDUCE wordcount.jar Store A into 'inputDir' Load 'outputDir' as (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; }}} + + Note that the files specified as input and output locations in the MAPREDUCE statement will NOT be deleted by Pig automatically. The user has to delete them manually. == Comparison with similar features == === Pig Streaming ===
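Since Pig does not clean up these locations, a rerun of the same script has to remove them first. A minimal sketch, assuming the `rmf` shell command is usable inside a Pig script and reusing the wordcount example above:

{{{
-- remove stale locations from an earlier run before the native job starts
rmf inputDir;
rmf outputDir;
A = load 'WordcountInput.txt';
B = MAPREDUCE wordcount.jar Store A into 'inputDir' Load 'outputDir' as (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
}}}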
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=10&rev2=11 -- == Introduction == Pig needs to provide a way to natively run map reduce jobs written in Java. There are some advantages of this- - 1. The advantages of the ''mapreduce'' keyword are that the user need not be worried about coordination between the jobs, pig will take care of it. + 1. The advantages of the ''mapreduce'' statement are that the user need not worry about coordination between the jobs; Pig will take care of it. 2. Users can make use of existing Java applications without being Java programmers. == Syntax == @@ -25, +25 @@ params are extra parameters required for native mapreduce job. - mymr.jar is any mapreduce jar file which can be run through '''"hadoop -jar mymr.jar params"''' command. Thus, the contract for ''inputLocation'' and ''outputLocation'' is typically managed through ''params''. + mymr.jar is any mapreduce jar file which can be run through the '''"hadoop jar mymr.jar params"''' command. Thus, the contract for ''inputLocation'' and ''outputLocation'' is typically managed through ''params''. For Example, to run wordcount mapreduce program from Pig, we write {{{
[Pig Wiki] Update of "Howl" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Howl" page has been changed by AlanGates. http://wiki.apache.org/pig/Howl?action=diff&rev1=2&rev2=3 -- be changed. And old data will not need to be converted. If there is a monthly Pig Latin script that rolls up daily raw events, Howl will handle the fact that some of the data is stored in text and some in RCFile and present a single stream to Pig for processing. + == Join Us == + Currently Howl's code is hosted at github: http://github.com/yahoo/howl + + Howl issues are discussed on howl...@yahoogroups.com. You can join it by sending mail to howldev-subscr...@yahoogroups.com +
[Pig Wiki] Update of "HowlJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowlJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/HowlJournal -- New page: = Howl Journal = This document tracks the development of Howl. It summarizes work that has been done in previous releases, what is currently being worked on, and proposals for future work in Howl. == Completed Work == || Feature|| Available in || Comments || || Read/write of data from Map Reduce || Not yet released || || || Read/write of data from Pig|| Not yet released || || || Read from Hive || Not yet released || || || Support pushdown of columns to be projected into storage format|| Not yet released || || || Support for RCFile storage || Not yet released || || == Work in Progress == || Feature || Description || || Add a CLI || This will allow users to use Howl without installing all of Hive. The syntax will match that of Hive's DDL. || || Partition pruning || Currently, when asked to return information about a table Hive's metastore returns all partitions in the table. This has a couple of issues. One, for tables with large numbers of partitions it means the metadata operation of fetching information about the table is very expensive. Two, it makes more sense to have the partition pruning logic in one place (Howl) rather than in Hive, Pig, and MR. || == Proposed Work == '''Authentication'''<> Integrate Howl with security work done on Hadoop so that users can be properly authenticated. '''Authorization'''<> The initial proposal is to use HDFS permissions to determine whether Howl operations can be executed. For example, it would not be possible to drop a table unless the user had write permissions on the directory holding that table. We need to determine how to extend this model to data not stored in HDFS (e.g. Hbase) and objects that do not exist in HDFS (e.g. views). See HowlSecurity for more information. 
'''Non-partition Predicate Pushdown'''<> Since in the future storage formats (such as RCFile) should support predicate pushdown, Howl needs to be able to push predicates into the storage layer when appropriate. '''Notification'''<> Add ability for systems such as work flow to be notified when new data arrives in Howl. This will be designed around a few systems receiving notification, not large numbers of users receiving notifications (i.e. we will not be building a general purpose publish/subscribe system). One solution to this might be an RSS feed or similar simple service. '''Schema Evolution'''<> Currently schema evolution in Hive is limited to adding columns at the end of the non-partition keys columns. It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or making it so that new partitions for a table no longer contain a given column. '''Support data read across partitions with different storage formats'''<> This work is done except that only one storage format is currently supported. '''Support for more file formats'''<> Additional file formats such as sequence file, text, etc. need to be added. '''Utility APIs'''<> Grid managers will want to build tools that use Howl to help manage their grids. For example, one might build a tool to do replication between two grids. Such tools will want to use Howl's metadata. Howl needs to provide an appropriate API for these types of tools.
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=9&rev2=10 -- - = Under Construction = #format wiki #language en @@ -53, +52 @@ }}} === Pig Plans === - Logical Plan- Logical Plan creates a LONative operator with an internal plan that consists of a store and a load operator. The store operator cannot be attached to X at this level as it would start storing X at inputLocation for every plan that includes X, which is not intended. Although we can LOLoad operator for Y at this point, we delay this to physical plan and track this with LONative operator. Since Y has dataflow dependency on X, we make a connection between operators corresponding to these aliased at logical plan. + Logical Plan- Logical Plan creates a LONative operator with an internal plan that consists of a store and a load operator. The store operator cannot be attached to X at this level as it would start storing X at inputLocation for every plan that includes X, which is not intended. Although we could add an LOLoad operator for Y at this point, we delay this until the mapreduce plan and track it with the LONative operator. Since Y has a dataflow dependency on X, we make a connection between the operators corresponding to these aliases at the logical plan level. {{{ X = ... ; @@ -68, +67 @@ | ... }}} - TypeCastInserter- + + TypeCastInserter- This is a mandatory optimizer that adds a foreach and a cast operator after a load so that a field loaded with a declared schema can be converted to the required type. In the absence of this, we fail with a cast exception after the load is completed. Currently, we apply this optimizer on LOLoad and LOStream as they can be loaded "AS schema". As the mapreduce clause corresponds to a load operation, this optimization is also applicable to the LONative operator. 
+ A test case for this scenario is- + {{{ + B = mapreduce 'mapreduce.jar' Store A into 'input' Load 'output' as (name:chararray, count:int) `wordcount input output`; + C = foreach B generate count+1; + }}} Physical Plan- Logical plan is visited to convert internal plan of load store combination into corresponding physical plan operators and connections are maintained as per the logical plan. {{{ @@ -85, +90 @@ ... }}} - MapReduce Plan- While compiling the mapreduce plan, with MRCompiler, we introduce + MapReduce Plan- While compiling the mapreduce plan, with MRCompiler, we introduce a new MapReduceOper, NativeMapReduceOper, that tracks the presence of a native mapreduce job inside the plan. It also holds the required parameters and jar name. {{{ X = ... ; | | - ||--- (POStore) Store X into 'inputLocation' + |--- (POStore) Store X into 'inputLocation' + + --- MR boundary - - Y = MapReduce ... ; | + Y = MapReduce ... ; - (PONative) -- innnerPlan ---| + (NativeMapReduceOper) - mymr.jar | + mymr.jar - params |--- (POLoad) Load 'outputLocation' + params + --- MR boundary - + Y = (POLoad) Load 'outputLocation' | | ... }}} - Inside the JobControlCompiler's compile method if we find the native mapreduce operator we run the org.apache.hadoop.util.RunJar's Main method with the specified parameters. + Inside the JobControlCompiler's compile method, if we find the native mapreduce operator, we run org.apache.hadoop.util.RunJar's main method with the specified parameters. We also make sure all the dependencies of the job are obeyed for the native jobs. === Security Manager === - hadoop jar command is equivalent to invoking org.apache.hadoop.util.RunJar's main function with required arguments. RunJar internally can invoke several levels of driver classes before executing the hadoop job (for example- hadoop-example.jar). With the + hadoop jar command is equivalent to invoking org.apache.hadoop.util.RunJar's main function with required arguments. 
RunJar internally can invoke several levels of driver classes before executing the hadoop job (for example- hadoop-example.jar). To detect failure or success of the job we need to detect the innermost error value and return it to Pig. To achieve this we install our own RunJarSecurityManager that delegates the security management to the current security manager and captures the innermost exit code. === Pig Stats === + Pig Stats are populated by treating the native job as a single instance of a mapreduce job, and progress is also reported under the same assumption. As the native job is not under the control of pig, except for the exit code, it is
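The exit-code capture described above can be illustrated in miniature outside the JVM. In Python, `sys.exit()` raises `SystemExit`, which a wrapper can intercept to recover the innermost exit code instead of letting the process die. This is only an analogue of the RunJarSecurityManager pattern, not Pig's actual Java implementation:

```python
import sys

def run_capturing_exit(fn, *args):
    """Run fn, intercepting sys.exit() so the caller sees the exit code
    instead of the process terminating. Analogous to a SecurityManager
    whose checkExit() records the status and raises instead of exiting."""
    try:
        fn(*args)
    except SystemExit as e:
        # e.code is None for a bare sys.exit(); treat that as success (0)
        return e.code if e.code is not None else 0
    return 0

def failing_driver():
    # stand-in for a native MR driver class that exits with a failure code
    sys.exit(2)
```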
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=8&rev2=9 -- + = Under Construction = #format wiki #language en @@ -18, +19 @@ To support native mapreduce job pig will support following syntax- {{{ X = ... ; - Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc AS schema [params, ... ]; + Y = MAPREDUCE 'mymr.jar' [('other.jar', ...)] STORE X INTO 'inputLocation' USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `]; }}} - This stores '''X''' into the '''storeLocation''' using '''storeFunc''', which is then used by native mapreduce to read its data. After we run mymr.jar's mapreduce, we load back the data from '''loadLocation''' into alias '''Y''' using '''loadFunc'''. + This stores '''X''' into the '''inputLocation''' using '''storeFunc''', which is then used by native mapreduce to read its data. After we run mymr.jar's mapreduce, we load back the data from '''outputLocation''' into alias '''Y''' using '''loadFunc''' as '''schema'''. params are extra parameters required for native mapreduce job. - '''mymr.jar is any mapreduce jar file which can be run through "hadoop -jar mymr.jar params" command.''' + mymr.jar is any mapreduce jar file which can be run through '''"hadoop -jar mymr.jar params"''' command. Thus, the contract for ''inputLocation'' and ''outputLocation'' is typically managed through ''params''. 
For Example, to run wordcount mapreduce program from Pig, we write {{{ A = load 'WordcountInput.txt'; - B = MAPREDUCE wordcount.jar Store A into 'inputDir' Load 'outputDir' as (word:chararray, count: int) org.myorg.WordCount inputDir outputDir; + B = MAPREDUCE wordcount.jar Store A into 'inputDir' Load 'outputDir' as (word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`; }}} == Comparison with similar features == @@ -45, +46 @@ With native job support, pig can support native map reduce jobs written in Java that can convert a data set into a different data set after applying custom map reduce functions of any complexity. == Implementation Details == + {{{ X = ... ; - Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ]; + Y = MAPREDUCE 'mymr.jar' [('other.jar', ...)] STORE X INTO 'inputLocation' USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `]; }}} - Logical Plan- Logical Plan creates a LONative operator with an internal plan that consists of a store and a load operator. The store operator cannot be attached to X at this level as it would start storing X at storeLocation for every plan that includes X which is not intended. Although we can LOLoad operator for Y at this point, we delay this to physical plan and track this with LONative operator. Also, since Y has dependency on X, we add plan of Y whenever we see plan for X in ''registerQuery''. - Physical Plan- Physical Plan adds the internal store to the physical plan and connects it to X and also adds the load to the plan with alias Y. Also, it creates a dependency between map reduce job for X and native map reduce job, and also between native map reduce job and plan having Y (which is a POLoad operator). We also create a MapReduceOper (customized) for the native map reduce job. 
+ === Pig Plans === + Logical Plan- Logical Plan creates a LONative operator with an internal plan that consists of a store and a load operator. The store operator cannot be attached to X at this level as it would start storing X at inputLocation for every plan that includes X, which is not intended. Although we could add an LOLoad operator for Y at this point, we delay this to the physical plan and track it with the LONative operator. Since Y has a dataflow dependency on X, we make a connection between the operators corresponding to these aliases at the logical plan level. - MapReduce Plan- Inside the JobControlCompiler's compile method if we find the native mapreduce operator we can create a thread and run the Main method of native map reduce job with the specified parameters. Alternatively, we can call into native map reduce job's getJobConf method to get the job conf for the native job, then we can add pig specific parameters to this job and then add the job inside pig's jobcontrol. + {{{ + X = ... ; + | + | + ||--- (LOStore) Store X into 'inputLocation' + Y = MapReduce ... ; | + (LONative) -- innerPlan ---| + mymr.jar | + params |--- (LOLoad) Load 'outputLocation' + | + | +
[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/UDFsUsingScriptingLanguages?action=diff&rev1=3&rev2=4 -- '''schemaFunction''' defines delegate function and is not registered to pig. When no decorator is specified, pig assumes the output datatype as bytearray and converts the output generated by script function to bytearray. This is consistent with pig's behavior in case of Java UDFs. + ''Sample Schema String'' - y:{t:(word:chararray,num:long)}, variable names inside schema string are not used anywhere, they are used just to make syntax identifiable to the parser. == Inline Scripts == @@ -92, +93 @@ def percent(num, total): return num * 100 / total - #CommaFormat- + + # String Functions # + + #commaFormat- format a number with commas, 12345-> 12,345 @outputSchema("t:(numformat:chararray)") def commaFormat(num): return '{:,}'.format(num) - - # String Functions # - - + #concatMultiple- concat multiple words + @outputSchema("t:(numformat:chararray)") + def concatMult4(word1, word2, word3, word4): + return word1+word2+word3+word4 ### # Data Type Functions # ### + #collectBag- collect elements of a bag into another bag + #This is a useful UDF after a group operation + @outputSchema("y:{t:(len:long,word:chararray)}") + def collectBag(bag): + outBag = [] + for word in bag: + tup=(len(bag), word[1]) + outBag.append(tup) + return outBag + # Few comments- + # pig mandates that a bag should be a bag of tuples; python UDFs should follow this pattern. + # tuples in python are immutable; appending to a tuple is not possible. }}} - == Performance == === Jython ===
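Stripped of the Pig-specific decorators, the sample UDFs above are ordinary Python functions and can be sanity-checked outside Pig. The sketch below follows the sample functions (the `@outputSchema` decorators are omitted because they take effect only once the script is registered with Pig):

```python
# Plain-Python versions of the sample UDFs above; the Pig @outputSchema
# decorators are omitted because they matter only inside Pig.

def percent(num, total):
    # Percentage of num relative to total.
    return num * 100 / total

def commaFormat(num):
    # Format a number with thousands separators: 12345 -> '12,345'
    return '{:,}'.format(num)

def concatMult4(word1, word2, word3, word4):
    # Concatenate four words into one string.
    return word1 + word2 + word3 + word4

def collectBag(bag):
    # Collect elements of a bag (a list of tuples in the Python view) into
    # another bag, pairing the bag size with each element's second field.
    outBag = []
    for word in bag:
        outBag.append((len(bag), word[1]))
    return outBag

print(commaFormat(1234567))                      # 1,234,567
print(concatMult4('a', 'b', 'c', 'd'))           # abcd
print(collectBag([('x', 'foo'), ('y', 'bar')]))  # [(2, 'foo'), (2, 'bar')]
```

Inside Pig each bag element arrives as a tuple, which is why collectBag indexes `word[1]` rather than using the element directly.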
[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/UDFsUsingScriptingLanguages?action=diff&rev1=2&rev2=3 -- {{{ Register 'test.py' using jython as myfuncs; }}} - This uses org.apache.pig.scripting.jython.JythonScriptEngine to interpret the python script. Users can use custom script engines to support multiple languages and ways to interpret them. Currently, pig identifies jython as a keyword and ships the required scriptengine (jython) to interpret it. + This uses org.apache.pig.scripting.jython.JythonScriptEngine to interpret the python script. Users can develop and use custom script engines to support multiple programming languages and ways to interpret them. Currently, pig identifies jython as a keyword and ships the required scriptengine (jython) to interpret it. The following syntax is also supported - {{{ @@ -52, +52 @@ }}} Registering test.py with pig under the myfuncs namespace makes the functions myfuncs.helloworld(), myfuncs.complex(2), and myfuncs.square(2.0) available as UDFs. These UDFs can be used with {{{ - b = foreach a generate myfuncs.helloworld, myfuncs.square(3); + b = foreach a generate myfuncs.helloworld(), myfuncs.square(3); }}} === Decorators and Schemas === - For annotating python script so that pig can identify their return types, we use decorators to define output schema for a script UDF. + For annotating python scripts so that pig can identify their return types, we use python decorators to define the output schema for a script UDF. '''outputSchema''' defines schema for a script udf in a format that pig understands and is able to parse. '''outputFunctionSchema''' defines a script delegate function that defines schema for this function depending upon the input type. This is needed for functions that can accept generic types and perform generic operations on these types. 
A simple example is ''square'' which can accept multiple types. SchemaFunction for this type is a simple identity function (same schema as input). '''schemaFunction''' defines a delegate function and is not registered to pig. - - When no decorator is specified, pig assumes the output datatype as bytearray and converts the output generated by script function to bytearray. This is consistent with pig's behavior in other cases. + When no decorator is specified, pig assumes the output datatype as bytearray and converts the output generated by script function to bytearray. This is consistent with pig's behavior in case of Java UDFs. - - ''Sample Schema String'' - y:{t:(word:chararray,num:long)}, variable names are not used anywhere they are just to make syntax consistent. + ''Sample Schema String'' - y:{t:(word:chararray,num:long)}, variable names inside schema string are not used anywhere, they are used just to make syntax identifiable to the parser. == Inline Scripts == + As of today, Pig doesn't support UDFs using inline scripts. This feature is being tracked at [[#ref4|PIG-1471]]. + + == Sample Script UDFs == + Simple tasks like string manipulation, mathematical computation, and reorganizing data types can easily be done using python scripts without having to develop long and complex UDFs in Java. The overall overhead of using a scripting language is much lower, and the development cost is almost negligible. Following are a few examples of UDFs developed in python that can be used with Pig. 
+ {{{ + mySampleLib.py + - + #!/usr/bin/python + + ## + # Math functions # + ## + #Square - Square of a number of any data type + @outputSchemaFunction("squareSchema") + def square(num): + return ((num)*(num)) + @schemaFunction("squareSchema") + def squareSchema(input): + return input + + #Percent- Percentage + @outputSchema("t:(percent:double)") + def percent(num, total): + return num * 100 / total + + #CommaFormat- + @outputSchema("t:(numformat:chararray)") + def commaFormat(num): + return '{:,}'.format(num) + + + # String Functions # + + + + ### + # Data Type Functions # + ### + + + }}} == Performance == === Jython === @@ -78, +117 @@ 1. <> PIG-928, "UDFs in scripting languages", https://issues.apache.org/jira/browse/PIG-928 2. <> Jython, "The jython project", http://www.jython.org/ 3. <> Jruby, "100% pure-java implementation of ruby programming language", http://jruby.org/ + 4. <> PIG-1471, "inline UDFs in scripting languages", https://issues.apache.org/jira/browse/PIG-1471
[Pig Wiki] Update of "PigJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal?action=diff&rev1=8&rev2=9 -- '''Estimated Development Effort:''' medium === Agreed Work, Unknown Approach === + Move Piggybank out of Contrib + Currently Pig hosts Piggybank (our repository of user contributed UDFs) as part of our contrib. This is not ideal for a couple of reasons. One, it means those who + wish to share their UDFs have to go through the rigor of the patch process. Two, since contrib is tied to releases of the main product, there is no way for users + to share functions for older versions or quickly disseminate their new functions. If Piggybank were instead more similar to CPAN then users could upload their own + packages with little assistance from Pig committers and specify what versions of Pig the function is for. This could be done via a hosting site such as github. + + '''Category:''' Usability + + '''Dependency:''' + + '''References:''' + + '''Estimated Development Effort:''' small + + Clarify Pig Latin Semantics There are areas of Pig Latin semantics that are not clear or not consistent. Take for example, a script like:
[Pig Wiki] Update of "PigJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal?action=diff&rev1=7&rev2=8 -- project is still open to input on whether and when such work should be done. == Completed Work == - The following table contains a list of features that have been completed, as of Pig 0.6 + The following table contains a list of features that have been completed, as of Pig 0.7 || Feature || Available in Release || Comments || || Describe Schema || 0.1 || || @@ -34, +34 @@ || Outer join for default, fragment-replicate, skewed || 0.6 || || || Make configuration available to UDFs || 0.6 || || || Load Store Redesign || 0.7 || || - || Add Owl as contrib project || not yet released || || || Pig Mix 2.0 || not yet released || || == Work in Progress == This covers work that is currently being done. For each entry the main JIRA for the work is referenced. - || Feature || JIRA || Comments || + || Feature || JIRA || Comments || - || Boolean Type || [[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || || + || Boolean Type|| [[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || || - || Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || || + || Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || || - || Cleanup of javadocs || [[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || || + || Cleanup of javadocs || [[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || || - || UDFs in scripting languages || [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]] || || + || UDFs in scripting languages || [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]] || || - || Ability to specify a custom partitioner || [[https://issues.apache.org/jira/browse/PIG-282|PIG-282]] || || + || Ability to specify a custom partitioner || 
[[https://issues.apache.org/jira/browse/PIG-282|PIG-282]] || || - || Pig usage stats collection || [[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], [[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], [[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], [[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || || + || Pig usage stats collection || [[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], [[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], [[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], [[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || || - || Make Pig available via Maven || [[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || || + || Make Pig available via Maven|| [[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || || - + || Standard UDFs Pig Should Provide|| [[https://issues.apache.org/jira/browse/PIG-1405|PIG-1405]] || || + || Add Scalars To Pig Latin|| [[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]] || || + || Run Map Reduce Jobs Directly From Pig || [[https://issues.apache.org/jira/browse/PIG-506|PIG-506]] || || == Proposed Future Work == Work that the Pig project proposes to do in the future is further broken into three categories: @@ -73, +74 @@ Within each subsection order is alphabetical and does not imply priority. === Agreed Work, Agreed Approach === + Make Illustrate Work + Illustrate has become Pig's ignored step-child. Users find it very useful, but developers have not kept it up to date with new features (e.g. it does not work with merge join). Also, the way it is currently + implemented it has code in many of Pig's physical operators. This means the code is more complex and burdened with branches, making it harder to maintain. It also means that when doing new development it is + easy to forget about illustrate. 
Illustrate needs to be redesigned in such a way that it does not add complexity to physical operators and that as new operators are developed it is necessary and easy to add + illustrate functionality to them. Tests for illustrate also need to be added to th
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=148&rev2=149 -- ||2216||Cannot get field schema|| ||2217||Problem setFieldSchema|| ||2218||Invalid resource schema: bag schema must have tuple as its field|| + ||2219||Attempt to disconnect operators which are not connected|| + ||2220||Plan in inconsistent state, connected in fromEdges but not toEdges|| + ||2221||No more walkers to pop|| + ||2222||Expected LogicalExpressionVisitor to visit expression node|| + ||2223||Expected LogicalPlanVisitor to visit relational node|| + ||2224||Found LogicalExpressionPlan with more than one root|| + ||2225||Projection with nothing to reference|| + ||2226||Cannot find reference for ProjectExpression|| + ||2227||LogicalExpressionVisitor expects to visit expression plans|| + ||2228||Could not find a related project Expression for Dereference|| + ||2229||Couldn't find matching uid for project expression|| + ||2230||Cannot get column from project|| + ||2231||Unable to set index on newly created POLocalRearrange|| + ||2232||Cannot get schema|| + ||2233||Cannot get predecessor|| + ||2234||Cannot get group key schema|| + ||2235||Expected an ArrayList of Expression Plans|| + ||2236||User defined load function should implement the LoadFunc interface|| + ||2237||Unsupported operator in inner plan|| + ||2238||Expected list of expression plans|| + ||2239||Structure of schema change|| + ||2240||LogicalPlanVisitor can only visit logical plan|| + ||2241||UID is not found in the schema || + ||2242||TypeCastInserter invoked with an invalid operator|| + ||2243||Attempt to remove operator that is still connected to other operators|| ||2998||Unexpected internal error.|| ||2999||Unhandled internal error.|| ||3000||IOException caught while compiling POMergeJoin||
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by PradeepKamath
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigErrorHandlingFunctionalSpecification" page has been changed by PradeepKamath. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=147&rev2=148 -- ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| ||1113||Unable to describe schema for nested expression || ||1114||Unable to find schema for nested alias || + ||1115||Place holder for Howl related errors|| ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
FrontPage reverted to revision 148 on Pig Wiki
Dear wiki user, You have subscribed to a wiki page "Pig Wiki" for change notification. The page FrontPage has been reverted to revision 148 by daijy. The comment on this change is: remove spam. http://wiki.apache.org/pig/FrontPage?action=diff&rev1=149&rev2=150 -- * PigDeveloperCookbook * Road map * ProposedRoadMap (2007 document from Yahoo!) - * PigJournal (features currently being worked on, ideas for future [[http://www.essaybank.com|essay]] development) + * PigJournal (features currently being worked on, ideas for future development) * Specification Proposals * PigTypesFunctionalSpec * PigTypesDesign
[Pig Wiki] Update of "FrontPage" by SafiaYardley
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by SafiaYardley. http://wiki.apache.org/pig/FrontPage?action=diff&rev1=148&rev2=149 -- * PigDeveloperCookbook * Road map * ProposedRoadMap (2007 document from Yahoo!) - * PigJournal (features currently being worked on, ideas for future development) + * PigJournal (features currently being worked on, ideas for future [[http://www.essaybank.com|essay]] development) * Specification Proposals * PigTypesFunctionalSpec * PigTypesDesign
[Pig Wiki] Update of "FAQ" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FAQ" page has been changed by daijy. http://wiki.apache.org/pig/FAQ?action=diff&rev1=6&rev2=7 -- C = JOIN A by url, B by url PARALLEL 50. }}} - Even if you do not specify the parallel clause, the framework uses a default number of reducers, in the order of 0.9*(number of nodes allocated by user -1)*n where n is the number of maximum reduce slots, for running your M/R jobs which result from statements such as GROUP, COGROUP, JOIN, and ORDER BY. For example, when allocating 3 machines you get about 0.9*2*4 = 7 reducers for operating on your parallel jobs. + Besides the PARALLEL clause, you can also use the "set default_parallel" statement in a Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism to use. If none of these values is set, Pig will use only 1 reducer. (In Pig 0.8, the default number of reducers changes from 1 to a number calculated by a simple heuristic, to be more foolproof.) '''Q: Can I do a numerical comparison while filtering?'''
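To illustrate the "set default_parallel" option mentioned above, a sketch of a script using it might read (aliases and file names are illustrative):

{{{
set default_parallel 20;
A = load 'myinput.txt' as (url:chararray, hits:int);
B = group A by url;
C = foreach B generate group, SUM(A.hits);
store C into 'myoutput';
}}}

Here GROUP triggers a reduce phase, so its job would run with 20 reducers unless an explicit PARALLEL clause overrides it; purely map-side statements are unaffected by the default.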
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=6&rev2=7 -- == Introduction == Pig needs to provide a way to natively run map reduce jobs written in java language. There are some advantages of this- - 1. The advantages of the ''native'' keyword are that the user need not be worried about coordination between the jobs, pig will take care of it. + 1. The advantages of the ''mapreduce'' keyword are that the user need not worry about coordination between the jobs; pig will take care of it. 2. Users can make use of existing java applications without being java programmers. == Syntax == To support native mapreduce jobs, pig will support the following syntax- {{{ X = ... ; - Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ]; + Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ]; }}} This stores '''X''' into '''storeLocation''' using '''storeFunc''', which is then used by the native mapreduce job to read its data. After we run mymr.jar's mapreduce job, we load the data back from '''loadLocation''' into alias '''Y''' using '''loadFunc'''. - params are extra parameters required for native mapreduce job (TBD). + params are extra parameters required for the native mapreduce job. - mymr.jar is complaint with pig specification (see below). + '''mymr.jar is any mapreduce jar file which can be run through the "hadoop jar mymr.jar params" command.''' == Comparison with similar features == === Pig Streaming === @@ -38, +38 @@ With native job support, pig can support native map reduce jobs written in java language that can convert a data set into a different data set after applying custom map reduce functions of any complexity. 
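As a sketch of a complete script using the syntax above (the jar name, directories, and PigStorage choices are illustrative, not mandated by the proposal):

{{{
A = load 'input.txt';
B = MAPREDUCE ('mymr.jar') STORE A INTO 'nativeInput' USING PigStorage()
        LOAD 'nativeOutput' USING PigStorage();
store B into 'finalOutput';
}}}

Pig stores A at 'nativeInput', runs the native job from mymr.jar against that directory, and reads the job's output back from 'nativeOutput' as alias B.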
- == Native Mapreduce job specification == - Native Mapreduce job needs to conform to some specification defined by Pig. This is required because Pig specifies the input and output directory in the script for this job and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs. Pig also might need to provide some extra configuration like job name, input/output formats, parallelism to the native job. For communicating such parameters to the native job, it should be according to specification provided by Pig. - - Following are some of the approaches of achieving this- - 1. '''Ordered inputLoc/outputLoc parameters'''- This is simplistic approach wherein native programs follow up a convention so that their first and second parameters are treated as input and output respectively. Pig ''native'' command takes the parameters required by the native mapreduce job and passes it to native job as command line arguments. It is upto the native program to use these parameters for operations it performs. - Thus, only following lines of code are mandatory inside the native program. - {{{ - FileInputFormat.setInputPaths(conf, new Path(args[0])); - FileOutputFormat.setOutputPath(conf, new Path(args[1])); - }}} - 1.#2 '''getJobConf Function'''- Native jobs implement '''getJobConf''' method which returns ''org.apache.hadoop.mapred.JobConf'' object so that pig can construct a ''job'' and schedule that inside pigs ''jobcontrol'' job. This also provides a way to add more pig specific parameters to this job before it is submitted. Most of the current native hadoop program create JobConf's and run hadoop jobs with ''JobClient.runJob(conf)''. These applications need to change their code to a getJobConf function so that pig can hook into them to get the conf. This will also allow pig to set the input and output directory for the native job. 
- For example- - {{{ - public JobConf getJobConf() { - JobConf conf = new JobConf(WordCount.class); - conf.setJobName("wordcount"); - - conf.setOutputKeyClass(Text.class); - conf.setOutputValueClass(IntWritable.class); - - conf.setMapperClass(Map.class); - conf.setCombinerClass(Reduce.class); - conf.setReducerClass(Reduce.class); - - conf.setInputFormat(TextInputFormat.class); - conf.setOutputFormat(TextOutputFormat.class); - - FileInputFormat.setInputPaths(conf, new Path(args[0])); - FileOutputFormat.setOutputPath(conf, new Path(args[1])); - } - public static void main(String[] args) throws Exception { - JobClient.runJob(getJobConf()); - } - }}} == Implementation Details == {{{ X = ... ; - Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ]; + Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocat
[Pig Wiki] Update of "Conferences" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Conferences" page has been changed by AlanGates. http://wiki.apache.org/pig/Conferences?action=diff&rev1=3&rev2=4 -- || NoSQL Summer|| Summer 2010 || Multiple world wide || http://nosqlsummer.org/ || || || || Bay Area Hadoop User Group || Jul 21 2010 || Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || || || || Apache Asia Roadshow|| Aug 14-15 2010 || Shanghai, China || http://roadshowasia.52ac.com/openconf.php || || || + || Seattle Hadoop Day || Aug 14-15 2010 || Seattle, WA USA || http://hadoopday2010.eventbrite.com/|| || || || Open SQL Camp || Aug 21-22 2010 || St. Augustin, Germany || http://bit.ly/9X21wr|| || || || VLDB|| Sep 13-17 2010 || Singapore || http://www.vldb2010.org/|| || || || Surge || Sep 30 - Oct 1 2010 || Baltimore, MD USA || http://omniti.com/surge/2010|| || || || XLDB|| Oct 6 - 7 2010 || Menlo Park, CA USA|| http://www.xldb.org/4 || Alan Gates (Yahoo) || || + || Hadoop World NYC|| Oct 12 2010 || New York City, NY USA || http://bit.ly/9WlnJZ|| || || || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || Indianapolis, IN USA || http://bit.ly/aXCflu|| || ||
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=5&rev2=6 -- - = Page under construction = - #format wiki #language en @@ -18, +16 @@ == Syntax == To support native mapreduce job pig will support following syntax- - {{{ X = ... ; Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ]; @@ -35, +32 @@ Purpose of [[#ref2|pig streaming]] is to send data through an external script or program to transform a dataset into a different dataset based on a custom script written in any programming/scripting language. Pig streaming uses support of hadoop streaming to achieve this. Pig can register custom programs in a script, inline in the stream clause or using a define clause. Pig also provides a level of data guarantees on the data processing, provides feature for job management, provides ability to use distributed cache for the scripts (configurable). Streaming application run locally on individual mapper and reducer nodes for transforming the data. === Hive Transforms === - With [[#ref3|hive transforms]], users can also plug in their own custom mappers and reducers in the data stream. Basically, it is also an application of custom streaming supported by hadoop. Thus, these mappers and reducers can be written in any scripting languages and can be registered to distributed cache to help performance. To support custom map reduce programs written in java ([[#ref4|bezo's blog]]), we can use our custom mappers and reducers as data streaming functions and use them to transform the data using 'java -cp mymr.jar'. This will not invoke a map reduce task but will attempt to transform the data during the map or the reduce task (locally). 
+ With [[#ref3|hive transforms]], users can also plug in their own custom mappers and reducers in the data stream. Basically, it is also an application of custom streaming supported by hadoop. Thus, these mappers and reducers can be written in any scripting language and can be registered to distributed cache to help performance. To support custom map reduce programs written in java ([[#ref4|bizo's blog]]), we can use our custom mappers and reducers as data streaming functions and use them to transform the data using 'java -cp mymr.jar'. This will not invoke a map reduce task but will attempt to transform the data during the map or the reduce task (locally). Thus, both these features can transform data submitted to a map reduce job (mapper) into a different data set and/or transform data produced by a mapreduce job (reducer) into a different data set. But we should note that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these blocks only allow custom transformations inside the data pipeline and do not break the pipeline. @@ -45, +42 @@ Native Mapreduce job needs to conform to some specification defined by Pig. This is required because Pig specifies the input and output directory in the script for this job and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs. Pig also might need to provide some extra configuration like job name, input/output formats, parallelism to the native job. For communicating such parameters to the native job, it should be according to specification provided by Pig. Following are some of the approaches of achieving this- - 1. Ordered inputLoc/outputLoc parameters- This is simplistic approach wherein native programs follow up a convention so that their first and second parameters are treated as input and output respectively. 
Pig ''native'' command takes the parameters required by the native mapreduce job and passes it to native job as command line arguments. It is upto the native program to use these parameters for operations it performs. + 1. '''Ordered inputLoc/outputLoc parameters'''- This is a simplistic approach wherein native programs follow a convention so that their first and second parameters are treated as input and output respectively. The Pig ''native'' command takes the parameters required by the native mapreduce job and passes them to the native job as command line arguments. It is up to the native program to use these parameters for the operations it performs. Thus, only the following lines of code are mandatory inside the native program. {{{ FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); }}} - 2. getJobConf Function- Native jobs implement '''getJobConf''' method which returns org.apache.hadoop.mapred.JobConf object so that pig can schedule the job. This also provides a wa
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=4&rev2=5 -- With native job support, pig can support native map reduce jobs written in java language that can convert a data set into a different data set after applying a custom map reduce functions of any complexity. == Native Mapreduce job specification == - Native Mapreduce job needs to conform to some specification defined by Pig. This is required as Pig specifies the input and output directory in the script for this job and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs. Pig also might need to provide some extra configuration like job name, input/output formats, parallelism to the native job. For communicating such parameters to the native job, it should provide some way of communication. + Native Mapreduce job needs to conform to some specification defined by Pig. This is required because Pig specifies the input and output directory in the script for this job and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs. Pig also might need to provide some extra configuration like job name, input/output formats, parallelism to the native job. For communicating such parameters to the native job, it should be according to specification provided by Pig. Following are some of the approaches of achieving this- 1. Ordered inputLoc/outputLoc parameters- This is simplistic approach wherein native programs follow up a convention so that their first and second parameters are treated as input and output respectively. Pig ''native'' command takes the parameters required by the native mapreduce job and passes it to native job as command line arguments. It is upto the native program to use these parameters for operations it performs. 
@@ -51, +51 @@ FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); }}} + 2. getJobConf Function- Native jobs implement '''getJobConf''' method which returns org.apache.hadoop.mapred.JobConf object so that pig can schedule the job. This also provides a way to add more pig specific parame - - 2. getJobConf Function-
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=3&rev2=4 -- == Comparison with similar features == === Pig Streaming === - Purpose of [[#ref2|pig streaming]] is to send data through an external script or program to transform a dataset into a different dataset based on a custom script written in any programming/scripting language. Pig streaming uses support of hadoop streaming to achieve this. Pig can register custom programs in a script, inline in the stream clause or using a define clause. Pig also provides a level of data guarantees on the data processing, provides feature for job management, provides ability to use distributed cache for the scripts (configurable). Streaming application run locally on individual mapper and reducer nodes. + Purpose of [[#ref2|pig streaming]] is to send data through an external script or program to transform a dataset into a different dataset based on a custom script written in any programming/scripting language. Pig streaming uses support of hadoop streaming to achieve this. Pig can register custom programs in a script, inline in the stream clause or using a define clause. Pig also provides a level of data guarantees on the data processing, provides feature for job management, provides ability to use distributed cache for the scripts (configurable). Streaming application run locally on individual mapper and reducer nodes for transforming the data. === Hive Transforms === With [[#ref3|hive transforms]], users can also plug in their own custom mappers and reducers in the data stream. Basically, it is also an application of custom streaming supported by hadoop. Thus, these mappers and reducers can be written in any scripting languages and can be registered to distributed cache to help performance. 
To support custom map reduce programs written in java ([[#ref4|bezo's blog]]), we can use our custom mappers and reducers as data streaming functions and use them to transform the data using 'java -cp mymr.jar'. This will not invoke a map reduce task but will attempt to transform the data during the map or the reduce task (locally). Thus, both these features can transform data submitted to a map reduce job (mapper) into a different data set and/or transform data produced by a mapreduce job (reducer) into a different data set. But we should notice that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these blocks only allow custom transformations inside the data pipeline and do not break the pipeline. - With native job support, pig can support native map reduce jobs written in the java language that can convert a data set into a different data set after applying a custom map reduce function of any complexity. + With native job support, pig can support native map reduce jobs written in the java language that can convert a data set into a different data set after applying custom map reduce functions of any complexity. == Native Mapreduce job specification == + A native Mapreduce job needs to conform to a specification defined by Pig. This is required because Pig specifies the input and output directory in the script for this job and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs. Pig might also need to pass some extra configuration, such as the job name, input/output formats, or parallelism, to the native job. For communicating such parameters to the native job, it should provide some means of communication. - A native Mapreduce job needs to conform to a specification defined by Pig. Pig specifies the input and output directory in the script for this job and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs.
To allow pig to communicate with the native map reduce job - 1. Ordered inputLoc/outputLoc parameters- + Following are some approaches to achieving this- + 1. Ordered inputLoc/outputLoc parameters- This is a simplistic approach wherein native programs follow a convention so that their first and second parameters are treated as input and output respectively. The Pig ''native'' command takes the parameters required by the native mapreduce job and passes them to the native job as command line arguments. It is up to the native program to use these parameters for the operations it performs. + Thus, only the following lines of code are mandatory inside the native program. + {{{ + FileInputFormat.setInputPaths(conf, new Path(args[0])); + FileOutputFormat.setOutputPath(conf, new Path(args[1])); + }}} + - 2. getJobConf Function- + 2. getJobConf Function- + + == Implementation Details == - + Logical Plan- == References == 1. <> PI
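Under the ordered-parameter convention above, the native driver simply treats its first two command-line arguments as the input and output locations; everything after that is job-specific. A stand-in for that argument handling (the function name and error message are illustrative, not part of Pig):

```python
def parse_native_args(argv):
    """Split a native job's arguments per the convention: argv[0] is the
    input location, argv[1] the output location, the rest extra parameters."""
    if len(argv) < 2:
        raise ValueError("native job expects at least <inputLoc> <outputLoc>")
    return argv[0], argv[1], list(argv[2:])
```

Pig's ''native'' command would build such an argument list from the script; the java lines above then feed args[0] and args[1] to FileInputFormat/FileOutputFormat.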
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=5&rev2=6 -- } }}} + == Approach 3 == + At the Pig contributor workshop in June 2010 Dmitriy Ryaboy proposed that we go the DSL route in Java. Thus the example given above becomes something like: + + {{{ + + public class Main { + + public static void main(String[] args) { + float error = 100.0f; + String infile = "original.data"; + PigBuilder pig = new PigBuilder(); + while (error > 1.0f) { + PigRelation A = pig.load(infile, "piggybank.MyLoader"); + PigRelation B = A.group(pig.ALL); + // It's not entirely clear to me how nested foreach works in this scenario + PigRelation C = B.foreach(new MyFunc("A")); + + PigIterator pi = pig.openIterator(C, "outfile"); + Tuple t = pi.next(); + error = (Float) t.get(1); + if (error >= 1.0f) { + pig.fs.mv("outfile", "infile"); + } + } + } + } + }}} + + This would be accomplished by creating a public interface for Pig operators (here called !PigBuilder, but I'm not proposing that as the actual name) that would + construct a logical plan and execute it when openIterator is called, much as !PigServer does today. Another way to look at this is that !PigServer could be changed to + expose Pig operators instead of just strings as it does today. + + The beauty of doing this in Java is that it can then be used from scripting languages as well. Since Java packages can be directly imported into Jython, JRuby, + Groovy, and other languages, this immediately provides a scripting interface in the language of the user's choice. + + This does violate requirement 10 above (that Pig Latin should appear the same in embedded and non-embedded form), but the cross-language functionality may be worth + it. +
[Pig Wiki] Update of "PigTalksPapers" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTalksPapers" page has been changed by AlanGates. http://wiki.apache.org/pig/PigTalksPapers?action=diff&rev1=10&rev2=11 -- * Pig poster at USENIX 2008: [[http://www.cs.cmu.edu/~olston/usenix08-poster.ppt|ppt]] * An interview with one of Yahoo's most prominent Pig users, including his take on Pig Latin vs. SQL: [[http://developer.yahoo.net/blogs/theater/archives/2008/04/_size75.html|video]] + == Contributor Workshops == + * June 2010 [[attachment:PigContributorWorkshop.pptx|slides]] +
New attachment added to page PigTalksPapers on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "PigTalksPapers" for change notification. An attachment has been added to that page by AlanGates. Following detailed information is available: Attachment name: PigContributorWorkshop.pptx Attachment size: 165564 Attachment link: http://wiki.apache.org/pig/PigTalksPapers?action=AttachFile&do=get&target=PigContributorWorkshop.pptx Page link: http://wiki.apache.org/pig/PigTalksPapers
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=2&rev2=3 -- <> <> - This document captures the specification for native map reduce jobs and a proposal for executing native mapreduce jobs inside a pig script. This is tracked at [[#ref1|Jira]]. + This document captures the specification for native map reduce jobs and a proposal for executing native mapreduce jobs inside a pig script. This is tracked at [[#ref1|PIG-506]]. == Introduction == - Pig needs to provide a way to natively run map reduce jobs written in the java language. + Pig needs to provide a way to natively run map reduce jobs written in the java language. There are some advantages to this- 1. The advantage of the ''native'' keyword is that the user need not worry about coordination between the jobs; pig will take care of it. 2. The user can make use of existing java applications without being a java programmer. @@ -24, +24 @@ Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ]; }}} - This stores '''X''' into '''storeLocation''', which is used by the native mapreduce job to read its data. After we run mymr.jar's mapreduce we load the data back from '''loadLocation''' into alias '''Y'''. + This stores '''X''' into '''storeLocation''' using '''storeFunc''', which is then used by the native mapreduce job to read its data. After we run mymr.jar's mapreduce, we load the data back from '''loadLocation''' into alias '''Y''' using '''loadFunc'''. + + params are extra parameters required by the native mapreduce job (TBD). + + mymr.jar is compliant with the pig specification (see below).
== Comparison with similar features == === Pig Streaming === + The purpose of [[#ref2|pig streaming]] is to send data through an external script or program, transforming a dataset into a different dataset based on a custom script written in any programming/scripting language. Pig streaming uses hadoop streaming support to achieve this. Pig can register custom programs in a script, inline in the stream clause or using a define clause. Pig also provides a level of data guarantees on the data processing, features for job management, and the ability to use the distributed cache for the scripts (configurable). Streaming applications run locally on individual mapper and reducer nodes. - === Hive Transform === + === Hive Transforms === + With [[#ref3|hive transforms]], users can also plug in their own custom mappers and reducers in the data stream. Basically, it is also an application of the custom streaming supported by hadoop. Thus, these mappers and reducers can be written in any scripting language and can be registered to the distributed cache to help performance. To support custom map reduce programs written in java ([[#ref4|bezo's blog]]), we can use our custom mappers and reducers as data streaming functions and use them to transform the data using 'java -cp mymr.jar'. This will not invoke a map reduce task but will attempt to transform the data during the map or the reduce task (locally). + + Thus, both these features can transform data submitted to a map reduce job (mapper) into a different data set and/or transform data produced by a mapreduce job (reducer) into a different data set. But we should notice that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these blocks only allow custom transformations inside the data pipeline and do not break the pipeline.
+ + With native job support, pig can support native map reduce jobs written in the java language that can convert a data set into a different data set after applying a custom map reduce function of any complexity. == Native Mapreduce job specification == - Native Mapreduce job needs to conform to some specification defined by Pig. Pig specifies the input and output directory for this job and is responsible for + Native Mapreduce job needs to conform to some specification defined by Pig. Pig specifies the input and output directory in the script for this job and is responsible for managing the coordination of the native job with the remaining pig mapreduce jobs. To allow pig to communicate with the native map reduce job + 1. Ordered inputLoc/outputLoc parameters- + 2. getJobConf Function- == Implementation Details == @@ -42, +54 @@ 1. <> PIG-506, "Does pig need a NATIVE keyword?", https://issues.apache.org/jira/browse/PIG-506 2. <> Pig Wiki, "Pig Streaming Functional Specification", http://wiki.apache.org/pig/PigStreamingFunctionalSpec 3. <> Hive Wiki, "Transform/Map-Reduce Syntax", http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform + 4. <> Bizos blog, "hi
[Pig Wiki] Update of "Conferences" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Conferences" page has been changed by AlanGates. http://wiki.apache.org/pig/Conferences?action=diff&rev1=2&rev2=3 -- scheduled to present at one, please note that here. If you are aware of conferences, user groups, meetups, etc. that are of interest to the Pig community that are not listed here please add them to the list. - || '''Title''' || '''Date''' || '''Location'''|| '''More Information''' || '''Attending''' || '''Presenting''' || + || '''Title''' || '''Date''' || '''Location'''|| '''More Information''' || '''Attending'''|| '''Presenting''' || - || NoSQL Summer|| Summer 2010 || Multiple world wide || http://nosqlsummer.org/ || || || + || NoSQL Summer|| Summer 2010 || Multiple world wide || http://nosqlsummer.org/ || || || - || Bay Area Hadoop User Group || Jul 21 2010 || Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || || || + || Bay Area Hadoop User Group || Jul 21 2010 || Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || || || - || Apache Asia Roadshow|| Aug 14-15 2010 || Shanghai, China || http://roadshowasia.52ac.com/openconf.php || || || + || Apache Asia Roadshow|| Aug 14-15 2010 || Shanghai, China || http://roadshowasia.52ac.com/openconf.php || || || - || Open SQL Camp || Aug 21-22 2010 || St. Augustin, Germany || http://bit.ly/9X21wr|| || || + || Open SQL Camp || Aug 21-22 2010 || St. 
Augustin, Germany || http://bit.ly/9X21wr|| || || - || VLDB|| Sep 13-17 2010 || Singapore || http://www.vldb2010.org/|| || || + || VLDB|| Sep 13-17 2010 || Singapore || http://www.vldb2010.org/|| || || - || Surge || Sep 30 - Oct 1 2010 || Baltimore, MD USA || http://omniti.com/surge/2010|| || || + || Surge || Sep 30 - Oct 1 2010 || Baltimore, MD USA || http://omniti.com/surge/2010|| || || + || XLDB|| Oct 6 - 7 2010 || Menlo Park, CA USA|| http://www.xldb.org/4 || Alan Gates (Yahoo) || || - || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || Indianapolis, IN USA || http://bit.ly/aXCflu|| || || + || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || Indianapolis, IN USA || http://bit.ly/aXCflu|| || ||
[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/UDFsUsingScriptingLanguages?action=diff&rev1=1&rev2=2 -- @schemaFunction("squareSchema") def squareSchema(input): return input + + # No decorator - bytearray + def concat(str): return str+str }}} Registering test.py with pig under the myfuncs namespace makes the functions myfuncs.helloworld(), myfuncs.complex(2), and myfuncs.square(2.0) available as UDFs. These UDFs can be used with {{{
[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/UDFsUsingScriptingLanguages -- New page: #format wiki #language en <> <> This document captures the specification for UDFs written in scripting languages; it documents the syntax, usage details and performance numbers for this feature. This is tracked at [[#ref1|PIG-928]]. == UDFs Using Scripting Languages == Pig needs to support user defined functions written in different scripting languages such as Python, Ruby, and Groovy. Pig can make use of modules such as [[#ref2|jython]] and [[#ref3|jruby]] which make these scripts available to java. Pig needs to support ways to register functions from script files written in different scripting languages, as well as inline functions defined in the pig script itself. == Syntax == === Registering scripts === {{{ Register 'test.py' using jython as myfuncs; }}} This uses org.apache.pig.scripting.jython.JythonScriptEngine to interpret the python script. Users can use custom script engines to support multiple languages and ways to interpret them. Currently, pig identifies jython as a keyword and ships the required script engine (jython) to interpret it. The following syntax is also supported - {{{ Register 'test.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs; }}} myfuncs is the namespace created for all the functions inside test.py.
A typical test.py looks as follows - {{{ #!/usr/bin/python @outputSchema("x:{t:(word:chararray)}") def helloworld(): return ('Hello, World') @outputSchema("y:{t:(word:chararray,num:long)}") def complex(word): return (str(word),long(word)*long(word)) @outputSchemaFunction("squareSchema") def square(num): return ((num)*(num)) @schemaFunction("squareSchema") def squareSchema(input): return input }}} Registering test.py with pig under the myfuncs namespace makes the functions myfuncs.helloworld(), myfuncs.complex(2), and myfuncs.square(2.0) available as UDFs. These UDFs can be used with {{{ b = foreach a generate myfuncs.helloworld(), myfuncs.square(3); }}} === Decorators and Schemas === To annotate a python script so that pig can identify the return types, we use decorators to define the output schema of a script UDF. '''outputSchema''' defines the schema for a script udf in a format that pig understands and is able to parse. '''outputSchemaFunction''' names a script delegate function that computes the schema for this function depending upon the input type. This is needed for functions that can accept generic types and perform generic operations on them. A simple example is ''square'', which can accept multiple types. The schema function for this case is a simple identity function (same schema as input). '''schemaFunction''' defines the delegate function and is not registered with pig. When no decorator is specified, pig assumes the output datatype is bytearray and converts the output generated by the script function to bytearray. This is consistent with pig's behavior in other cases. ''Sample Schema String'' - y:{t:(word:chararray,num:long)}; the variable names are not used anywhere, they are just there to keep the syntax consistent. == Inline Scripts == == Performance == === Jython === == References == 1. <> PIG-928, "UDFs in scripting languages", https://issues.apache.org/jira/browse/PIG-928 2. <> Jython, "The jython project", http://www.jython.org/ 3.
<> Jruby, "100% pure-java implementation of ruby programming language", http://jruby.org/
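The decorators above only attach schema metadata that the script engine reads back when the functions are registered; the functions themselves stay plain python. A minimal stand-in for that mechanism (this sketches how such decorators could work, not Pig's actual jython engine):

```python
def outputSchema(schema):
    """Attach a fixed pig schema string to a UDF."""
    def decorator(func):
        func.outputSchema = schema
        return func
    return decorator

def outputSchemaFunction(name):
    """Record the name of a delegate function that derives the schema from the input type."""
    def decorator(func):
        func.outputSchemaFunction = name
        return func
    return decorator

@outputSchema("y:{t:(word:chararray,num:long)}")
def complex(word):
    return (str(word), int(word) * int(word))

@outputSchemaFunction("squareSchema")
def square(num):
    return num * num

# A function with no decorator would be treated as returning bytearray.
```

An engine registering this module would read func.outputSchema (or call the named delegate) to type each UDF, falling back to bytearray when neither attribute is present.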
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. The comment on this change is: Page under construction. http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=1&rev2=2 -- + = Page under construction = + #format wiki #language en <> <> - This document captures the specification for native map reduce jobs and a proposal for executing native mapreduce jobs inside a pig script. This is tracked at *https://issues.apache.org/jira/browse/PIG-506. + This document captures the specification for native map reduce jobs and a proposal for executing native mapreduce jobs inside a pig script. This is tracked at [[#ref1|Jira]]. == Introduction == Pig needs to provide a way to natively run map reduce jobs written in the java language. @@ -37, +39 @@ == References == - 1. <> PIG-506, "Does pig need a NATIVE keyword?", https://issues.apache.org/jira/browse/PIG-506 2. <> Pig Wiki, "Pig Streaming Functional Specification", http://wiki.apache.org/pig/PigStreamingFunctionalSpec 3. <> Hive Wiki, "Transform/Map-Reduce Syntax", http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokashi
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "NativeMapReduce" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/NativeMapReduce -- New page: #format wiki #language en <> <> This document captures the specification for native map reduce jobs and a proposal for executing native mapreduce jobs inside a pig script. This is tracked at *https://issues.apache.org/jira/browse/PIG-506. == Introduction == Pig needs to provide a way to natively run map reduce jobs written in the java language. There are some advantages to this- 1. The advantage of the ''native'' keyword is that the user need not worry about coordination between the jobs; pig will take care of it. 2. The user can make use of existing java applications without being a java programmer. == Syntax == To support native mapreduce jobs pig will support the following syntax- {{{ X = ... ; Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ]; }}} This stores '''X''' into '''storeLocation''', which is used by the native mapreduce job to read its data. After we run mymr.jar's mapreduce we load the data back from '''loadLocation''' into alias '''Y'''. == Comparison with similar features == === Pig Streaming === === Hive Transform === == Native Mapreduce job specification == Native Mapreduce job needs to conform to some specification defined by Pig. Pig specifies the input and output directory for this job and is responsible for == Implementation Details == == References == 1. <> PIG-506, "Does pig need a NATIVE keyword?", https://issues.apache.org/jira/browse/PIG-506 2. <> Pig Wiki, "Pig Streaming Functional Specification", http://wiki.apache.org/pig/PigStreamingFunctionalSpec 3. <> Hive Wiki, "Transform/Map-Reduce Syntax", http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
[Pig Wiki] Update of "PoweredBy" by SeanTimm
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PoweredBy" page has been changed by SeanTimm. http://wiki.apache.org/pig/PoweredBy?action=diff&rev1=2&rev2=3 -- Applications and organizations using Pig include (alphabetically): + + * [[http://www.aol.com/|AOL]] + * AOL has multiple clusters from a few nodes to several hundred nodes. + * We use Hadoop for analytics and batch data processing for various applications. + * Hadoop is used by MapQuest, Ad, Search, Truveo, and Media groups. + * All of our jobs are written in Pig or native map reduce. * [[http://www.cooliris.com/|Cooliris]] - Cooliris transforms your browser into a lightning fast, cinematic way to browse photos and videos, both online and on your hard drive. * We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, and 3-4 TB of storage.
Page 0102 deleted from Pig Wiki
Dear wiki user, You have subscribed to a wiki page "Pig Wiki" for change notification. The page "0102" has been deleted by daijy. The comment on this change is: delete spam. http://wiki.apache.org/pig/0102
[Pig Wiki] Update of "0102" by 0102
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "0102" page has been changed by 0102. http://wiki.apache.org/pig/0102
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=4&rev2=5 -- } }}} + === Other Thoughts === + Whichever way we do it, we need to consider what built-in variables we need in the system. For example, it would be really nice to have a + status variable so that you could do something like: + + {{{ + ... + store X into 'foo'; + if ($status == 0) { -- or "success" or whatever + ... + } else { + ... + } + }}} +
[Pig Wiki] Update of "AvoidingSedes" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "AvoidingSedes" page has been changed by ThejasNair. http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=4&rev2=5 -- == Delaying/Avoiding deserialization at runtime == These approaches (except 5) do not involve major changes to core pig code. Load functions, or the serialization between map and reduce, can be changed separately to improve performance. 1. '''!LoadFunctions make use of the public interface !LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not in the required list. This should always improve performance. !PigStorage indirectly works this way: if a column is not used, the optimizer removes the casting (i.e. deserialization) of the column from the type-casting foreach statement which comes after the load. - 1. '''!LoadFunction return a custom tuple, which deserializes fields only when tuple.get(i) is called.''' This can be useful if the first operator after the load is a filter operator - the whole filter expression might not have to be evaluated, and deserialization of all columns might not have to be done. Assuming the first approach is already implemented, this approach is likely to have some overhead for queries where tuple.get(i) is called on all columns/rows. + 1. '''!LoadFunction returns a custom tuple, which deserializes fields only when tuple.get(i) is called.''' This can be useful if the first operator after the load is a filter operator - the whole filter expression might not have to be evaluated, and deserialization of all columns might not have to be done. Assuming the first approach is already implemented, this approach is likely to have some overhead for queries where tuple.get(i) is called on all columns/rows. 1. '''!LoadFunction delays deserialization of map and bag types until a member function of java.util.Map or !DataBag is called.
''' The load function uses subclasses of Map and DataBag which hold the serialized copy. This will help in delaying the deserialization further. This can't be done for scalar types because the classes pig uses for them are final; even if that were not the case we might not see much of a performance gain, because the cost of creating a copy of the serialized data might be high compared to the cost of deserialization. This will only delay serialization up to the MR boundaries. {{{ Example of query where this will help -
[Pig Wiki] Update of "AvoidingSedes" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "AvoidingSedes" page has been changed by ThejasNair. http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=3&rev2=4 -- == Delaying/Avoiding deserialization at runtime == - These approaches do not involve any changes to core pig code. Load functions, or the serialization between map and reduce, can be changed separately to improve performance. + These approaches (except 5) do not involve major changes to core pig code. Load functions, or the serialization between map and reduce, can be changed separately to improve performance. 1. '''!LoadFunctions make use of the public interface !LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not in the required list. This should always improve performance. !PigStorage indirectly works this way: if a column is not used, the optimizer removes the casting (i.e. deserialization) of the column from the type-casting foreach statement which comes after the load. 1. '''!LoadFunction return a custom tuple, which deserializes fields only when tuple.get(i) is called.''' This can be useful if the first operator after the load is a filter operator - the whole filter expression might not have to be evaluated, and deserialization of all columns might not have to be done. Assuming the first approach is already implemented, this approach is likely to have some overhead for queries where tuple.get(i) is called on all columns/rows. 1. '''!LoadFunction delays deserialization of map and bag types until a member function of java.util.Map or !DataBag is called. ''' The load function uses subclasses of Map and DataBag which hold the serialized copy. This will help in delaying the deserialization further.
This can't be done for scalar types because the classes pig uses for them are final; even if that were not the case we might not see much of a performance gain, because the cost of creating a copy of the serialized data might be high compared to the cost of deserialization. This will only delay serialization up to the MR boundaries.
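The custom-tuple idea (deserialize a field only when tuple.get(i) is called) can be sketched as a wrapper that keeps the raw serialized record and decodes each field on demand. The tab-delimited layout and utf-8 codec here are assumptions for illustration, not any particular load function's format:

```python
class LazyTuple:
    """Holds a serialized record; a field is deserialized only when get(i) asks for it."""

    def __init__(self, raw, delimiter=b"\t"):
        self._fields = raw.split(delimiter)  # cheap byte-level split; decoding is deferred
        self._cache = {}                     # fields that have been deserialized so far

    def get(self, i):
        # Decode just the requested field, once; repeated gets hit the cache.
        if i not in self._cache:
            self._cache[i] = self._fields[i].decode("utf-8")
        return self._cache[i]
```

A filter that only examines one field then never pays the decoding cost for the others, which is the win this approach describes; the per-get bookkeeping is also where its overhead comes from when every field ends up being read anyway.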
[Pig Wiki] Update of "AvoidingSedes" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "AvoidingSedes" page has been changed by ThejasNair. http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=2&rev2=3 -- = Avoiding Serialization/De-serialization in pig = - Serialization/De-serialization is expensive and avoiding it will improve performance. + Serialization/De-serialization is expensive and avoiding it will improve performance. This wiki discusses ideas that can help with that. == Delaying/Avoiding deserialization at runtime ==
[Pig Wiki] Update of "AvoidingSedes" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "AvoidingSedes" page has been changed by ThejasNair. http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=1&rev2=2 -- - = Avoiding Serialization/De-serialization in pig + = Avoiding Serialization/De-serialization in pig = Serialization/De-serialization is expensive and avoiding it will improve performance. - = Delaying/Avoiding deserialization at runtime + == Delaying/Avoiding deserialization at runtime == These approaches do not involve any changes to core pig code. Load functions, or the serialization between map and reduce, can be changed separately to improve performance. 1. '''!LoadFunctions make use of the public interface !LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not in the required list. This should always improve performance. !PigStorage indirectly works this way: if a column is not used, the optimizer removes the casting (i.e. deserialization) of the column from the type-casting foreach statement which comes after the load. 1. '''!LoadFunction return a custom tuple, which deserializes fields only when tuple.get(i) is called.''' This can be useful if the first operator after the load is a filter operator - the whole filter expression might not have to be evaluated, and deserialization of all columns might not have to be done. Assuming the first approach is already implemented, this approach is likely to have some overhead for queries where tuple.get(i) is called on all columns/rows.
[Pig Wiki] Update of "AvoidingSedes" by ThejasNair
The "AvoidingSedes" page has been changed by ThejasNair. http://wiki.apache.org/pig/AvoidingSedes -- New page: = Avoiding Serialization/De-serialization in pig Serialization/De-serialization is expensive and avoiding it will improve performance. = Delaying/Avoiding deserialization at runtime These approaches do not involve any changes to core pig code. Load functions, or serialization between map and reduce, can be changed separately to improve performance. 1. '''!LoadFunctions make use of public interface !LoadPushDown.pushProjection.''' Don't deserialize columns that are not in the required list. This should always improve performance. !PigStorage indirectly works this way: if a column is not used, the optimizer removes the casting (i.e. deserialization) of the column from the type-casting foreach statement which comes after the load. 1. '''!LoadFunction returns a custom tuple, which deserializes fields only when tuple.get(i) is called.''' This can be useful if the first operator after load is a filter operator - the whole filter expression might not have to be evaluated, so deserialization of all columns might be avoided. Assuming the first approach is already implemented, this approach is likely to add some overhead for queries where tuple.get(i) is called on all columns/rows. 1. '''!LoadFunction delays deserialization of map and bag types until a member function of java.util.Map or !DataBag is called. ''' The load function uses subclasses of Map and !DataBag which hold the serialized copy. This will help in delaying the deserialization further. This can't be done for scalar types because the classes pig uses for them are final; even if that were not the case, we might not see much performance gain, because the cost of creating a copy of the serialized data might be high compared to the cost of deserialization. 
This will only delay deserialization up to the MR boundaries. {{{ Example of query where this will help - l = LOAD 'file1' AS (a : int, b : map [ ]); f = FOREACH l GENERATE udf1(a), b; -- Approach 2 will not help in delaying deserialization beyond this point. fil = FILTER f BY $0 > 5; dump fil; -- Serialization of column b can be delayed until here using this approach. }}} 1.#4 '''Set the property "pig.data.tuple.factory.name" to use a tuple that understands the serialization format used for bags and maps in approach 3, so that serialized data can be passed from the loader across MR boundaries in the load function's serialization format. ''' The write() and readFields() functions of the tuple returned by the TupleFactory are used to serialize data between Map and Reduce. To use a new custom tuple, you need to use a custom TupleFactory that returns tuples of this type. But this approach will work only if all load functions in the query share the same serialization format for maps and bags. 1. ''' Expose the load function's sedes functionality in a new interface and track lineage of columns''' This would be the elegant and extensible way of doing what is proposed in approach 4. For each serialized column, if we know the deserialization function, we can delay deserialization across MR boundaries.
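The custom-tuple idea behind approaches 2 and 3 can be sketched outside of Pig as follows. This is an illustration in Python only (Pig's real Tuple and !DataBag classes are Java, and the delimiter and cast table here are made up): get(i) triggers deserialization of just one field and caches the result, so untouched fields keep their raw serialized form.

```python
class LazyTuple:
    """Sketch of a tuple that keeps the raw serialized row and only
    deserializes a field the first time get(i) asks for it."""
    def __init__(self, raw, casts, delim=b'\x01'):
        self._raw_fields = raw.split(delim)   # cheap split, no casting yet
        self._casts = casts
        self._cache = {}

    def get(self, i):
        # Deserialize lazily, then cache so repeated gets are cheap.
        if i not in self._cache:
            self._cache[i] = self._casts[i](self._raw_fields[i])
        return self._cache[i]

    def deserialized_fields(self):
        return set(self._cache)   # instrumentation for this example only

t = LazyTuple(b'5\x01hello',
              casts={0: lambda b: int(b), 1: lambda b: b.decode()})
first = t.get(0)   # only field 0 is deserialized here
```

If a filter on field 0 rejects the row, field 1 is never deserialized at all, which is exactly the saving approach 2 describes.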
[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecification" by Aniket Mokashi
The "PigErrorHandlingFunctionalSpecification" page has been changed by Aniket Mokashi. http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=146&rev2=147 -- ||1110||Unsupported query: You have an partition column () inside a in the filter condition.|| ||||Use of partition column/condition with non partition column/condition in filter expression is not supported.|| ||1112||Unsupported query: You have an partition column () in a construction like: (pcond and ...) or (pcond and ...) where pcond is a condition on a partition column.|| + ||1113||Unable to describe schema for nested expression || + ||1114||Unable to find schema for nested alias || ||2000||Internal error. Mismatch in group by arities. Expected: . Found: || ||2001||Unable to clone plan before compiling|| ||2002||The output file(s): already exists||
[Pig Wiki] Update of "Conferences" by AlanGates
The "Conferences" page has been changed by AlanGates. http://wiki.apache.org/pig/Conferences?action=diff&rev1=1&rev2=2 -- interest to the Pig community that are not listed here please add them to the list. || '''Title''' || '''Date''' || '''Location'''|| '''More Information''' || '''Attending''' || '''Presenting''' || - || NoSQL Summer || Summer 2010 || Multiple world wide || http://nosqlsummer.org/ || || || + || NoSQL Summer|| Summer 2010 || Multiple world wide || http://nosqlsummer.org/ || || || - || Chicago Hadoop User Group || Jun 22 2010 || Chicago, IL USA || http://bit.ly/b6Ncl3|| || || || Bay Area Hadoop User Group || Jul 21 2010 || Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || || || + || Apache Asia Roadshow|| Aug 14-15 2010 || Shanghai, China || http://roadshowasia.52ac.com/openconf.php || || || || Open SQL Camp || Aug 21-22 2010 || St. Augustin, Germany || http://bit.ly/9X21wr|| || || || VLDB|| Sep 13-17 2010 || Singapore || http://www.vldb2010.org/|| || || + || Surge || Sep 30 - Oct 1 2010 || Baltimore, MD USA || http://omniti.com/surge/2010|| || || || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || Indianapolis, IN USA || http://bit.ly/aXCflu|| || ||
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=3&rev2=4 -- Object outfile = new String("result.data"); while (error != null && (Double)error > 1.0) { PigServer ps = new PigServer(); - ps.registerQuery("A = load infile;"); + ps.registerQuery("A = load " + infile + ";"); ps.registerQuery("B = group A all;"); ps.registerQuery("C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);"); ps.registerQuery("error = foreach C generate error;");
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=2&rev2=3 -- Thoughts? Preferences for one of the options I did not like? Comments welcome. + == Approach 2 == + And now for something completely different. + + After thinking on the above for a week or so it occurs to me that in dismissing making Pig Latin itself Turing complete I am conflating two tasks + that could be decoupled. The first is defining a grammar for the language and extending the parser. The second is building an execution engine to execute + Pig Latin scripts. It is the second that I am concerned is too much work. Defining the grammar and building the parser is relatively easy (as + we say in the Pig team at Yahoo, "parsers are easy"). + + So what if we did extend Pig Latin itself to be Turing complete, but the first pass over the language was to compile it down to Java code that made + use of the existing !PigServer class to execute the code? This meets all ten requirements given above (some extra work will need to be done to meet + requirement 8 on up front semantic checking, but it is possible). It deals with my initial concern that supporting Turing completeness in Pig Latin + is too much work. It also has the exceedingly nice feature that we do not have to pick any one scripting language. The more I talked to people the + more I discovered some wanted Python, some Ruby, some Perl, some Groovy, etc. This avoids that problem. And the extensions to Pig Latin themselves + will be simple enough that it should not be onerous for people to learn them. It also means that at some future time if we decide that we want more + control over how the language is executed we can make changes without people needing to switch from whatever scripting language we embed it in. 
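To make the "compile Pig Latin down to Java that drives !PigServer" idea concrete, here is a deliberately tiny sketch (in Python, with a made-up compile_statement helper; the real compiler would be a proper parser): an assignment whose right-hand side is a Pig operator becomes a registerQuery call in the generated Java, and a script-level variable such as infile is spliced into the query string rather than quoted literally.

```python
import re

def compile_statement(stmt):
    """Translate one toy Pig Latin statement into a line of generated
    Java against PigServer. Only two statement forms are handled; this
    illustrates the compilation idea, not a real grammar."""
    stmt = stmt.strip().rstrip(';')
    m = re.match(r'(\w+)\s*=\s*load\s+(\w+)$', stmt, re.IGNORECASE)
    if m:
        var, src = m.groups()
        # src is a script variable: splice its runtime value into the
        # generated query string instead of embedding the name literally.
        return 'ps.registerQuery("%s = load \'" + %s + "\';");' % (var, src)
    # Any other relational statement passes through as a quoted query.
    return 'ps.registerQuery("%s;");' % stmt

java_line = compile_statement("A = load infile;")
```

The splicing of the variable's value is the same pattern as the registerQuery fix in the diff above.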
+ + A significant downside to this proposal is now users have to have a Java compiler along to run their Pig Latin scripts. + + The other concerns I gave above about making Pig Latin Turing complete are somewhat addressed, but not totally. It would be possible, though + painful, to use a Java debugger on the generated Java code. Syntax highlighting and completion files could be created for Vim, Emacs, Eclipse, and + whatever other favorite editors people have. + + === Specifics === + The grammar of the language should be kept as simple as possible. The goal is not to create a general purpose programming language. + Tasks requiring these features should still be written in UDFs in Java or a scripting language. + + Each Pig Latin file would be considered as a module. All functions would have global scope within that module and would be visible once the module is + imported. + + The type system would be existing Pig Latin types (we may need to add a list type). Types would be bound at run time (this is necessary to support + existing PL grammar where A = load ... is a declaration of A). + + The grammar would look something like: + + {{{ + program: + import + | register + | define + | func_definition + | block + + import: + IMPORT _modulename_ namespace_clause + + namespace_clause: + (empty) + | AS _namespacename_ + + register: + ... // as now + + define: + ... // as now + + func_definition: + DEF _functionname_ ( arg_list ) { block } + // not sure about this, having DEF and DEFINE different keywords. + // May want to reuse DEFINE here or DEFINE FUNCTION + + arg_list: + expr + | arg_list , expr + + block: + statement + | block statement + + statement: + ; + | assignment + | if + | while + | for + | return // only valid inside functions + | CONTINUE ; // only valid inside loops + | BREAK ; // only valid inside loops + | split + | store + | dump + | fs + + assignment: + _var_ = expr ; + | _var_ = LOAD _inputsrc_ ; + ... // GROUP, FILTER, etc. 
as now + + statement_or_block: + statement + | { block } + + if: + IF ( expr ) statement_or_block else + + else: + (empty) + | ELSE statement_or_block + + while: + WHILE ( expr ) statement_or_block + + for: + FOR ( assignment ; expr ; expr ) statement_or_block + + return: + RETURN ; + | RETURN expr ; + + // split, dump, store, fs as now + }}} + + So the example given initially would look like: + {{{ + error = 100.0; + infile = 'original.data'; + outfile = 'result.data'; + while (error > 1.0) { + A = load infile; + B = group A all; + C = foreach B generate flatten(doSomeCalculation(A)) as (result, error); + error = foreach C generate error; + store C into outfile; + if (error > 1.0) fs mv outfile
[Pig Wiki] Update of "Conferences" by AlanGates
The "Conferences" page has been changed by AlanGates. http://wiki.apache.org/pig/Conferences -- New page: = Conferences and User Groups = This page lists upcoming conferences, user groups, meetups, etc. that the Pig team is aware of. The goal is for Pig users around the world to have a way to identify conferences and other meetings that might be of interest to them. Also, it can help Pig users find each other at these meetings. If you are going to any of these, and especially if you are scheduled to present at one, please note that here. If you are aware of conferences, user groups, meetups, etc. that are of interest to the Pig community that are not listed here please add them to the list. || '''Title''' || '''Date''' || '''Location'''|| '''More Information''' || '''Attending''' || '''Presenting''' || || NoSQL Summer || Summer 2010 || Multiple world wide || http://nosqlsummer.org/ || || || || Chicago Hadoop User Group || Jun 22 2010 || Chicago, IL USA || http://bit.ly/b6Ncl3|| || || || Bay Area Hadoop User Group || Jul 21 2010 || Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || || || || Open SQL Camp || Aug 21-22 2010 || St. Augustin, Germany || http://bit.ly/9X21wr|| || || || VLDB|| Sep 13-17 2010 || Singapore || http://www.vldb2010.org/|| || || || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || Indianapolis, IN USA || http://bit.ly/aXCflu|| || ||
[Pig Wiki] Update of "PigMix" by daijy
The "PigMix" page has been changed by daijy. http://wiki.apache.org/pig/PigMix?action=diff&rev1=16&rev2=17 -- || PigMix_16 || 82.33|| 69.33 || 1.19 || || PigMix_17 || 286 || 229.33|| 1.25 || || Total || 2121.67 || 1929.67 || 1.10 || - ||Weighted Avg || 1.14544 || + || Weighted Avg |||| || 1.15 || == Features Tested ==
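The page does not say how the weighted average is computed, but the 1.14544 figure is consistent with a plain mean of the 17 per-query multipliers from the PigMix2 run table elsewhere in this digest. A quick check under that assumption (the multipliers below are the rounded values from that table):

```python
# Rounded Pig/Java multipliers for PigMix_1 .. PigMix_17 from the
# PigMix2 run table on this page.
multipliers = [1.05, 1.18, 1.88, 1.24, 0.46, 0.95, 1.05, 0.82, 1.27,
               1.07, 0.96, 0.72, 2.42, 0.80, 1.16, 1.19, 1.25]
mean_multiplier = sum(multipliers) / len(multipliers)
# mean_multiplier comes out near 1.145, i.e. the 1.14544 / 1.15 figure
# in the diff above (the published value presumably uses unrounded times).
```

If the real figure weights the queries by something else (for example run time), the arithmetic above is only an approximation.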
[Pig Wiki] Update of "PigMix" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigMix" page has been changed by daijy. http://wiki.apache.org/pig/PigMix?action=diff&rev1=15&rev2=16 -- {{{ A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); - B = order A by user parallel $mappers; + B = order A by user $parallelfactor; store B into 'page_views_sorted' using PigStorage('\u0001'); alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); - a1 = order alpha by name parallel $mappers; + a1 = order alpha by name $parallelfactor; store a1 into 'users_sorted' using PigStorage('\u0001'); a = load 'power_users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); @@ -287, +287 @@ This script tests reading from a map, flattening a bag of maps, and use of bincond (features 2, 3, and 4). {{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)action as action, (map[])page_info as page_info, @@ -304, +304 @@ This script tests using a join small enough to do in fragment and replicate (feature 7). {{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, estimated_revenue; @@ -321, +321 @@ something that pig could potentially optimize by not regrouping. 
{{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (double)estimated_revenue; @@ -340, +340 @@ This script covers foreach generate with a nested distinct (feature 10). {{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, action; @@ -359, +359 @@ This script does an anti-join. This is useful because it is a use of cogroup that is not a regular join (feature 9). {{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user; @@ -377, +377 @@ This script covers the case where the group by key is a significant percentage of the row (feature 12). {{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, action, (int)timespent as timespent, query_term, ip_addr, timestamp; @@ -392, +392 @@ This script covers having a nested plan with splits (feature 11). 
{{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (user, action, timespent, query_term, + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, timestamp; C = group B by user $parallelfactor; @@ -409, +409 @@ This script covers group all (feature 13). {{{ register pigperf.jar; - A = load '$page_views' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() + A = load '$page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); B = foreach A generate user, (int)timespent as timespent, (double)estimated_revenue as e
[Pig Wiki] Update of "PigMix" by daijy
The "PigMix" page has been changed by daijy. http://wiki.apache.org/pig/PigMix?action=diff&rev1=14&rev2=15 -- PigMix is a set of queries used to test Pig performance from release to release. There are queries that test latency (how long does it take to run this query?), and queries that test scalability (how many fields or records can pig handle before it fails?). In addition it includes a set of map reduce java programs to run equivalent map reduce jobs directly. These will be used to test the performance - gap between direct use of map reduce and using pig. + gap between direct use of map reduce and using pig. In June 2010, we released PigMix2, which adds 5 more queries in addition to + the original 12 queries in PigMix to measure the performance of new Pig features. We will publish the results of both PigMix and PigMix2. == Runs == + === PigMix === The following table includes runs done of PigMix. All of these runs have been done on a cluster with 26 slaves plus one machine acting as the name node and job tracker. The cluster was running hadoop version 0.18.1. (TODO: Need to get specific hardware info on those machines). @@ -140, +142 @@ || Total || 1407 || 1362.33 || 1.03 || || Weighted Avg || || || 1.09 || + === PigMix2 === + Run date: May 29, 2010, run against top of trunk as of that day. 
+ || Test || Pig run time || Java run time || Multiplier || + || PigMix_1 || 122.33 || 117 || 1.05 || + || PigMix_2 || 50.33|| 42.67 || 1.18 || + || PigMix_3 || 189 || 100.33|| 1.88 || + || PigMix_4 || 75.67|| 61|| 1.24 || + || PigMix_5 || 64 || 138.67|| 0.46 || + || PigMix_6 || 65.67|| 69.33 || 0.95 || + || PigMix_7 || 88.33|| 84.33 || 1.05 || + || PigMix_8 || 39 || 47.67 || 0.82 || + || PigMix_9 || 274.33 || 215.33|| 1.27 || + || PigMix_10 || 333.33 || 311.33|| 1.07 || + || PigMix_11 || 151.33 || 157 || 0.96 || + || PigMix_12 || 70.67|| 97.67 || 0.72 || + || PigMix_13 || 80 || 33|| 2.42 || + || PigMix_14 || 69 || 86.33 || 0.80 || + || PigMix_15 || 80.33|| 69.33 || 1.16 || + || PigMix_16 || 82.33|| 69.33 || 1.19 || + || PigMix_17 || 286 || 229.33|| 1.25 || + || Total || 2121.67 || 1929.67 || 1.10 || + ||Weighted Avg || 1.14544 || == Features Tested == @@ -160, +184 @@ 1. union plus distinct 1. order by 1. multi-store query (that is, a query where data is scanned once, then split and grouped different ways). + 1. outer join + 1. merge join + 1. multiple distinct aggregates + 1. accumulative mode The data is generated so that it has a Zipf-type distribution for the group by and join keys, as this models most human generated data. @@ -207, +235 @@ between key value pairs and Ctrl-D between keys and values. Bags in the file are delimited by Ctrl-B between tuples in the bag. A special loader, !PigPerformanceLoader, has been written to read this format. 
+ PigMix2 includes 4 more data sets, which can be derived from the original dataset: + {{{ + A = load 'page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader() + as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links); + B = order A by user parallel $mappers; + store B into 'page_views_sorted' using PigStorage('\u0001'); + + alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); + a1 = order alpha by name parallel $mappers; + store a1 into 'users_sorted' using PigStorage('\u0001'); + + a = load 'power_users' using PigStorage('\u0001') as (name, phone, address, city, state, zip); + b = sample a 0.5; + store b into 'power_users_samples' using PigStorage('\u0001'); + + A = load 'page_views' as (user, action, timespent, query_term, ip_addr, timestamp, + estimated_revenue, page_info, page_links); + B = foreach A generate user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links, + user as user1, action as action1, timespent as timespent1, query_term as query_term1, ip_addr as ip_addr1, timestamp as timestamp1, estimated_revenue as estimated_revenue1, page_info as page_info1, page_links as page_links1, + user as user2, action as action2, timespent as timespent2, query_term as query_term2, ip_addr as ip_addr2, timestamp as timestamp2, estimated_revenue as estimated_revenue2, page_info as page_
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=1&rev2=2 -- = Making Pig Latin Turing Complete = == Introduction == As more users adopt Pig and begin writing their data processing in Pig Latin and as they use Pig to process more and more complex - tasks, a consistent request from these users is to add branches, loops, and functions to Pig Latin. This will enable Pig Latin to + tasks, a consistent request from these users has been to add branches, loops, and functions to Pig Latin. This will enable Pig Latin to process a whole new class of problems. Consider, for example, an algorithm that needs to iterate until an error estimate is less than a given threshold. This might look like (this just suggests logic, not syntax): @@ -22, +22 @@ == Requirements == The following should be provided by this Turing complete Pig Latin: - 1. Branching. This will be satisfied by a standard `if` `else if` `else` functionality + 1. Branching. This will be satisfied by a standard `if / else if / else` functionality 1. Looping. This should include standard `while` and some form of `for`. for could be C style or Python style (foreach). Care needs to be taken to select syntax that does not cause confusion with the existing `foreach` operator in Pig Latin. 1. Functions. 1. Modules. @@ -49, +49 @@ * Which scripting language to choose? Perl, Python, and Ruby all have significant adoption and could make a claim to be the best choice. * Syntactic and semantic checking is usually delayed until an embedded bit of code is reached in the outer control flow. Given that Pig jobs can run for hours this can mean spending hours to discover a simple typo. - Consider for example if built a python class that wrapped !PigServer and then translated the above code snippet. 
+ Consider for example if Pig provided a Jython class that wrapped !PigServer and then we translated the above code snippet. {{{ error = 100.0 @@ -68, +68 @@ grunt.exec("fs mv 'outfile' 'infile'") }}} - All of these references to `pig` and `grunt` as objects with command strings is undesirable. + All of these references to `pig` and `grunt` as objects with command strings are undesirable. So while I believe that embedding is a much better approach due to the lower work load and the plethora of tools available for other languages, I do not believe the above is an acceptable way to do it. Thus I would like to place three additional requirements on embedded Pig Latin beyond those given above for Turing complete Pig Latin: @@ -79, +79 @@ This overcomes two of the three drawbacks noted above. It does not provide for a way to do certain optimizations such as loop unrolling, but I think that is acceptable. + Having rejected the quote style of programming we could choose the Domain Specific Language (DSL) option, where we define Pig operators in the + target language. Again using Python as an example: + + {{{ +error = 100.0 +infile = 'original.data' +pig = PigServer() +grunt = Grunt() +while error > 1.0: +A = pig.load(infile, { 'loader' => 'piggybank.MyLoader'}); +B = A.group(pig.ALL); +C = B.foreach { + innerBag = doSomeCalculation(:A); + generate innerBag.flatten().as(:result, :error) +} + +PigIterator pi = pig.openIterator(C, 'outfile'); +output = grunt.fs.cat('outfile'); +bla = output.partition("\t"); +error = bla(2) +if error >= 1.0: +grunt.fs.mv('outfile', 'infile'); + }}} + + This meets requirements 7 and 9 above. It can partially but not fully meet 8. It can check that we use the right operators and pass + them the right types. It cannot check the semantics of the operators, for example that `infile` exists and is readable. This might be ok, + because it might turn out that things that cannot be checked at script compile time should not be checked up front anyway. 
As an example, it should not + check for `infile` up front because the script may not have created it yet. + + This approach has the advantage that it will integrate very nicely with tools from the target language. Debuggers, IDEs, etc. will all now + view some form of Pig Latin as native to their language. + + It does however have a drawback, which is that we would be creating a new dialect of Pig Latin. There would be a Pig Latin dialect used when writing it + directly, and a different dialect for embedding. This leads to confusion and duplication of effort. So I would like to suggest another + requirement: + + 1.#10 Pig Latin should appear the same in the embedded form as in the non-embedded form. +
[Pig Wiki] Update of "TuringCompletePig" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "TuringCompletePig" page has been changed by AlanGates. http://wiki.apache.org/pig/TuringCompletePig -- New page: = Making Pig Latin Turing Complete = == Introduction == As more users adopt Pig and begin writing their data processing in Pig Latin and as they use Pig to process more and more complex tasks, a consistent request from these users is to add branches, loops, and functions to Pig Latin. This will enable Pig Latin to process a whole new class of problems. Consider, for example, an algorithm that needs to iterate until an error estimate is less than a given threshold. This might look like (this just suggests logic, not syntax): {{{ error = 100.0; infile = 'original.data'; while (error > 1.0) { A = load 'infile'; B = group A all; C = foreach B generate flatten(doSomeCalculation(A)) as (result, error); error = foreach C generate error; store C into 'outfile'; if (error > 1.0) mv 'outfile' 'infile'; } }}} == Requirements == The following should be provided by this Turing complete Pig Latin: 1. Branching. This will be satisfied by a standard `if` `else if` `else` functionality 1. Looping. This should include standard `while` and some form of `for`. for could be C style or Python style (foreach). Care needs to be taken to select syntax that does not cause confusion with the existing `foreach` operator in Pig Latin. 1. Functions. 1. Modules. 1. The ability to use local in memory variables in the Pig Latin script. For example, in the snippet given above the way `infile` is defined above the `while` and then used in the `load`. 1. The ability to "store" results into local in memory variables. For example, in the snippet given above the way the error calculation from the data processing is stored into `error` in the line `error = foreach C generate error;`. == Approach == There are two possible approaches to this. 
One is to add all of these features to Pig Latin itself. This has several advantages: * All Pig Latin operations will be first class objects in the language. There will not be a need to do quoted programming, like what happens when JDBC is used to write SQL inside a Java program. * There will be opportunities to do optimizations that are not available in embedded programming, such as loop unrolling, etc. However, the cost of this approach is incredible. It means turning Pig Latin into a full scripting language. And it means all kinds of tools like debuggers, etc. will never be available for Pig Latin users because the Pig team will not have the resources or expertise to develop and maintain such tools. And finally, does the world need another scripting language that starts with P? The second possible approach to this is to embed Pig Latin into an existing scripting language, such as Perl, Python, Ruby, etc. The advantages of this are: * Most of the requirements noted above (branching, looping, functions, and modules) are present in these languages. * For any of these languages whole hosts of tools such as debuggers, IDEs, etc. exist and could be used. * Users do not have to learn a new language. There are a few significant drawbacks to this approach: * It leads to a quoted programming style which is unnatural and irritating for developers. * Which scripting language to choose? Perl, Python, and Ruby all have significant adoption and could make a claim to be the best choice. * Syntactic and semantic checking is usually delayed until an embedded bit of code is reached in the outer control flow. Given that Pig jobs can run for hours this can mean spending hours to discover a simple typo. Consider for example if built a python class that wrapped !PigServer and then translated the above code snippet. 
{{{ error = 100.0 infile = 'original.data' pig = PigServer() grunt = Grunt() while error > 1.0: pig.registerQuery("A = load 'infile'; \ B = group A all; \ C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);") PigIterator pi = pig.openIterator("C", 'outfile'); output = grunt.exec("fs cat 'outfile'"); bla = output.partition("\t"); error = bla(2) if error >= 1.0: grunt.exec("fs mv 'outfile' 'infile'") }}} All of these references to `pig` and `grunt` as objects with command strings is undesirable. So while I believe that embedding is a much better approach due to the lower work load and the plethora of tools available for other languages, I do not believe the above is an acceptable way to do it. Thus I would like to place three additional requirements on embedded Pig Latin beyond those given above for Turing complete Pig Latin: 1.#7 Pig Latin should appear as
[Pig Wiki] Update of "PigJournal" by AlanGates
The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal?action=diff&rev1=6&rev2=7 -- '''Dependency:''' - '''References:''' + '''References:''' [[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]] '''Estimated Development Effort:''' Small
[Pig Wiki] Update of "PigJournal" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigJournal" page has been changed by AlanGates. http://wiki.apache.org/pig/PigJournal?action=diff&rev1=5&rev2=6 -- || Multiquery support || 0.3 || || || Add skewed join || 0.4 || || || Add merge join || 0.4 || || + || Add Zebra as contrib project || 0.4 || || || Support Hadoop 0.20 || 0.5 || || || Improved Sampling|| 0.6 || There is still room for improvement for order by sampling || || Change bags to spill after reaching fixed size || 0.6 || Also created bag backed by Hadoop iterator for single UDF cases || @@ -32, +33 @@ || Switch local mode to Hadoop local mode || 0.6 || || || Outer join for default, fragment-replicate, skewed || 0.6 || || || Make configuration available to UDFs || 0.6 || || + || Load Store Redesign || 0.7 || || + || Add Owl as contrib project || not yet released || || + || Pig Mix 2.0 || not yet released || || == Work in Progress == This covers work that is currently being done. For each entry the main JIRA for the work is referenced. - || Feature || JIRA || Comments || + || Feature || JIRA || Comments || - || Metadata || [[http://issues.apache.org/jira/browse/PIG-823|PIG-823]] || || + || Boolean Type || [[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || || - || Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || || + || Query Optimizer || [[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || || - || Load Store Redesign || [[http://issues.apache.org/jira/browse/PIG-966|PIG-966]] || || - || Add SQL Support || [[http://issues.apache.org/jira/browse/PIG-824|PIG-824]] || || - || Change Pig internal representation of charrarry to Text || [[http://issues.apache.org/jira/browse/PIG-1017|PIG-1017]] || Patch ready, unclear when to commit to minimize disruption to users and destabilization to code base. 
|| - || Integration with Zebra || [[http://issues.apache.org/jira/browse/PIG-833|PIG-833]] || || + || Cleanup of javadocs || [[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || || + || UDFs in scripting languages || [[https://issues.apache.org/jira/browse/PIG-928|PIG-928]] || || + || Ability to specify a custom partitioner || [[https://issues.apache.org/jira/browse/PIG-282|PIG-282]] || || + || Pig usage stats collection || [[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], [[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], [[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], [[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || || + || Make Pig available via Maven || [[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || || == Proposed Future Work == @@ -68, +73 @@ Within each subsection order is alphabetical and does not imply priority. === Agreed Work, Agreed Approach === - Boolean Type - Boolean is currently supported internally as a type in Pig, but it is not exposed to users. Data cannot be of type boolean, nor can UDFs (other than - !FilterFuncs) return boolean. Users have repeatedly requested that boolean be made a full type. - - '''Category:''' New Functionality - - '''Dependency:''' Will affect all !LoadCasters, as they will have to provide byteToBoolean methods. - - '''References:''' - - '''Estimated Development Effort:''' small - Combiner Not Used with Limit or Filter Pig Scripts that have a foreach with a nested limit or filter do not use the combiner even when they could. Not all filters can use the combiner, but in some cases they can. I think all limits could at least apply the limit in the combiner, though the UDF itself may only be executed in the reducer. @@ -226, +219 @@ '''Estimated Development Effort:''' small - Pig Mix 2.0 - Pig Mix has
[Pig Wiki] Update of "PigInteroperability" by jeff zhang
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigInteroperability" page has been changed by jeff zhang. http://wiki.apache.org/pig/PigInteroperability?action=diff&rev1=1&rev2=2 -- == Pig and Hive RCFiles == The !HiveColumnarLoader, available as part of PiggyBank in Pig 0.7.0. + == Pig and Voldemort == + The Pig LoadFunc for Voldemort. + See http://github.com/rsumbaly/voldemort/blob/hadoop/contrib/hadoop/src/java/voldemort/hadoop/pig/VoldemortStore.java +
[Pig Wiki] Update of "HowToRelease" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowToRelease" page has been changed by daijy. http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=21&rev2=22 -- ant clean ant test ant clean + ant jar + cd contrib/zebra + ant + cd ../.. + cd contrib/owl + ant + cd ../.. + cd contrib/piggybank/java + ant + cd ../../.. ant -Dversion=X.Y.Z -Djava5.home= -Dforrest.home= tar }}} 2. Test the tar file by unpacking the release and
[Pig Wiki] Trivial Update of "LoadStoreMigrationGuide" by newacct
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "LoadStoreMigrationGuide" page has been changed by newacct. http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=38&rev2=39 -- longend= Long.MAX_VALUE; private byte recordDel = '\n'; private byte fieldDel = '\t'; - private ByteArrayOutputStream mBuf = null; + private ByteArrayOutputStream mBuf; - private ArrayList mProtoTuple = null; + private ArrayList mProtoTuple; private static final String UTF8 = "UTF-8"; public SimpleTextLoader() { @@ -96, +96 @@ case 'x': case 'u': this.fieldDel = - Integer.valueOf(delimiter.substring(2)).byteValue(); + (byte)Integer.parseInt(delimiter.substring(2)); break; default: throw new RuntimeException("Unknown delimiter " + delimiter); } } else { - throw new RuntimeException("PigStorage delimeter must be a single character"); + throw new RuntimeException("PigStorage delimiter must be a single character"); } } @@ -141, +141 @@ this.end = end; // Since we are not block aligned we throw away the first - // record and cound on a different instance to read it + // record and count on a different instance to read it if (offset != 0) { getNext(); } @@ -179, +179 @@ === New Implementation === {{{ public class SimpleTextLoader extends LoadFunc { - protected RecordReader in = null; + protected RecordReader in; private byte fieldDel = '\t'; - private ArrayList mProtoTuple = null; + private ArrayList mProtoTuple; private TupleFactory mTupleFactory = TupleFactory.getInstance(); private static final int BUFFER_SIZE = 1024; @@ -207, +207 @@ case 'x': fieldDel = - Integer.valueOf(delimiter.substring(2), 16).byteValue(); + (byte)Integer.parseInt(delimiter.substring(2), 16); break; case 'u': this.fieldDel = - Integer.valueOf(delimiter.substring(2)).byteValue(); + (byte)Integer.parseInt(delimiter.substring(2)); break; default: throw new RuntimeException("Unknown delimiter " + delimiter); } } else { - throw new 
RuntimeException("PigStorage delimeter must be a single character"); + throw new RuntimeException("PigStorage delimiter must be a single character"); } } @@ -313, +313 @@ case 'x': case 'u': this.fieldDel = - Integer.valueOf(delimiter.substring(2)).byteValue(); + (byte)Integer.parseInt(delimiter.substring(2)); break; default: throw new RuntimeException("Unknown delimiter " + delimiter); } } else { - throw new RuntimeException("PigStorage delimeter must be a single character"); + throw new RuntimeException("PigStorage delimiter must be a single character"); } } @@ -496, +496 @@ case 'x': fieldDel = - Integer.valueOf(delimiter.substring(2), 16).byteValue(); + (byte)Integer.parseInt(delimiter.substring(2), 16); break; case 'u': this.fieldDel = - Integer.valueOf(delimiter.substring(2)).byteValue(); + (byte)Integer.parseInt(delimiter.substring(2)); break; default: throw new RuntimeException("Unknown delimiter " + delimiter); } } else { - throw new RuntimeException("PigStorage delimeter must be a single character"); + throw new RuntimeException("PigStorage delimiter must be a single character"); } }
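The recurring change in the diff above replaces `Integer.valueOf(...).byteValue()` with `(byte)Integer.parseInt(...)`, which parses straight to a primitive `int` and casts, instead of boxing an `Integer` only to unbox a byte. A minimal standalone sketch of the corrected parsing logic (the class and the `parseDelimiter` helper are hypothetical names for illustration, not part of PigStorage itself):

```java
public class DelimiterParse {

    // Parse a PigStorage-style delimiter argument into a single byte.
    // Escape forms follow the pattern shown in the migration guide above:
    //   "\\t"   -> tab
    //   "\\x09" -> hex code point 0x09 (tab)
    //   "\\u9"  -> decimal code point 9 (tab)
    static byte parseDelimiter(String delimiter) {
        if (delimiter.length() == 1) {
            return (byte) delimiter.charAt(0);
        }
        if (delimiter.length() > 1 && delimiter.charAt(0) == '\\') {
            switch (delimiter.charAt(1)) {
                case 't':
                    return (byte) '\t';
                case 'x':
                    // parseInt returns a primitive int; the cast truncates to a
                    // byte without the boxing that valueOf().byteValue() incurred.
                    return (byte) Integer.parseInt(delimiter.substring(2), 16);
                case 'u':
                    return (byte) Integer.parseInt(delimiter.substring(2));
                default:
                    throw new RuntimeException("Unknown delimiter " + delimiter);
            }
        }
        throw new RuntimeException("PigStorage delimiter must be a single character");
    }

    public static void main(String[] args) {
        System.out.println(parseDelimiter("\\x09"));
        System.out.println(parseDelimiter(","));
    }
}
```

The behavior is unchanged for valid input; the rewrite only avoids the needless `Integer` allocation on each call.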
[Pig Wiki] Update of "HowToRelease" by daijy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowToRelease" page has been changed by daijy. http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=20&rev2=21 -- cd build md5sum pig-X.Y.Z.tar.gz > pig-X.Y.Z.tar.gz.md5 }}} + 4. If you do not have a gpg key pair, do the following steps: +a. Generate a key pair using the following command. You can simply accept all default settings and give your name, email and passphrase. {{{ + gpg --gen-key + }}} +a. Export your public key. {{{ + gpg --armor --output pubkey.txt --export 'Your Name' + }}} +a. Open pubkey.txt, copy the full text and append it to the following files by pasting, then commit these changes: {{{ + https://svn.apache.org/repos/asf/hadoop/pig/branches/branch-X.Y.Z/KEYS + https://svn.apache.org/repos/asf/hadoop/pig/trunk/KEYS + }}} +a. Upload updated KEYS to Apache. {{{ + scp KEYS people.apache.org:/www/www.apache.org/dist/hadoop/pig/KEYS + }}} +a. Export your private key and keep it with you. {{{ + gpg --export-secret-key -a "Your Name" > private.key + }}} - 4. Sign the release (see [[http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step|Step-By-Step Guide to Mirroring Releases]] for more information). [TODO: add details on how to generate and store keys]{{{ + 5. Sign the release (see [[http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step|Step-By-Step Guide to Mirroring Releases]] for more information). {{{ gpg --armor --output pig-X.Y.Z.tar.gz.asc --detach-sig pig-X.Y.Z.tar.gz }}} + 6. Verify the gpg signature. {{{ + gpg --import KEYS (if necessary) + gpg --verify pig-X.Y.Z.tar.gz.asc pig-X.Y.Z.tar.gz + }}} - 5. Copy release files to a public place (usually into public_html in your home directory):{{{ + 7. 
Copy release files to a public place (usually into public_html in your home directory):{{{ ssh people.apache.org mkdir public_html/pig-X.Y.Z-candidate-0 scp -p pig-X.Y.Z.tar.gz* people.apache.org:public_html/pig-X.Y.Z-candidate-0 cd .. scp RELEASE_NOTES.txt people.apache.org:public_html/pig-X.Y.Z-candidate-0 }}} - 6. Call a release vote. The initial email should be sent to `pig-...@hadoop.apache.org`. Make sure to attache rat report to it. Here is a sample of email: {{{ + 8. Call a release vote. The initial email should be sent to `pig-...@hadoop.apache.org`. Make sure to attach the rat report to it. Here is a sample email: {{{ From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Tuesday, November 25, 2008 3:59 PM To: pig-...@hadoop.apache.org @@ -170, +191 @@ }}} 6. Update the front page news in author/src/documentation/content/xdocs/index.xml. 7. Update the release news in author/src/documentation/content/xdocs/releases.xml. - 7. Update the documentation links in author/src/documentation/content/xdocs/site.xml + 8. Update the documentation links in author/src/documentation/content/xdocs/site.xml - 8. Copy in the release specific documentation {{{ + 9. Copy in the release specific documentation {{{ cd publish mkdir docs/rX.Y.Z - cp -pr /build/docs/* publish/docs/rX.Y.Z/ + cp -pr /docs/* publish/docs/rX.Y.Z/ svn add publish/docs/rX.Y.Z }}} - 9. Regenerate the site, review it and commit in HowToCommit. + 10. Regenerate the site, review it and commit in HowToCommit. - 10. Deploy your site changes.{{{ + 11. Deploy your site changes.{{{ ssh people.apache.org cd /www/hadoop.apache.org/pig svn up }}} - 10. Wait until you see your changes reflected on the Apache web site. + 12. Wait until you see your changes reflected on the Apache web site. - 11. Send announcements to the user and developer lists as well as (`annou...@haoop.apache.org`) once the site changes are visible. {{{ + 13. 
Send announcements to the user and developer lists as well as (`annou...@hadoop.apache.org`) once the site changes are visible. {{{ Pig team is happy to announce Pig X.Y.Z release. Pig is Hadoop subproject which provides high-level data-flow language and execution framework for parallel computation on Hadoop clusters. @@ -192, +213 @@ The highlights of this release are ... The details of the release can be found at http://hadoop.apache.org/pig/releases.html. }}} - 12. In JIRA, mark the release as released. + 14. In JIRA, mark the release as released. a. Go to JIRA and click on the Administration tab. a. Select the Pig project. a. Select Manage versions. @@ -200, +221 @@ a. If a description has not yet been added for the version you are releasing, select Edit Details and give a brief description of the release. a. If the next version does not exist (that is, if you are releasing version 0.x and version 0.x+1 does not yet exist) create it using the Add Version box at the top of the page. - 13. In JIRA, mark the issues resolved in this release as closed. + 15. In J
[Pig Wiki] Update of "PoweredBy" by DanHarvey
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PoweredBy" page has been changed by DanHarvey. http://wiki.apache.org/pig/PoweredBy?action=diff&rev1=1&rev2=2 -- * [[http://twitter.com|Twitter]]<> * We use Pig extensively to process usage logs, mine tweet data, and more. * We maintain [[http://github.com/kevinweil/elephant-bird|Elephant Bird]], a set of libraries for working with Pig, LZO compression, protocol buffers, and more. - * More details can be seen in this presentation: http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010<> + * More details can be seen in this presentation: http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010 + * [[http://www.yahoo.com/|Yahoo!]] * More than 100,000 CPUs in >25,000 computers running Hadoop * Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) @@ -31, +32 @@ * [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about how we use Hadoop. * >40% of Hadoop Jobs within Yahoo are Pig jobs. + * [[http://www.mendeley.com|Mendeley]]<> + * We are creating a platform to aggregate research and allow researchers to get the most out of the web. + * We moved all our catalogue stats and analysis to HBase and Pig + * We are using Scribe in combination with Pig for all our server, application and user log processing. + * Pig helps us get business analytics, user experience evaluation, feature feedback and more out of these logs +
[Pig Wiki] Update of "PigLatin" by test_abc
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigLatin" page has been changed by test_abc. http://wiki.apache.org/pig/PigLatin?action=diff&rev1=35&rev2=36 -- {{{ <1> <3> + <4> <5> }}}
[Pig Wiki] Update of "PigLatin" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigLatin" page has been changed by OlgaN. http://wiki.apache.org/pig/PigLatin?action=diff&rev1=34&rev2=35 -- <> <> <> + + '''THIS PAGE IS OBSOLETE. Please use documentation at http://hadoop.apache.org/pig/''' '''Note:''' For Pig 0.2.0 or later, some content on this page may no longer be applicable.
[Pig Wiki] Update of "Eclipse_Environment" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Eclipse_Environment" page has been changed by ThejasNair. http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=15&rev2=16 -- * Window > Open Perspective > Java * Window > Show View > ''see the various options ...'' - Download jars and generate code - To download the required jars and generate code in src-gen, run 'ant jar' in trunk dir. - Update the Build Configuration - * run 'ant eclipse-files' in trunk/ dir. + * run 'ant eclipse-files' in trunk/ dir. * Refresh the project in eclipse You are all set now!
[Pig Wiki] Update of "HowToDocumentation" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowToDocumentation" page has been changed by OlgaN. http://wiki.apache.org/pig/HowToDocumentation?action=diff&rev1=11&rev2=12 -- * Run the "ant docs" command: ant docs -Djava5.home=''java5_path'' -Dforrest.home=''forrest_path'' * To check the *.html and *.pdf output, change to this directory: /trunk/docs + For releases, be sure to do the following: +* Update the doc tab + * Open the tabs.xml file (…/src/docs/src/documentation/content/xdocs/tabs.xml) + * Update the doc tab for the current release. For example, changeto +* Update the API link + * Open the site.xml file (…/src/docs/src/documentation/content/xdocs/site.xml) + * Update the external api reference for the current release. For example, change http://hadoop.apache.org/pig/docs/r0.6.0/api/ to http://hadoop.apache.org/pig/docs/r0.7.0/api/ +
[Pig Wiki] Update of "Eclipse_Environment" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Eclipse_Environment" page has been changed by ThejasNair. http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=14&rev2=15 -- * Refresh the project in eclipse You are all set now! - The 'ant eclipse-files' target does not exist in revisions before r938733, and you have to follow the steps below - + The 'ant eclipse-files' target that generates eclipse configuration does not exist in revisions before r938733. So if you checked out an earlier version, you have to follow the steps below - After the Java project is created, update the build configuration.
[Pig Wiki] Update of "Eclipse_Environment" by ThejasNair
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Eclipse_Environment" page has been changed by ThejasNair. http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=13&rev2=14 -- * Window > Open Perspective > Java * Window > Show View > ''see the various options ...'' + Download jars and generate code + To download the required jars and generate code in src-gen, run 'ant jar' in trunk dir. + Update the Build Configuration + * run 'ant eclipse-files' in trunk/ dir. + * Refresh the project in eclipse + You are all set now! + + The 'ant eclipse-files' target does not exist in revisions before r938733, and you have to follow the steps below - + After the Java project is created, update the build configuration. To update the build configuration:
[Pig Wiki] Trivial Update of "PigAbstractionLayer" by newacct
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigAbstractionLayer" page has been changed by newacct. http://wiki.apache.org/pig/PigAbstractionLayer?action=diff&rev1=4&rev2=5 -- * Created an entity handle for a container. * * @param name of the container -* @return a container descripto +* @return a container description * @throws DataStorageException if name does not conform to naming * convention enforced by the Data Storage. */ @@ -192, +192 @@ }}} === Data Storage Descriptors === - Descriptors are a represenation of entities in the Data Storage and are used to access and carry out operations on such entities. + Descriptors are a representation of entities in the Data Storage and are used to access and carry out operations on such entities. There are Element Descriptors and Container Descriptors. The latter are descriptors for entities that contain Data Storage Element Descriptors. {{{
[Pig Wiki] Update of "HowToRelease" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowToRelease" page has been changed by AlanGates. http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=19&rev2=20 -- BUG FIXES PIG-342: Fix DistinctDataBag to recalculate size after it has spilled. (bdimcheff via gates) }}} - 2. Edit `src/docs/src/documentation/content/xdocs/site.xml`. In the external reference for api where the link contains `change_to_correct_version_number_after_branching` change this string to the + 2. Edit `src/docs/src/documentation/content/xdocs/site.xml`. In the external reference for api where the link contains the previous version number change this string to the correct version number. - correct version number. 3. Commit these changes to trunk:{{{ svn commit -m "Preparing for release X.Y.Z" }}}
[Pig Wiki] Update of "FrontPage" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/FrontPage?action=diff&rev1=147&rev2=148 -- * [[http://hadoop.apache.org/pig/|Official Apache Pig Website]] * PigTalksPapers - Pig talks, papers, interviews + * PoweredBy - a (partial) list of companies using Pig == User Documentation ==
[Pig Wiki] Update of "PoweredBy" by DmitriyRyaboy
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PoweredBy" page has been changed by DmitriyRyaboy. http://wiki.apache.org/pig/PoweredBy -- New page: Applications and organizations using Pig include (alphabetically): * [[http://www.cooliris.com/|Cooliris]] - Cooliris transforms your browser into a lightning fast, cinematic way to browse photos and videos, both online and on your hard drive. * We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, and 3-4 TB of storage. * We use Hadoop for all of our analytics, and we use Pig to allow PMs and non-engineers the freedom to query the data in an ad-hoc manner.<> * [[http://www.dropfire.com/|DropFire]] * We generate Pig Latin scripts that describe structural and semantic conversions between data contexts * We use Hadoop to execute these scripts for production-level deployments * Eliminates the need for explicit data and schema mappings during database integration * [[http://www.linkedin.com/|LinkedIn]] * 3x30 Nehalem-based node grids, with 2x4 cores, 16GB RAM, 8x1TB storage using ZFS in a JBOD configuration. * We use Hadoop and Pig for discovering People You May Know and other fun facts. * [[http://www.ning.com/|Ning]] * We use Hadoop to store and process our log files * We rely on Apache Pig for reporting, analytics, Cascading for machine learning, and on a proprietary [[/hadoop/JavaScript|JavaScript]] API for ad-hoc queries * We use commodity hardware, with 8 cores and 16 GB of RAM per machine * [[http://twitter.com|Twitter]]<> * We use Pig extensively to process usage logs, mine tweet data, and more. * We maintain [[http://github.com/kevinweil/elephant-bird|Elephant Bird]], a set of libraries for working with Pig, LZO compression, protocol buffers, and more. 
* More details can be seen in this presentation: http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010<> * [[http://www.yahoo.com/|Yahoo!]] * More than 100,000 CPUs in >25,000 computers running Hadoop * Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) * Used to support research for Ad Systems and Web Search * Also used to do scaling tests to support development of Hadoop on larger clusters * [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about how we use Hadoop. * >40% of Hadoop Jobs within Yahoo are Pig jobs.
[Pig Wiki] Update of "Eclipse_Environment" by AshutoshChauhan
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "Eclipse_Environment" page has been changed by AshutoshChauhan. http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=12&rev2=13 -- For Pig, you need the [[http://www.easyeclipse.org/site/plugins/javacc.html|JavaCC plugin]] and the[[http://subclipse.tigris.org/|Subclipse Subversion plugin]]. To download and install the plugins: + - 1. Open Eclipse + 1. Open Eclipse 1. Select Help > Software Updates... > Available Software 1. Add the two plugin sites by pressing Add Site... Button - * http://eclipse-javacc.sourceforge.net + 1. http://eclipse-javacc.sourceforge.net - * http://subclipse.tigris.org/update_1.4.x + 1. http://subclipse.tigris.org/update_1.4.x - 1.#4 Select the plugins that appear under these sites + 1. Select the plugins that appear under these sites 1. Press Install - and follow the prompts to download and install the plugins Add the Pig Trunk Repository To add the Pig trunk repository: + 1. Open Eclipse 1. Select file > New > Other... 1. Choose SVN, Repository Location > Next 1. Under the General tab: - * URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk + 1. URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk - * Use a custom label: Pig + 1. Use a custom label: Pig - 1.#5 Click Finish + 1. Click Finish To view the results: + * Window > Open Perspective > Other... > SVN Repository Exploring * Window > Show View > SVN Repositories Create a Java Project - First, create a directory on your development machine (for example "mypig") and checkout the Pig source from SVN: http://svn.apache.org/repos/asf/hadoop/pig/trunk Note: Windows users need to download and install TortoiseSVN (http://tortoiseSVN.tigris.org/) To create a Java project: + 1. Open Eclipse 1. Select file > New > Other ... 1. Select Java Project 1. On the New Java Project dialog: - * Project name: !PigProject + 1. 
Project name: !PigProject - * Select: Create project from existing source + 1. Select: Create project from existing source - * Directory: browse to the "mypig" directory on your development machine and select the Trunk directory + 1. Directory: browse to the "mypig" directory on your development machine and select the Trunk directory - 1.#5 Click Next + 1. Click Next 1. Click Finish To view the results: + * Window > Open Perspective > Java * Window > Show View > ''see the various options ...'' @@ -54, +58 @@ After the Java project is created, update the build configuration. To update the build configuration: + 1. Open Eclipse 1. Select Window > Open Perspective > Java (to open the !MyPig project) 1. Select Project > Properties 1. For the Java Build Path, check the settings as shown below. Source + {{{ lib-src/bzip2 lib-src/shock @@ -68, +74 @@ test -> Make sure nothing is excluded The default output folder should be bin. + }}} + Libraries - }}} - - - Libraries {{{ lib/hadoopXXX.jar lib/hbaseXXX-test.jar lib/hbaseXXX.jar + lib/Pig/zookeeper-hbase-xxx.jar build/ivy/lib/Pig/javacc.jar - build/ivy/lib/Pig/jline-XXX.jar + build/ivy/lib/Pig/jline-XXX.jar build/ivy/lib/Pig/jsch-xxx.jar build/ivy/lib/Pig/junit-xxx.jar + }}} + NOTE: - }}} - NOTE: For pig sources checked out from Apache before revision r771273, replace "build/ivy/lib/Pig" with "lib". Revision r771273 and above in apache svn use ivy to resolve dependencies need to build pig. + 1. For pig sources checked out from Apache before revision r771273, replace "build/ivy/lib/Pig" with "lib". Revision r771273 and above in apache svn use ivy to resolve the dependencies needed to build pig. + 1. If you are building piggybank you will need a few extra jars. You can find all of those in build/ivy/lib/Pig/ once you run the jar target of ant successfully. 
Order and Export + {{{ Should be the following order: @@ -96, +104 @@ src JRE System Library all the jars from the "Libraries" tab - }}} - - Troubleshooting -* Build problems: Check if eclipse is using JDK version 1.6, pig needs it (Under Preferences/Java/Compiler). + * Build problems: Check if eclipse is using JDK version 1.6, pig needs it (Under Preferences/Java/Compiler). Tips -* To build using eclipse , open the ant window (Windows/Show View/Ant) , then drag and drop build.xml under your project to this window. Double click on jar in that will build pig.jar, on test will run unit tests. + * To build using eclipse, open the ant window (Windows/Show View/Ant), then drag and drop build.xml under your project to this window. Double click on jar to build pig.jar, and on test to run the unit tests.
[Pig Wiki] Update of "HowToRelease" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "HowToRelease" page has been changed by AlanGates. http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=18&rev2=19 -- BUG FIXES PIG-342: Fix DistinctDataBag to recalculate size after it has spilled. (bdimcheff via gates) }}} + 2. Edit `src/docs/src/documentation/content/xdocs/site.xml`. In the external reference for api where the link contains `change_to_correct_version_number_after_branching` change this string to the + correct version number. 3. Commit these changes to trunk:{{{ svn commit -m "Preparing for release X.Y.Z" }}} @@ -56, +58 @@ 7. Commit these changes to trunk:{{{ svn commit -m "Preparing for X.Y+1.0 development" }}} - - TODO: - 1. Add documentation update the process once we integrate the documentation into forrect. (Will need docs target in build.xml) == Updating Release Branch ==
[Pig Wiki] Update of "owl" by jaytang
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "owl" page has been changed by jaytang. http://wiki.apache.org/pig/owl?action=diff&rev1=15&rev2=16 -- || Feature || Status || - || Owl is a stand-alone table store, not tied to any particular data query or processing languages, currently supporting MR, Pig Latin, and Pig SQL || current || + || Owl is a stand-alone table store, not tied to any particular data query or processing languages, supporting MR, Pig Latin, and Pig SQL || current || || Owl has a flexible data partitioning model, with multiple levels of partitioning, physical and logical partitioning, and partition pruning for query optimization || current || || Owl has a flexible interface for pushing projections and filters all the way down || current || || Owl has a framework for storing data in many storage formats, and different storage formats can co-exist within the same table || current || @@ -39, +39 @@ == Prerequisite == - Owl depends on Pig for its tuple classes as its basic unit of data container, and Hadoop 20 for !OwlInputFormat. Its first release will require Pig 0.7 and Hadoop 20.2. Owl also requires a storage driver; Owl integrates with Zebra 0.7 out-of-the-box. + Owl depends on Pig for its tuple classes as its basic unit of data container, and Hadoop 20 for !OwlInputFormat. Its first release will require Pig 0.7 or later and Hadoop 20.2 or later. Owl integrates with Zebra 0.7 out-of-the-box. 
== Getting Owl == @@ -78, +78 @@ After installing Tomcat and MySQL, you will need these files: -* owl-<0.x.x>.war - owl web application +* owl-<0.x.x>.war - owl web application at contrib/owl/build -* owl-<0.x.x>.jar - owl client library ''!OwlInputFormat'' and ''!OwlDriver'' with all their dependent 3rd party libraries +* owl-<0.x.x>.jar - owl client library ''!OwlInputFormat'' and ''!OwlDriver'' with all their dependent 3rd party libraries at contrib/owl/build * mysql * mysql_schema.sql - owl database schema file at contrib/owl/setup/mysql * owlServerConfig.xml - owl server configuration file at contrib/owl/setup/mysql @@ -87, +87 @@ * oracle_schema.sql - owl database schema file at contrib/owl/setup/oracle * owlServerConfig.xml - owl server configuration file at contrib/owl/setup/oracle - Set up parameters in owlServerConfig: - -* update jdbc driver connection information in owlServerConfig.xml -* put this file on the same box where tomcat is installed - Create db schema in !MySql: * create a database "owl" in mysql * create db schema with "mysql_schema.sql" * make sure the user specified in jdbc connection string has full access to all objects in the newly created "owl" db + + Set up parameters in owlServerConfig: + +* update jdbc driver connection information in owlServerConfig.xml +* put this file on the same box where tomcat is installed Deploy Owl to Tomcat:
[Pig Wiki] Update of "owl" by jaytang
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "owl" page has been changed by jaytang. http://wiki.apache.org/pig/owl?action=diff&rev1=14&rev2=15 -- Sample code is attached to write a client application against owl: * Sample code using !OwlDriver API: [[attachment:TestOwlDriverSample.java]] + == Next Step == + + We recognize that Hive already addressed some of the above problems, and that there is significant overlap between Owl and Hive. Yet we also believe that Owl adds important new features that are necessary for managing very large tables. We look forward to collaborating with the Hive team on finding the right model for integration between the two systems and creating a unified data management system for Hadoop. +
[Pig Wiki] Update of "owl" by jaytang
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "owl" page has been changed by jaytang. http://wiki.apache.org/pig/owl?action=diff&rev1=13&rev2=14 -- == High Level Diagram == + + {{attachment:owl.jpg}} As one can see, Owl gives Hadoop users a uniform interface for organizing, discovering, and managing data stored in many different formats, and for promoting interoperability among different programming frameworks. Owl presents a single logical view of data organization and hides the complexity and evolution of underlying physical data layout schemes. It gives Hadoop applications a stable foundation to build upon.
New attachment added to page owl on Pig Wiki
Dear Wiki user, You have subscribed to a wiki page "owl" for change notification. An attachment has been added to that page by jaytang. Following detailed information is available: Attachment name: owl.jpg Attachment size: 20679 Attachment link: http://wiki.apache.org/pig/owl?action=AttachFile&do=get&target=owl.jpg Page link: http://wiki.apache.org/pig/owl
[Pig Wiki] Update of "owl" by jaytang
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "owl" page has been changed by jaytang. http://wiki.apache.org/pig/owl?action=diff&rev1=12&rev2=13 -- The core M/R programming interfaces as we know them (the mapper, reducer, output collector, record reader, and input format) all deal with collections of abstract data objects, not files. However, the current set of !InputFormat implementations provided by the job API is relatively primitive and heavily coupled to file formats and HDFS paths to describe input and output locations. From an application programmer’s perspective, one has to think about both the abstract data and the physical representation and storage location, which is a disconnect from the abstract data API. In the meantime, file formats and (de)serialization libraries have flourished in the Hadoop community. Some of these require certain metadata to operate/optimize. While providing optimization and performance enhancements, these file formats and SerDe libs don’t make it any easier to develop applications on and manage very big data sets. - == High Level Diagram == + == High Level Diagram == As one can see, Owl gives Hadoop users a uniform interface for organizing, discovering, and managing data stored in many different formats, and for promoting interoperability among different programming frameworks. Owl presents a single logical view of data organization and hides the complexity and evolution of underlying physical data layout schemes. It gives Hadoop applications a stable foundation to build upon.
@@ -34, +34 @@ || Owl has support for converting data between write-friendly and read-friendly formats || future || || Owl has support for addressing HDFS NameNode limitations by decreasing the number of files needed to store very large data sets || future || || Owl provides a security model for secure data access || future || - == Prerequisite == @@ -102, +101 @@ * deploy owl war file to Tomcat * set up -Dorg.apache.hadoop.owl.xmlconfig= for the Tomcat deployment - == Developing on Owl == + == Developing on Owl == Owl has two major public APIs. ''Owl Driver'' provides management APIs against three core Owl abstractions: "Owl Table", "Owl Database", and "Partition". This API is backed by an internal Owl metadata store that runs on Tomcat and a relational database. ''!OwlInputFormat'' provides a data access API and is modeled after the traditional Hadoop !InputFormat. In the future, we plan to support ''!OwlOutputFormat'' and thus the notion of "Owl Managed Table", where Owl controls the data flow into and out of "Owl Tables". Owl also supports Pig integration via the OwlPigLoader/Storer module.
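The Pig integration mentioned above might look roughly like the following Pig Latin. This is a hypothetical sketch: the jar name, the loader's package path, the location string, and the field names are all assumptions, since the OwlPigLoader's actual signature is not documented here.

```
-- Hypothetical sketch of loading an Owl table via the OwlPigLoader
-- module mentioned above; class name and location syntax are assumed.
register owl-0.x.x.jar;
events = LOAD 'owl:/mydb/clickevents' USING org.apache.hadoop.owl.pig.OwlPigLoader();
recent = FILTER events BY datestamp == '20100331';
DUMP recent;
```

The point of the sketch is the division of labor: Owl resolves the table name to partitions and storage formats, so the script never mentions HDFS paths or file formats.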
[Pig Wiki] Update of "owl" by jaytang
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "owl" page has been changed by jaytang. http://wiki.apache.org/pig/owl?action=diff&rev1=11&rev2=12 -- = Apache Owl Wiki = - The goal of Owl is to provide a high level data management abstraction. !MapReduce and Pig applications interacting directly with HDFS directories and files must deal with low level data management issues such as storage format, serialization/compression schemes, data layout, and efficient data access, often with different solutions. Owl aims to provide a standard way to address this issue and abstracts away the complexities of reading/writing huge amounts of data from/to HDFS. + == Vision == - Owl provides a tabular view of data on Hadoop and thus supports the notion of ''Owl Tables''. Conceptually, it is similar to a relational database table. An Owl Table has these characteristics: + Owl provides a more natural abstraction for Map-Reduce and Map-Reduce-based technologies (e.g., Pig, SQL) by allowing developers to express large datasets as tables, which in turn consist of rows and columns. Owl tables are similar, but not identical, to familiar database / data warehouse tables. + The core M/R programming interfaces as we know them (the mapper, reducer, output collector, record reader, and input format) all deal with collections of abstract data objects, not files. However, the current set of !InputFormat implementations provided by the job API is relatively primitive and heavily coupled to file formats and HDFS paths to describe input and output locations. From an application programmer’s perspective, one has to think about both the abstract data and the physical representation and storage location, which is a disconnect from the abstract data API. In the meantime, file formats and (de)serialization libraries have flourished in the Hadoop community. Some of these require certain metadata to operate/optimize.
While providing optimization and performance enhancements, these file formats and SerDe libs don’t make it any easier to develop applications on and manage very big data sets. -* lives in an Owl database name space and can contain multiple partitions -* has columns and rows and supports a unified table-level schema -* its interface supports !MapReduce and Pig Latin and can easily work with other languages -* designed for efficient batch read/write operations; partitions can be added or removed from a table -* supports external tables (data already exists on the file system) -* pluggable architecture for different storage formats such as Zebra -* presents a logically partitioned view of data and supports very large data sets via its multi-level flexible partitioning scheme -* efficient data access mechanisms over very large data sets via partition and projection pruning - Owl has two major public APIs. ''Owl Driver'' provides management APIs against three core Owl abstractions: "Owl Table", "Owl Database", and "Partition". This API is backed by an internal Owl metadata store that runs on Tomcat and a relational database. ''!OwlInputFormat'' provides a data access API and is modeled after the traditional Hadoop !InputFormat. In the future, we plan to support ''!OwlOutputFormat'' and thus the notion of "Owl Managed Table", where Owl controls the data flow into and out of "Owl Tables". Owl also supports Pig integration via the OwlPigLoader/Storer module. + == High Level Diagram == - Initially, we would like to open-source Owl as a Pig contrib project. In the long term, Owl could become a separate Hadoop subproject, as it provides a platform service to all Hadoop applications. + As one can see, Owl gives Hadoop users a uniform interface for organizing, discovering, and managing data stored in many different formats, and for promoting interoperability among different programming frameworks.
Owl presents a single logical view of data organization and hides the complexity and evolution of underlying physical data layout schemes. It gives Hadoop applications a stable foundation to build upon. + + == Main Properties and Features == + + + || Feature || Status || + || Owl is a stand-alone table store, not tied to any particular data query or processing languages, currently supporting MR, Pig Latin, and Pig SQL || current || + || Owl has a flexible data partitioning model, with multiple levels of partitioning, physical and logical partitioning, and partition pruning for query optimization || current || + || Owl has a flexible interface for pushing projections and filters all the way down || current || + || Owl has a framework for storing data in many storage formats, and different storage formats can co-exist within the same table || current || + || Owl provides a capability discovery mechanism to allow applications to tak
[Pig Wiki] Update of "FrontPage" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "FrontPage" page has been changed by AlanGates. http://wiki.apache.org/pig/FrontPage?action=diff&rev1=146&rev2=147 -- = Apache Pig Wiki = - [[http://incubator.apache.org/pig/|Apache Pig]] is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions but you can also create your own user-defined functions to do special-purpose processing. + [[http://hadoop.apache.org/pig/|Apache Pig]] is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions but you can also create your own user-defined functions to do special-purpose processing. Pig Latin programs run in a distributed fashion on a cluster (programs are compiled into Map/Reduce jobs and executed using Hadoop). For quick prototyping, Pig Latin programs can also run in "local mode" without a cluster (all processing takes place in a single local JVM). @@ -20, +20 @@ '''Why Pig Latin instead of SQL?''' [[http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf|Pig Latin: A Not-So-Foreign Language ...]] - '''Pig Has Grown Up!''' On 10/22/08 Pig graduated from the [[http://incubator.apache.org/|Incubator]] and joined [[http://hadoop.apache.org/|Apache Hadoop]] as a subproject. - - '''Pig is Getting Faster!''' 2-6 times faster, for many queries. We've created a set of benchmarks and run them against the pig 0.1.0 release (modified to run on hadoop 0.18) and against the current trunk (previously the `types` branch). Joins and order bys in particular made large performance gains. For complete details see PigMix.
- - '''Interested in Pig Guts?''' We are completely redesigning the Pig execution and optimization framework. For design details see PigOptimizationWishList and PigExecutionModel. - - '''Want to contribute but don't know where to kick in?''' Here is a [[http://wiki.apache.org/pig/ProposedProjects|list of project]] that we desired. We need new blood! + '''Want to contribute but don't know where to kick in?''' Here is our [[http://wiki.apache.org/pig/PigJournal|journal]] of projects we have worked on, are working on, + and hope to work on. Find a project that interests you and jump on in. '''Pig available as part of Amazon's Elastic !MapReduce''', as of August 2009. @@ -40, +35 @@ * [[http://hadoop.apache.org/pig/|User Documentation]] * [[http://www.cloudera.com/hadoop-training-pig-introduction|Online Pig Training]] - Complete with video lectures, exercises, and a pre-configured virtual machine. Developed by Cloudera and Yahoo! * PiggyBank - User-defined functions (UDFs) contributed by Pig users! + * PigTools - Tools Pig users have built around and on top of Pig. + * PigInteroperability - How to make Pig work with other platforms you may be using, such as HBase and Cassandra. == Developer Documentation == * How tos
[Pig Wiki] Update of "PigInteroperability" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigInteroperability" page has been changed by AlanGates. http://wiki.apache.org/pig/PigInteroperability -- New page: This page describes how Pig interoperates with other platforms, such as HBase and Hive. == Pig and Cassandra == http://issues.apache.org/jira/browse/CASSANDRA-910 A loader for loading Cassandra data into Pig. Works with Pig 0.7.0 (branched but not yet released as of 3/31/2010). == Pig and HBase == In Pig 0.6 and before, the built-in HBaseStorage can be used to load data from HBase. Work is ongoing to enhance this loader and make it a storage function as well. See http://issues.apache.org/jira/browse/PIG-1205 == Pig and Hive RCFiles == The !HiveColumnarLoader is available as part of PiggyBank in Pig 0.7.0.
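As a rough illustration of the built-in HBaseStorage loader mentioned above, a load might look like this. Treat the exact strings as assumptions: the table-name convention and column syntax changed across Pig versions, so check the release you are running.

```
-- Sketch of loading two columns from an HBase table with the
-- built-in loader; 'hbase://' prefix and column syntax are assumed.
raw = LOAD 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age');
DUMP raw;
```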
[Pig Wiki] Update of "PigTools" by AlanGates
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "PigTools" page has been changed by AlanGates. http://wiki.apache.org/pig/PigTools?action=diff&rev1=14&rev2=15 -- http://code.google.com/p/pig-eclipse Provides a Pig Latin editor in Eclipse, with syntax highlighting. Built out of personal interest; it currently has fewer features than PigPen. + + === Elephant-Bird === + http://github.com/kevinweil/elephant-bird/ + + Twitter's library of LZO and/or Protocol Buffer-related Hadoop !InputFormats, !OutputFormats, Writables, Pig !LoadFuncs, HBase miscellanea, etc. The majority of these are in production at Twitter, running over data every day. === Emacs Pig Latin Mode === http://github.com/cloudera/piglatin-mode
[Pig Wiki] Update of "owl" by jaytang
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The "owl" page has been changed by jaytang. http://wiki.apache.org/pig/owl?action=diff&rev1=10&rev2=11 -- * !OwlInputFormat API - org.apache.hadoop.owl.mapreduce Sample code for writing a client application against Owl is attached: + * Sample code using !OwlDriver API: [[attachment:TestOwlDriverSample.java]]