[Pig Wiki] Update of "ProposedByLaws" by AlanGates

2010-10-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "ProposedByLaws" page has been changed by AlanGates.
http://wiki.apache.org/pig/ProposedByLaws?action=diff&rev1=3&rev2=4

--

  
  Voting can also be applied to changes already made to the Pig codebase. These
  typically take the form of a veto (-1) in reply to the commit message
- sent when the commit is made.  Note that this should be a rare occurance.
+ sent when the commit is made.  Note that this should be a rare occurrence.
  All efforts should be made to discuss issues when they are still patches 
before the code is committed.
  
  === Approvals ===


[Pig Wiki] Update of "ProposedByLaws" by AlanGates

2010-10-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "ProposedByLaws" page has been changed by AlanGates.
http://wiki.apache.org/pig/ProposedByLaws?action=diff&rev1=2&rev2=3

--

  In general votes should not be called at times when it is
  known that interested members of the project will be unavailable.
  
- || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' 
|| '''Length''' ||
+ || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' 
|| '''Minimum Length''' ||
  || Code Change || A change made to a codebase of the project and committed by 
a committer. This includes source code, documentation, website content, etc. || 
Lazy approval (not counting the vote of the contributor), moving to lazy 
majority if a -1 is received || Active committers || 1 ||
  || Release Plan || Defines the timetable and actions for a release. The plan 
also nominates a Release Manager. || Lazy majority || Active committers || 3 ||
  || Product Release || When a release of one of the project's products is 
ready, a vote is required to accept the release as an official release of the 
project. || Lazy Majority || Active PMC members || 3 ||


[Pig Wiki] Update of "ProposedByLaws" by AlanGates

2010-10-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "ProposedByLaws" page has been changed by AlanGates.
http://wiki.apache.org/pig/ProposedByLaws?action=diff&rev1=1&rev2=2

--

  perception of an action in the wider Pig community. For PMC decisions,
  only the votes of PMC members are binding.
  
- Voting can also be applied to changes made to the Pig codebase. These
+ Voting can also be applied to changes already made to the Pig codebase. These
  typically take the form of a veto (-1) in reply to the commit message
- sent when the commit is made.
+ sent when the commit is made.  Note that this should be a rare occurance.
+ All efforts should be made to discuss issues when they are still patches 
before the code is committed.
  
  === Approvals ===
  These are the types of approvals that can be sought. Different actions
@@ -171, +172 @@

  === Actions ===
  This section describes the various actions which are undertaken within
  the project, the corresponding approval required for that action and
- those who have binding votes over the action.
+ those who have binding votes over the action.  It also specifies the minimum 
length of time that a vote must remain open, measured in business days.
+ In general votes should not be called at times when it is
+ known that interested members of the project will be unavailable.
  
- || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' 
||
+ || '''Action''' || '''Description''' || '''Approval''' || '''Binding Votes''' 
|| '''Length''' ||
- || Code Change || A change made to a codebase of the project and committed by 
a committer. This includes source code, documentation, website content, etc. || 
Lazy approval || Active committers ||
+ || Code Change || A change made to a codebase of the project and committed by 
a committer. This includes source code, documentation, website content, etc. || 
Lazy approval (not counting the vote of the contributor), moving to lazy 
majority if a -1 is received || Active committers || 1 ||
- || Release Plan || Defines the timetable and actions for a release. The plan 
also nominates a Release Manager. || Lazy majority || Active committers ||
+ || Release Plan || Defines the timetable and actions for a release. The plan 
also nominates a Release Manager. || Lazy majority || Active committers || 3 ||
- || Product Release || When a release of one of the project's products is 
ready, a vote is required to accept the release as an official release of the 
project. || Lazy Majority || Active PMC members ||
+ || Product Release || When a release of one of the project's products is 
ready, a vote is required to accept the release as an official release of the 
project. || Lazy Majority || Active PMC members || 3 ||
- || Adoption of New Codebase || When the codebase for an existing, released 
product is to be replaced with an alternative codebase. If such a vote fails to 
gain approval, the existing code base will continue.  This also covers the 
creation of new sub-projects within the project. || 2/3 majority || Active PMC 
members '''NOTE''': Change from Hadoop proposal which had Active committers ||
+ || Adoption of New Codebase || When the codebase for an existing, released 
product is to be replaced with an alternative codebase. If such a vote fails to 
gain approval, the existing code base will continue.  This also covers the 
creation of new sub-projects within the project. || 2/3 majority || Active PMC 
members '''NOTE''': Change from Hadoop proposal which had Active committers || 
6 ||
- || New Committer || When a new committer is proposed for the project. || Lazy 
consensus || Active PMC members ||
+ || New Committer || When a new committer is proposed for the project. || Lazy 
consensus || Active PMC members || 3 ||
- || New PMC Member || When a committer is proposed for the PMC. || Lazy 
consensus || Active PMC members ||
+ || New PMC Member || When a committer is proposed for the PMC. || Lazy 
consensus || Active PMC members || 3 ||
- || Committer Removal || When removal of commit privileges is sought.  
'''Note:''' Such actions will also be referred to the ASF board by the PMC 
chair. || Consensus || Active PMC members (excluding the committer in question 
if a member of the PMC). ||
+ || Committer Removal || When removal of commit privileges is sought.  
'''Note:''' Such actions will also be referred to the ASF board by the PMC 
chair. || Consensus || Active PMC members (excluding the committer in question 
if a member of the PMC). || 6 ||
- || PMC Member Removal || When removal of a PMC member is sought.  '''Note:''' 
 Such actions will also be referred to the ASF board by the PMC chair. || 
Consensus || Active PMC members (excluding the member in question). ||
+ || PMC Member Removal || When removal of a PMC member is sought.  '''Note:''' 
 Such actions will also be referred to the ASF board b

[Pig Wiki] Update of "PigLatin" by jsha

2010-09-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigLatin" page has been changed by jsha.
http://wiki.apache.org/pig/PigLatin?action=diff&rev1=36&rev2=37

--

  <>
  <>
  
- '''THIS PAGE IS OBSOLETE. Please use documentation at 
http://hadoop.apache.org/pig/'''
+ '''THIS PAGE IS OBSOLETE. Please use Pig Latin documentation at 
http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref1.html'''
  
  '''Note:''' For Pig 0.2.0 or later, some content on this page may no longer 
be applicable.
  


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by nirajrai

2010-09-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=158&rev2=159

--

  ||2254 ||Currently merged cogroup is not supported after blocking operators. 
||
  ||2255 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It 
should have 2. ||
  ||2256 ||Cannot remove and reconnect node with multiple inputs/outputs ||
+ ||2257 ||An unexpected exception caused the validation to stop ||
  
  ||2998 ||Unexpected internal error. ||
  ||2999 ||Unhandled internal error. ||


[Pig Wiki] Update of "ProposedByLaws" by AlanGates

2010-09-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "ProposedByLaws" page has been changed by AlanGates.
http://wiki.apache.org/pig/ProposedByLaws

--

New page:
The following is a proposal for bylaws for the Apache Pig project.  I took 
this almost verbatim from a proposal made by Owen O'Malley for the Hadoop
Project.  Places where I modified it I have tagged with '''NOTE'''.

= Apache Pig Project Bylaws =
This document defines the bylaws under which the Apache Pig project
operates. It defines the roles and responsibilities of the
project, who may vote, how voting works, how conflicts are resolved,
etc.

Pig is a project of the
[[http://www.apache.org/foundation/|Apache Software Foundation]].  The 
foundation holds the copyright on Apache
code including the code in the Pig codebase. The
[[http://www.apache.org/foundation/faq.html|foundation FAQ]]
explains the operation and background of the foundation.

Pig is typical of Apache projects in that it operates under a set of
principles, known collectively as the 'Apache Way'. If you are
new to Apache development, please refer to the
[[http://incubator.apache.org|Incubator project]]
for more information on how Apache projects operate.

== Roles and Responsibilities ==
Apache projects define a set of roles with associated rights and
responsibilities. These roles govern what tasks an individual may perform
within the project. The roles are defined in the following sections.

=== Users ===
The most important participants in the project are people who use our
software. The majority of our contributors start out as users and guide
their development efforts from the user's perspective.

Users contribute to the Apache projects by providing feedback to
contributors in the form of bug reports and feature suggestions. As
well, users participate in the Apache community by helping other users
on mailing lists and user support forums.

=== Contributors ===
'''NOTE''': Changed from "Developers" in Hadoop proposal to "Contributors", and 
throughout 

All of the volunteers who are contributing time, code, documentation,
or resources to the Pig Project. A contributor that makes sustained,
welcome contributions to the project may be invited to become a
Committer, though the exact timing of such invitations depends on many factors.

=== Committers ===
The project's Committers are responsible for the project's
technical management. Committers have access to a specified
set of subproject's subversion repositories. Committers on
subprojects may cast binding votes on any technical discussion
regarding that subproject.

Committer access is by invitation only and must be approved by lazy
consensus of the active PMC members. A Committer is considered emeritus
by their own declaration or by not contributing in any form to the
project for over six months. An emeritus committer may request
reinstatement of commit access from the PMC which will be sufficient
to restore him or her to active committer status.

'''NOTE''':  Change from Hadoop proposal, added phrase "which will be 
sufficient..." and removed
"Such reinstatement is subject to lazy consensus of active PMC members."

Commit access can be revoked by a unanimous vote of all the
active PMC members (except the committer in question if they
are also a PMC member).

All Apache committers are required to have a signed Contributor License
Agreement (CLA) on file with the Apache Software Foundation. There is a
[[http://www.apache.org/dev/committers.html|Committer FAQ]]
which provides more details on the requirements for Committers

A committer who makes a sustained contribution to the project may be
invited to become a member of the PMC. The form of contribution is
not limited to code. It can also include code review, helping out
users on the mailing lists, documentation, etc.

=== Project Management Committee ===
The PMC is responsible to the board and the ASF for the management
and oversight of the Apache Pig codebase. The responsibilities
of the PMC include
 * Deciding what is distributed as products of the Apache Pig project.  In 
particular all releases must be approved by the PMC.
 * Maintaining the project's shared resources, including the codebase 
repository, mailing lists, websites.
 * Speaking on behalf of the project.
 * Resolving license disputes regarding products of the project.
 * Nominating new PMC members and committers.
 * Maintaining these bylaws and other guidelines of the project.

Membership of the PMC is by invitation only and must be approved by a
lazy consensus of active PMC members. A PMC member is considered
'emeritus' by their own declaration or by not contributing in
any form to the project for over six months. An emeritus member may
request reinstatement to the PMC, which will be sufficient
to restore him or her to active PMC member status.

'''NOTE''': Change from Hadoop proposal, added phrase "which will

[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by daijy

2010-09-23 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=157&rev2=158

--

  ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. ||
  ||2254 ||Currently merged cogroup is not supported after blocking operators. 
||
  ||2255 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It 
should have 2. ||
+ ||2256 ||Cannot remove and reconnect node with multiple inputs/outputs ||
  
  ||2998 ||Unexpected internal error. ||
  ||2999 ||Unhandled internal error. ||


[Pig Wiki] Update of "SemanticsCleanup" by AlanGates

2010-09-21 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "SemanticsCleanup" page has been changed by AlanGates.
http://wiki.apache.org/pig/SemanticsCleanup?action=diff&rev1=2&rev2=3

--

  || [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || 
Cogroup inner does not match the semantics of inner join.  It is also not clear 
what value the inner keyword has for cogroup. Consider removing it. || ||
  || [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Nested 
types || Remove two level access || Maybe, if we can find a way to ignore calls 
to Schema.isTwoLevelAccessRequired(). ||
  || [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || 
Pick one semantic for schema merges and use it consistently throughout Pig || 
no ||
+ || [[https://issues.apache.org/jira/browse/PIG-1371|PIG-1371]] || Nested 
types || unknown || ||
  || [[https://issues.apache.org/jira/browse/PIG-1341|PIG-1341]] || Dynamic 
type binding || Close as won't fix || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-1281|PIG-1281]] || Dynamic 
type binding || In situations where a Hadoop shuffle key is assumed to be of 
type bytearray wrap the value in a tuple so that if the type is actually 
something else Hadoop can still process it. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-1277|PIG-1277]] || Nested 
types || Unknown || ||
+ || [[https://issues.apache.org/jira/browse/PIG-1222|PIG-1222]] || Dynamic 
type binding || The issue here is that Pig thinks the field is a bytearray 
while BinStorage actually produces a String.  Need a way to handle these issues 
on the fly. || ||
  || [[https://issues.apache.org/jira/browse/PIG-1188|PIG-1188]] || Schema || 
Make sure Pig handles missing data in Tuples by returning a null rather than 
failing. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-1112|PIG-1112]] || Schema || 
When user provides AS to flatten of undefined bag or tuple, the contents of 
that AS are taken to be the schema of the bag or tuple. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-1065|PIG-1065]] || Dynamic 
type binding ||  In situations where a Hadoop shuffle key is assumed to be of 
type bytearray wrap the value in a tuple so that if the type is actually 
something else Hadoop can still process it. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-999|PIG-999]] || Dynamic type 
binding ||  In situations where a Hadoop shuffle key is assumed to be of type 
bytearray wrap the value in a tuple so that if the type is actually something 
else Hadoop can still process it. || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-847|PIG-847]] || Nested types 
|| Remove two level access || maybe ||
+ || [[https://issues.apache.org/jira/browse/PIG-828|PIG-828]] || Nested types 
|| According to the rules of Pig Latin, this should produce a bag with one 
field.  Need to make sure that is what Pig is trying to do in this case. || yes 
||
  || [[https://issues.apache.org/jira/browse/PIG-767|PIG-767]] || Nested types 
|| Remove two level access; bring DUMP and DESCRIBE output into sync. || no ||
+ || [[https://issues.apache.org/jira/browse/PIG-749|PIG-749]] || Schema || 
Related to PIG-1112 || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-730|PIG-730]] || Nested types 
|| Make sure schema of union is the same as schema before union (suspect this is 
a two level access issue) || unclear ||
  || [[https://issues.apache.org/jira/browse/PIG-723|PIG-723]] || Nested types 
|| Suspect this is a two level access issue || unclear ||
  || [[https://issues.apache.org/jira/browse/PIG-696|PIG-696]] || Dynamic type 
binding || Class cast exceptions such as this should result in a null value and 
a warning, not a failure. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-694|PIG-694]] || Nested types 
|| Determine the semantics for merging tuples and bags. || unclear ||
+ || [[https://issues.apache.org/jira/browse/PIG-678|PIG-678]] || Grammar || 
Decide whether we want to support this extension. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-621|PIG-621]] || Dynamic type 
binding || Class cast exceptions such as this should result in a null value and 
a warning, not a failure. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-435|PIG-435]] || Schema || 
Decide definitely on what it means when users declare a schema for a load. || 
unclear ||
  || [[https://issues.apache.org/jira/browse/PIG-333|PIG-333]] || Dynamic type 
binding || Since it is specified that MIN and MAX treat unknown types as 
double, all the actual string data should be converted to NULLs, rather than 
cause errors. || yes ||
  || [[https://issues.apache.org/jira/browse/PIG-313|PIG-313]] || Grammar || I 
propose that we continue not supporting this.  But we should detect it at 
compile time rather than at runtime. || yes ||
+ 
+ Bugs I need to 

[Pig Wiki] Update of "SemanticsCleanup" by AlanGates

2010-09-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "SemanticsCleanup" page has been changed by AlanGates.
http://wiki.apache.org/pig/SemanticsCleanup?action=diff&rev1=1&rev2=2

--

  The bugs have been placed into the following categories:
   * Schema:  These are related to schemas that are improperly inferred, etc.
   * Grammar:  Places where the grammar is unclear or produces unexpected 
results.
-  * Two Level Access:  The concept of two level access was introduced long ago 
to deal with oddities in bag schemas.  Ideally we will remove this.  At least 
we have to improve it.
+  * Nested Types:  Issues dealing with bags, tuples, and maps.
+  * Dynamic Type Binding:  In certain situations Pig assumes a value to be of 
type byte array when it does not know the actual type, and handles whatever 
actual type it is at runtime.  There are situations where this does not work 
properly.
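A minimal Pig Latin sketch of this situation (hypothetical input file and field positions): with no schema declared, the fields default to bytearray and the actual type is only bound at runtime.

{{{
-- no schema declared, so $0 and $1 are bytearrays until runtime
A = load 'data';
-- Pig inserts a cast to the type the expression needs; if the underlying
-- value turns out to be something else, it must be handled at runtime
B = foreach A generate $0, $1 + 1;
dump B;
}}}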
  
  == Bug Table ==
- || *JIRA* || *Category* || *Proposed Solution* ||
+ || '''JIRA''' || '''Category''' || '''Proposed Solution''' || '''Backward 
Compatible''' ||
- || [[https://issues.apache.org/jira/browse/PIG-1627|PIG-1627]] || Schema || 
Flattening a bag with an unknown schema should produce a record with an unknown 
schema ||
+ || [[https://issues.apache.org/jira/browse/PIG-1627|PIG-1627]] || Schema || 
Flattening a bag with an unknown schema should produce a record with an unknown 
schema || no ||
- || [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || 
Cogroup inner does not match the semantics of inner join.  It is also not clear 
what value the inner keyword has for cogroup. ||
+ || [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || 
Cogroup inner does not match the semantics of inner join.  It is also not clear 
what value the inner keyword has for cogroup. Consider removing it. || ||
- || [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Two level 
access || Remove two level access ||
+ || [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Nested 
types || Remove two level access || Maybe, if we can find a way to ignore calls 
to Schema.isTwoLevelAccessRequired(). ||
- || [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || 
Pig one semantic for schema merges and use it consistently throughout Pig ||
+ || [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || 
Pick one semantic for schema merges and use it consistently throughout Pig || 
no ||
+ || [[https://issues.apache.org/jira/browse/PIG-1341|PIG-1341]] || Dynamic 
type binding || Close as won't fix || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-1281|PIG-1281]] || Dynamic 
type binding || In situations where a Hadoop shuffle key is assumed to be of 
type bytearray wrap the value in a tuple so that if the type is actually 
something else Hadoop can still process it. || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-1277|PIG-1277]] || Nested 
types || Unknown || ||
+ || [[https://issues.apache.org/jira/browse/PIG-1188|PIG-1188]] || Schema || 
Make sure Pig handles missing data in Tuples by returning a null rather than 
failing. || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-1112|PIG-1112]] || Schema || 
When user provides AS to flatten of undefined bag or tuple, the contents of 
that AS are taken to be the schema of the bag or tuple. || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-1065|PIG-1065]] || Dynamic 
type binding ||  In situations where a Hadoop shuffle key is assumed to be of 
type bytearray wrap the value in a tuple so that if the type is actually 
something else Hadoop can still process it. || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-999|PIG-999]] || Dynamic type 
binding ||  In situations where a Hadoop shuffle key is assumed to be of type 
bytearray wrap the value in a tuple so that if the type is actually something 
else Hadoop can still process it. || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-767|PIG-767]] || Nested types 
|| Remove two level access; bring DUMP and DESCRIBE output into sync. || no ||
+ || [[https://issues.apache.org/jira/browse/PIG-730|PIG-730]] || Nested types 
|| Make sure schema of union is the same as schema before union (suspect this is 
a two level access issue) || unclear ||
+ || [[https://issues.apache.org/jira/browse/PIG-723|PIG-723]] || Nested types 
|| Suspect this is a two level access issue || unclear ||
+ || [[https://issues.apache.org/jira/browse/PIG-696|PIG-696]] || Dynamic type 
binding || Class cast exceptions such as this should result in a null value and 
a warning, not a failure. || yes ||
+ || [[https://issues.apache.org/jira/browse/PIG-694|PIG-694]] || Nested types 
|| Determine the semantics for merging tuples and bags. || unclear ||
+ || [[https://issues.apache.org/jira/browse/PIG-621|PIG-621]] || Dynamic type 
binding |

[Pig Wiki] Update of "SemanticsCleanup" by AlanGates

2010-09-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "SemanticsCleanup" page has been changed by AlanGates.
http://wiki.apache.org/pig/SemanticsCleanup

--

New page:
== Introduction ==
A number of bugs have been filed against Pig that roughly fall under the area 
of poorly defined or undefined semantics.  In the 0.9 Pig release
we would like to take on a number of these issues, clarifying semantics where 
they are unclear, defining them where they are undefined, and
correcting them where they are clearly wrong.  This page classifies the 
existing bugs and indicates what we believe the proper fix is for
them.

== Categories ==
The bugs have been placed into the following categories:
 * Schema:  These are related to schemas that are improperly inferred, etc.
 * Grammar:  Places where the grammar is unclear or produces unexpected results.
 * Two Level Access:  The concept of two level access was introduced long ago 
to deal with oddities in bag schemas.  Ideally we will remove this.  At least 
we have to improve it.

== Bug Table ==
|| *JIRA* || *Category* || *Proposed Solution* ||
|| [[https://issues.apache.org/jira/browse/PIG-1627|PIG-1627]] || Schema || 
Flattening a bag with an unknown schema should produce a record with an unknown 
schema ||
|| [[https://issues.apache.org/jira/browse/PIG-1584|PIG-1584]] || Grammar || 
Cogroup inner does not match the semantics of inner join.  It is also not clear 
what value the inner keyword has for cogroup. ||
|| [[https://issues.apache.org/jira/browse/PIG-1538|PIG-1538]] || Two level 
access || Remove two level access ||
|| [[https://issues.apache.org/jira/browse/PIG-1536|PIG-1536]] || Schema || Pig 
one semantic for schema merges and use it consistently throughout Pig ||



[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by nirajrai

2010-09-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=156&rev2=157

--

  ||2252 ||Base loader in Cogroup must implement CollectableLoadFunc. ||
  ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. ||
  ||2254 ||Currently merged cogroup is not supported after blocking operators. 
||
- ||2255 ||Base loader in Cogroup must implement CollectableLoadFunc. ||
- ||2256 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It 
should have 2. ||
+ ||2255 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It 
should have 2. ||
  
  ||2998 ||Unexpected internal error. ||
  ||2999 ||Unhandled internal error. ||


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by nirajrai

2010-09-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=155&rev2=156

--

  ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. ||
  ||2254 ||Currently merged cogroup is not supported after blocking operators. 
||
  ||2255 ||Base loader in Cogroup must implement CollectableLoadFunc. ||
+ ||2256 ||POSkewedJoin operator has " + compiledInputs.length + " inputs. It 
should have 2. ||
  
  ||2998 ||Unexpected internal error. ||
  ||2999 ||Unhandled internal error. ||


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by nirajrai

2010-09-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=154&rev2=155

--

  ||2247 ||Cannot determine skewed join schema ||
  ||2248 ||twoLevelAccessRequired==true is not supported with" +"and 
isSubNameMatch==true. ||
  ||2249 ||While using 'collected' on group; data must be loaded via loader 
implementing CollectableLoadFunc. ||
+ ||2250 ||Blocking operators are not allowed before Collected Group. Consider 
dropping using 'collected'. ||
+ ||2251 ||Merge Cogroup work on two or more relations. To use map-side 
group-by on single relation, use 'collected' qualifier. ||
+ ||2252 ||Base loader in Cogroup must implement CollectableLoadFunc. ||
+ ||2253 ||Side loaders in cogroup must implement IndexableLoadFunc. ||
+ ||2254 ||Currently merged cogroup is not supported after blocking operators. 
||
+ ||2255 ||Base loader in Cogroup must implement CollectableLoadFunc. ||
  
  ||2998 ||Unexpected internal error. ||
  ||2999 ||Unhandled internal error. ||


[Pig Wiki] Update of "PigJournal" by AlanGates

2010-09-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=11&rev2=12

--

  || Feature || JIRA
|| Comments ||
  || Boolean Type|| 
[[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
  || Make Illustrate Work|| 
[[https://issues.apache.org/jira/browse/PIG-502|PIG-502]], 
[[https://issues.apache.org/jira/browse/PIG-534|PIG-534]], 
[[https://issues.apache.org/jira/browse/PIG-903|PIG-903]], 
[[https://issues.apache.org/jira/browse/PIG-1066|PIG-1066]] || ||
- || Better Parser and Scanner Technology|| many || ||
+ || Better Parser and Scanner Technology|| 
[[https://issues.apache.org/jira/browse/PIG-1618|PIG-1618]] || ||
  || Clarify Pig Latin Semantics || many || ||
  || Extending Pig to Include Branching, Looping, and Functions || 
TuringCompletePig || ||
  


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by nirajrai

2010-09-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=153&rev2=154

--

  ||2245 ||Cannot get schema from loadFunc ||
  ||2246 ||Error merging schema ||
  ||2247 ||Cannot determine skewed join schema ||
+ ||2248 ||twoLevelAccessRequired==true is not supported with" +"and 
isSubNameMatch==true. ||
- ||2248 ||While using 'collected' on group; data must be loaded via loader 
implementing CollectableLoadFunc. ||
+ ||2249 ||While using 'collected' on group; data must be loaded via loader 
implementing CollectableLoadFunc. ||
- 
  
  ||2998 ||Unexpected internal error. ||
  ||2999 ||Unhandled internal error. ||


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by nirajrai

2010-09-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=152&rev2=153

--

  ||2245 ||Cannot get schema from loadFunc ||
  ||2246 ||Error merging schema ||
  ||2247 ||Cannot determine skewed join schema ||
+ ||2248 ||While using 'collected' on group; data must be loaded via loader 
implementing CollectableLoadFunc. ||
  
  
  ||2998 ||Unexpected internal error. ||


[Pig Wiki] Update of "PigJournal" by AlanGates

2010-09-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=10&rev2=11

--

  || Make configuration available to UDFs || 0.6
  || ||
  || Load Store Redesign  || 0.7
  || ||
  || Pig Mix 2.0  || not yet released   
  || ||
+ || Rewrite Logical Optimizer|| not yet released   
  || ||
+ || Cleanup of javadocs  || not yet released   
  || ||
+ || UDFs in scripting languages  || not yet released   
  || ||
+ || Ability to specify a custom partitioner  || not yet released   
  || ||
+ || Pig usage stats collection   || not yet released   
  || ||
+ || Make Pig available via Maven || not yet released   
  || ||
+ || Standard UDFs Pig Should Provide || not yet released   
  || ||
+ || Add Scalars To Pig Latin || not yet released   
  || ||
+ || Run Map Reduce Jobs Directly From Pig|| not yet released   
  || ||
  
  == Work in Progress ==
  This covers work that is currently being done.  For each entry the main JIRA 
for the work is referenced.
  
  || Feature || JIRA
|| Comments ||
  || Boolean Type|| 
[[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
+ || Make Illustrate Work|| 
[[https://issues.apache.org/jira/browse/PIG-502|PIG-502]], 
[[https://issues.apache.org/jira/browse/PIG-534|PIG-534]], 
[[https://issues.apache.org/jira/browse/PIG-903|PIG-903]], 
[[https://issues.apache.org/jira/browse/PIG-1066|PIG-1066]] || ||
+ || Better Parser and Scanner Technology|| many || ||
+ || Clarify Pig Latin Semantics || many || ||
+ || Extending Pig to Include Branching, Looping, and Functions || 
TuringCompletePig || ||
+ 
- || Query Optimizer || 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]  || ||
- || Cleanup of javadocs || 
[[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
- || UDFs in scripting languages || 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
- || Ability to specify a custom partitioner || 
[[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
- || Pig usage stats collection  || 
[[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], 
[[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], 
[[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], 
[[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
- || Make Pig available via Maven|| 
[[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
- || Standard UDFs Pig Should Provide|| 
[[https://issues.apache.org/jira/browse/PIG-1405|PIG-1405]] || ||
- || Add Scalars To Pig Latin|| 
[[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]] || ||
- || Run Map Reduce Jobs Directly From Pig   || 
[[https://issues.apache.org/jira/browse/PIG-506|PIG-506]]   || ||
  
  == Proposed Future Work ==
  Work that the Pig project proposes to do in the future is further broken into 
three categories:
@@ -74, +79 @@

  Within each subsection order is alphabetical and does not imply priority.
  
  === Agreed Work, Agreed Approach ===
-  Make Illustrate Work 
- Illustrate has become Pig's ignored step-child.  Users find it very useful, 
but developers have not kept it up to date with new features (e.g. it does not 
work with merge join).  Also, the way it is currently
- implemented it has code in many of Pig's physical operators.  This means the 
code is more complex and burdened with branches, making it harder to maintain.  
It also means that when doing new development it is
- easy to forget about illustrate.  Illustrate needs to be redesigned in such a 
way that it does not add complexity to physical operators and that as new 
operators are developed it is necessary and easy to add
- illustrate functionality to them.  Tests for illustrate also need to be added 
to the test suite so that it is no broken unintentionally.
- 
- '''Category:'''  Usability
- 
- '''Dependency:''' 
- 
- '''References:''' 
- 
- '''Estimated Development Effort:'''  medium
- 
   Combiner Not Used with Limit or Filter 
  Pig Scripts that have a foreach with a nested limit or filter do not use the 
combiner even when they could.  Not all filters can use the combiner, but in 
some cases
  they can.  I think all limits could at least apply the limit i
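A hedged Pig Latin sketch (relation and field names are made up) of the kind of script being discussed, a foreach with a nested filter and limit over a grouped relation:

{{{
A = load 'clicks' as (user:chararray, clicks:int);
B = group A by user;
C = foreach B {
    -- nested filter and limit; scripts shaped like this currently skip the combiner
    D = filter A by clicks > 0;
    E = limit D 10;
    generate group, COUNT(E);
};
store C into 'output';
}}}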

[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by daijy

2010-09-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=151&rev2=152

--

  ||2242 ||TypeCastInserter invoked with an invalid operator ||
  ||2243 ||Attempt to remove operator that is still connected to other 
operators ||
  ||2244 ||Hadoop does not return any error message ||
+ ||2245 ||Cannot get schema from loadFunc ||
+ ||2246 ||Error merging schema ||
+ ||2247 ||Cannot determine skewed join schema ||
  
  
  ||2998 ||Unexpected internal error. ||


[Pig Wiki] Update of "Howl/HowlCliFuncSpec" by Ashutosh Chauhan

2010-09-02 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Howl/HowlCliFuncSpec" page has been changed by AshutoshChauhan.
http://wiki.apache.org/pig/Howl/HowlCliFuncSpec

--

New page:
== Howl CLI Functional Specification ==
This wiki page outlines what is supported by the Howl CLI.

Hive's DDL spec at http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL outlines 
the allowed operations. This wiki describes which of those are and are not allowed 
from the Howl CLI and, for those that are allowed, how they differ from Hive's CLI.

 CREATE TABLE 
 * STORED AS clause which is currently defined as:

[STORED AS file_format] file_format:

 . : SEQUENCEFILE | TEXTFILE | RCFILE  | INPUTFORMAT input_format_classname 
OUTPUTFORMAT output_format_classname

will be modified to support

[STORED AS file_format] file_format:

 . : RCFILE  | INPUTFORMAT input_format_classname OUTPUTFORMAT 
output_format_classname INPUTDRIVER input_driver_classname OUTPUTDRIVER 
output_driver_classname
  * CREATE TABLE command must contain a "STORED AS" clause; if it doesn't, it will 
result in an exception "Operation not supported. Create table doesn't contain 
STORED AS clause. Please provide one."
  * If table is partitioned, then user provides partition columns. These 
columns can only be of type String.
  * CLUSTERED BY clause is not supported. If provided will result in an 
exception "Operation not supported. CLUSTERED BY is not supported."

CREATE TABLE AS SELECT

 * Not Supported. Throws an exception with message "Operation Not Supported".

CREATE TABLE LIKE

 * Allowed only if existing table was created using Howl. Else, throws an 
exception "Operation not supported. Table table name should have been created 
through Howl. Seems like its not."

DROP TABLE

 * Behavior same as of Hive.

 ALTER TABLE 
ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] 
partition_spec [ LOCATION 'location2' ] ...

 . partition_spec:
  . : PARTITION (partition_col = partition_col_value, partition_col = 
partition_col_value, ...)
   * Allowed only if TABLE table_name was created using Howl. Else, throws an 
exception "Operation not supported. Partitions can be added only to tables 
through Howl."

Alter Table File Format

ALTER TABLE table_name SET FILEFORMAT file_format

Here file_format must be same as the one described above in CREATE TABLE. Else, 
throw an exception "Operation not supported. Not a valid file format."

 * CLUSTERED BY clause is not supported. If provided will result in an 
exception "Operation not supported. CLUSTERED BY is not supported."

Change Column Name/Type/Position/Comment

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type 
[COMMENT col_comment] [FIRST|AFTER column_name]

 * Not supported. Throws an exception with message "Operation Not Supported".

Add/Replace Columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT 
col_comment], ...)

 * ADD Columns is allowed. Behavior same as of Hive.
 * Replace column is not supported. Throws an exception with message "Operation 
Not Supported".

Alter Table Touch

ALTER TABLE table_name TOUCH; ALTER TABLE table_name TOUCH PARTITION 
partition_spec;

 * Not Supported. Throws an exception with message "Operation Not Supported".

= CREATE VIEW =
 * Not Supported. Throws an exception with message "Operation Not Supported".

= DROP VIEW =
 * Not Supported. Throws an exception with message "Operation Not Supported".

= ALTER VIEW =
 * Not Supported. Throws an exception with message "Operation Not Supported".

= SHOW TABLES =
 * Behavior same as of Hive.

= SHOW PARTITIONS =
 * Behavior same as of Hive.

= SHOW FUNCTIONS =
 * Not Supported. Throws an exception with message "Operation Not Supported".

= DESCRIBE =
 * Behavior same as of Hive.

Any other commands apart from one listed above will result in an exception with 
message "Operation Not Supported".

 User Interface for Howl 
It will support following four command line options:

 * -g : Usage is -g mygroup. This indicates to Howl that the table to be created 
must have "mygroup" as its group.
 * -p : Usage is -p rwxr-xr-x. This indicates to Howl that the table to be created 
must have permissions "rwxr-xr-x".
 * -f : Usage is -f myscript.howl. This indicates to Howl that myscript.howl is a 
file containing DDL commands it needs to execute.
 * -e : Usage is -e 'create table mytable(a int);'. This indicates to Howl to 
treat the given string as a DDL command and execute it.

Notes:

 * -g and -p options are not mandatory. If not supplied and command contains a 
CREATE TABLE which is successful, user will be told with what permissions and 
in which group her table is created. This will be printed on stdout. Message 
will read as "Table tablename is created

[Pig Wiki] Update of "HowlSecurity" by AlanGates

2010-09-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowlSecurity" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowlSecurity

--

New page:
This page will outline design of Howl Security. 

== Related Hive Work ==
[[https://issues.apache.org/jira/browse/HIVE-78|Jira for authorization support 
in Hive]]

== Authorization ==

Initially the thought is that Howl will have authorization implemented at some 
level to provide security. The initial implementation will be based on HDFS 
directory permissions. This may be enhanced/replaced by a role based model in a 
later release.
   
=== Permissions ===
The initial idea for authorization in Howl is to use the HDFS permissions to 
authorize metadata operations. To be able to do this, we would like to extend 
createTable() to add the ability to record a different group from the user's 
primary group and to record the complete Unix permissions on the table 
directory. Also, we would like to have a way for partition directories to 
inherit permissions and group information based on the table directory. To keep 
the metastore backward compatible for use with Hive, I propose having conf 
variables to achieve these objectives:
 * `table.group.name` : value will indicate the name of the Unix group for the 
table directory. This will be used by `createTable()` to perform a chgrp to the 
value provided. This property will provide the user the ability to choose from 
one of the many Unix groups he is part of to associate with the table.
 * `table.permissions` : value will be of the form `rwxrwxrwx` to indicate 
read-write-execute permissions on the table directory. This will be used by 
`createTable()` to perform a chmod to the value provided. This will let the 
user decide what permissions he wants on the table.
 * `partitions.inherit.permissions` : a value of true will indicate that 
partitions inherit the group name and permissions of the table level directory. 
 This will be used by `addPartition()` to perform a chgrp and chmod to the 
values as on the table directory.

Conf properties are preferable over API changes since the complete 
authorization design for Hive is not finalized yet. These properties can be 
deprecated/removed when that is in place. These properties would also be useful 
to some installations of vanilla Hive since at least DFS-level authorization can 
now be achieved by Hive without the user having to manually perform chgrp and 
chmod operations on DFS.

=== Reading data(Select)/Writing data (Insert) ===
This will simply be governed by the dfs permission at the time of the read and 
will result in runtime errors if the user does not have permissions.

=== Create table ===

 Internal/External table without location specified 
If the user has permissions to the directory pointed by 
`hive.metastore.warehouse.dir` then he can create the table. 

 Internal/External table with location specified 
If the user has permissions to the location specified then he can create the 
table.

=== Drop Table ===
A user can drop a table (internal or external) only if he has write permissions 
to the table directory. A user could have write permission either by virtue of 
him being the owner of the table or through the group he belongs
to. So if the permissions on the table directory allow him to write to it, he 
can drop the table.

=== Partition permissions ===
Partition directories will inherit the permissions/(owner,group) of the table 
directory.

=== Alter table ===
A user can "alter" table if he has write permissions on the table directory. So 
any of the following alter table commands are allowed only if the user has 
write permissions on the table directory:
 * `ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] 
partition_spec [ LOCATION 'location2' ] ...`
 * `ALTER TABLE table_name DROP partition_spec, partition_spec,...`
 * `ALTER TABLE table_name RENAME TO new_table_name`
 * `ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name 
column_type [COMMENT col_comment] [FIRST|AFTER column_name]`
 * `ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT 
col_comment], ...)`
 * `ALTER TABLE table_name SET TBLPROPERTIES table_properties`
 * `ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES 
serde_properties]`
 * `ALTER TABLE table_name SET SERDEPROPERTIES serde_properties`
 * `ALTER TABLE table_name SET FILEFORMAT file_format`
 * `ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY 
(col_name, ...)] INTO num_buckets BUCKETS`
 * `ALTER TABLE table_name TOUCH;`
 * `ALTER TABLE table_name TOUCH PARTITION partition_spec;`

=== Show tables ===
Since the top level warehouse dir will have read/write permissions for 
everyone, show tables will show all tables to all users.

=== Show Table/Partitions Extended ===
A user can issue "show table/parti

[Pig Wiki] Update of "PigJournal" by AlanGates

2010-09-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=9&rev2=10

--

  '''Estimated Development Effort:'''  medium
  
  === Agreed Work, Unknown Approach ===
+  Support Append in Pig 
+ Appending to HDFS files is supported in Hadoop 0.21.  None of Pig's standard 
store functions support append.  We need to decide if append is added to 
+ the language itself (is there an APPEND modifier to the STORE command?) or if 
each store function needs to decide how to indicate or allow appending on its 
own.  !PigStorage
+ should support append as users are likely to want it.
+ 
+ '''Category:'''  New Functionality
+ 
+ '''Dependency:''' Hadoop 0.21 or later
+ 
+ '''References:'''
+ 
+ '''Estimated Development Effort:'''  small
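Purely to illustrate the open question above, a hypothetical Pig Latin sketch of what a language-level modifier might look like; the APPEND keyword below does not exist in Pig today, and the alternative is for each store function to expose its own append mechanism.

{{{
A = load 'new_events' as (user:chararray, ts:long);
-- hypothetical APPEND modifier: add to existing output instead of failing
store A into 'events' using PigStorage() APPEND;
}}}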
+ 
+ 
   Move Piggybank out of Contrib 
  Currently Pig hosts Piggybank (our repository of user contributed UDFs) as 
part of our contrib.  This is not ideal for a couple of reasons.  One, it means 
those who
  wish to share their UDFs have to go through the rigor of the patch process.  
Two, since contrib is tied to releases of the main product, there is no way for 
users


[Pig Wiki] Update of "HowlJournal" by AlanGates

2010-08-24 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowlJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowlJournal?action=diff&rev1=1&rev2=2

--

  
  '''Authorization'''<> The initial proposal is to use HDFS permissions to 
determine whether Howl operations can be executed.  For example, it would not 
be possible to drop a table unless the user had write permissions on the 
directory holding that table.  We need to determine how to extend this model to 
data not stored in HDFS (e.g. Hbase) and objects that do not exist in HDFS 
(e.g. views).  See HowlSecurity for more information.
  
+ '''Dynamic Partitioning'''<> Currently Howl can only store data into one 
partition at a time.  It needs to support
+ spraying to multiple partitions in one write.
+ 
  '''Non-partition Predicate Pushdown'''<> Since in the future storage 
formats (such as RCFile) should support predicate pushdown, Howl needs to be 
able to push predicates into the storage layer when appropriate.
  
  '''Notification'''<> Add ability for systems such as work flow to be 
notified when new data arrives in Howl.  This will be designed around a few 
systems receiving notification, not large numbers of users receiving 
notifications (i.e. we will not be building a general purpose publish/subscribe 
system).  One solution to this might be an RSS feed or similar simple service.


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by nirajrai

2010-08-23 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by nirajrai.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=149&rev2=150

--

  ||2241||UID is not found in the schema ||
  ||2242||TypeCastInserter invoked with an invalid operator||
  ||2243||Attempt to remove operator that is still connected to other 
operators||
+ |2244||hadoop does not return any error message||
  ||2998||Unexpected internal error.||
  ||2999||Unhandled internal error.||
  ||3000||IOException caught while compiling POMergeJoin||


[Pig Wiki] Update of "NativeMapReduce" by ThejasNair

2010-08-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by ThejasNair.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=11&rev2=12

--

  A = load 'WordcountInput.txt';
  B = MAPREDUCE wordcount.jar Store A into 'inputDir' Load 'outputDir' as 
(word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
  }}}
+ 
+ Note that the files specified as input and output locations in the MAPREDUCE 
statement will NOT be deleted by Pig automatically. The user has to delete them 
manually.
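For example, a sketch of the manual cleanup this implies, reusing the directory names from the wordcount example above; rmf is the Grunt/Pig script command that removes a path and does not complain if it is missing.

{{{
-- the MAPREDUCE statement leaves these behind, so remove them before a re-run
rmf inputDir
rmf outputDir
}}}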
  
  == Comparison with similar features ==
  === Pig Streaming ===


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-08-20 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=10&rev2=11

--

  == Introduction ==
  Pig needs to provide a way to natively run map reduce jobs written in java 
language.
  There are some advantages of this-
-  1. The advantages of the ''mapreduce'' keyword are that the user need not be 
worried about coordination between the jobs, pig will take care of it.
+  1. The advantages of the ''mapreduce'' statement are that the user need not 
be worried about coordination between the jobs, pig will take care of it.
   2. User can make use of existing java applications without being a java 
programmer.
  
  == Syntax ==
@@ -25, +25 @@

  
  params are extra parameters required for native mapreduce job.
  
- mymr.jar is any mapreduce jar file which can be run through '''"hadoop -jar 
mymr.jar params"''' command. Thus, the contract for ''inputLocation'' and 
''outputLocation'' is typically managed through ''params''. 
+ mymr.jar is any mapreduce jar file which can be run through '''"hadoop jar 
mymr.jar params"''' command. Thus, the contract for ''inputLocation'' and 
''outputLocation'' is typically managed through ''params''. 
  
  For Example, to run wordcount mapreduce program from Pig, we write
  {{{


[Pig Wiki] Update of "Howl" by AlanGates

2010-08-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Howl" page has been changed by AlanGates.
http://wiki.apache.org/pig/Howl?action=diff&rev1=2&rev2=3

--

  be changed.  And old data will not need to be converted.  If there is a 
monthly Pig Latin script that rolls up daily raw events, Howl will handle the 
fact that some of the
  data is stored in text and some in RCFile and present a single stream to Pig 
for processing.
  
+ == Join Us ==
+ Currently Howl's code is hosted at github:  http://github.com/yahoo/howl
+ 
+ Howl issues are discussed on howl...@yahoogroups.com.  You can join it by 
sending mail to howldev-subscr...@yahoogroups.com
+ 


[Pig Wiki] Update of "HowlJournal" by AlanGates

2010-08-19 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowlJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowlJournal

--

New page:
= Howl Journal =

This document tracks the development of Howl.  It summarizes work that has been 
done in previous releases, what is currently being worked on, and proposals for
future work in Howl.

== Completed Work ==

|| Feature|| 
Available in  || Comments ||
|| Read/write of data from Map Reduce || Not 
yet released  ||  ||
|| Read/write of data from Pig|| Not 
yet released  ||  ||
|| Read from Hive || Not 
yet released  ||  ||
|| Support pushdown of columns to be projected into storage format|| Not 
yet released  ||  ||
|| Support for RCFile storage || Not 
yet released  ||  ||

== Work in Progress ==

|| Feature   || Description ||
|| Add a CLI || This will allow users to use Howl without installing 
all of Hive.  The syntax will match that of Hive's DDL. ||
|| Partition pruning || Currently, when asked to return information about a 
table Hive's metastore returns all partitions in the table.  This has a couple 
of issues.  One, for tables with large numbers of partitions it means the 
metadata operation of fetching information about the table is very expensive.  
Two, it makes more sense to have the partition pruning logic in one place 
(Howl) rather than in Hive, Pig, and MR. ||


== Proposed Work ==
'''Authentication'''<> Integrate Howl with security work done on Hadoop so 
that users can be properly authenticated.

'''Authorization'''<> The initial proposal is to use HDFS permissions to 
determine whether Howl operations can be executed.  For example, it would not 
be possible to drop a table unless the user had write permissions on the 
directory holding that table.  We need to determine how to extend this model to 
data not stored in HDFS (e.g. Hbase) and objects that do not exist in HDFS 
(e.g. views).  See HowlSecurity for more information.

'''Non-partition Predicate Pushdown'''<> Since in the future storage 
formats (such as RCFile) should support predicate pushdown, Howl needs to be 
able to push predicates into the storage layer when appropriate.

'''Notification'''<> Add ability for systems such as work flow to be 
notified when new data arrives in Howl.  This will be designed around a few 
systems receiving notification, not large numbers of users receiving 
notifications (i.e. we will not be building a general purpose publish/subscribe 
system).  One solution to this might be an RSS feed or similar simple service.

'''Schema Evolution'''<>  Currently schema evolution in Hive is limited to adding columns at the end of the non-partition-key columns.  It may be desirable to support other forms of schema evolution, such as adding columns in other parts of the record, or making it so that new partitions for a table no longer contain a given column.

'''Support data read across partitions with different storage formats'''<> 
This work is done except that only one storage format is currently supported.

'''Support for more file formats'''<> Additional file formats such as 
sequence file, text, etc. need to be added.

'''Utility APIs'''<> Grid managers will want to build tools that use Howl 
to help manage their grids.  For example, one might build a tool to do 
replication between two grids.  Such tools will want to use Howl's metadata.  
Howl needs to provide an appropriate API for these types of tools.


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-08-18 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=9&rev2=10

--

- = Under Construction =
  #format wiki
  #language en
  
@@ -53, +52 @@

  }}}
  
  === Pig Plans ===
- Logical Plan- Logical Plan creates a LONative operator with an internal plan 
that consists of a store and a load operator. The store operator cannot be 
attached to X at this level as it would start storing X at inputLocation for 
every plan that includes X, which is not intended. Although we can LOLoad 
operator for Y at this point, we delay this to physical plan and track this 
with LONative operator. Since Y has dataflow dependency on X, we make a 
connection between operators corresponding to these aliased at logical plan.
+ Logical Plan- The logical plan creates a LONative operator with an internal plan that consists of a store and a load operator. The store operator cannot be attached to X at this level, as that would start storing X at inputLocation for every plan that includes X, which is not intended. Although we could add the LOLoad operator for Y at this point, we delay this until the mapreduce plan and track it with the LONative operator. Since Y has a dataflow dependency on X, we make a connection between the operators corresponding to these aliases in the logical plan.
  
  {{{
  X = ... ;
@@ -68, +67 @@

  |
  ...
  }}}  
- TypeCastInserter-
+ 
+ TypeCastInserter- This is a mandatory optimizer that adds a foreach with cast operators after a load, so that fields loaded with a declared schema can be converted to the required types. In its absence, we fail with a cast exception after the load completes. Currently, we apply this optimizer to LOLoad and LOStream, as both can be loaded "AS schema". Since the mapreduce clause also corresponds to a load operation, this optimization is applicable to the LONative operator as well.
+ A test case for this scenario is-
+ {{{
+ B = mapreduce 'mapreduce.jar' Store A into 'input' Load 'output' as 
(name:chararray, count:int) `wordcount input output`;
+ C = foreach B generate count+1;
+ }}}
  
  Physical Plan- The logical plan is visited to convert the internal load/store plan into the corresponding physical plan operators, and connections are maintained as per the logical plan.
  {{{
@@ -85, +90 @@

  ...
  }}} 
  
- MapReduce Plan- While compiling the mapreduce plan, with MRCompiler, we 
introduce 
+ MapReduce Plan- While compiling the mapreduce plan with the MRCompiler, we introduce a new MapReduceOper, NativeMapReduceOper, that tracks the presence of a native mapreduce job inside the plan. It also holds the required parameters and the jar name.
  {{{
  X = ... ;
  |
  |
- ||--- (POStore) Store X into 
'inputLocation'
+ |--- (POStore) Store X into 'inputLocation'
+ 
+ --- MR boundary -
- Y = MapReduce ... ;  |
+ Y = MapReduce ... ;
-   (PONative)   --  innnerPlan ---|
+  (NativeMapReduceOper)
- mymr.jar |
+ mymr.jar  
- params   |--- (POLoad) Load 'outputLocation'
+ params
+ --- MR boundary -
+ Y = (POLoad) Load 'outputLocation'
  |
  |
  ...
  }}}
- Inside the JobControlCompiler's compile method if we find the native 
mapreduce operator we run the org.apache.hadoop.util.RunJar's Main method with 
the specified parameters.
+ Inside the JobControlCompiler's compile method, if we find the native mapreduce operator we run org.apache.hadoop.util.RunJar's main method with the specified parameters. We also make sure all of the job's dependencies are obeyed for native jobs.
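
A minimal sketch of this hand-off is shown below; only org.apache.hadoop.util.RunJar.main(String[]) is an existing Hadoop entry point, while the surrounding class, method, and argument layout (jar name first, followed by the params from the MAPREDUCE clause) are illustrative assumptions.

{{{
import org.apache.hadoop.util.RunJar;

public class NativeJobLauncherSketch {
    // Runs the user's jar exactly as "hadoop jar <jar> <params...>" would.
    // Returns 0 on success, non-zero if the native job's driver threw.
    public static int launch(String jarName, String[] params) {
        String[] args = new String[params.length + 1];
        args[0] = jarName;                              // e.g. "mymr.jar"
        System.arraycopy(params, 0, args, 1, params.length);
        try {
            RunJar.main(args);                          // blocks until the driver's main() returns
            return 0;
        } catch (Throwable t) {
            return 1;                                   // driver failure surfaces here
        }
    }
}
}}}

Because RunJar simply invokes the jar's main class, the native driver may call System.exit; the security manager described in the next section is what captures that exit code.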
  
  === Security Manager ===
- hadoop jar command is equivalent to invoking org.apache.hadoop.util.RunJar's 
main function with required arguments. RunJar internally can invoke several 
levels of driver classes before executing the hadoop job (for example- 
hadoop-example.jar). With the 
+ The hadoop jar command is equivalent to invoking org.apache.hadoop.util.RunJar's main function with the required arguments. RunJar can internally invoke several levels of driver classes before executing the hadoop job (for example, hadoop-example.jar). To detect failure or success of the job we need to capture the innermost error value and return it to Pig. To achieve this we install our own RunJarSecurityManager, which delegates security management to the current security manager and captures the innermost exit code.
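
A minimal sketch of such a security manager, using the RunJarSecurityManager name from the design above but with illustrative fields and accessors rather than the actual Pig implementation, could be:

{{{
public class RunJarSecurityManager extends SecurityManager {
    private final SecurityManager delegate = System.getSecurityManager();
    private volatile boolean exitInvoked = false;
    private volatile int exitCode = 0;

    @Override
    public void checkPermission(java.security.Permission perm) {
        if (delegate != null) {
            delegate.checkPermission(perm);   // defer all other checks
        }
    }

    @Override
    public void checkExit(int status) {
        if (!exitInvoked) {                   // remember only the innermost exit code
            exitInvoked = true;
            exitCode = status;
        }
        // prevent the native driver's System.exit from killing Pig itself
        throw new SecurityException("System.exit(" + status + ") trapped by Pig");
    }

    public boolean exitWasInvoked() { return exitInvoked; }
    public int getExitCode() { return exitCode; }
}
}}}

Throwing SecurityException from checkExit keeps a System.exit inside the user's driver from terminating the Pig process, while the recorded status lets Pig report the native job's success or failure.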
  
  === Pig Stats ===
+ Pig Stats are populated by treating the native job as a single instance of a mapreduce job, and progress is reported under the same assumption. As the native job is not under the control of pig, except for the exit code, it is 

[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-08-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=8&rev2=9

--

+ = Under Construction =
  #format wiki
  #language en
  
@@ -18, +19 @@

  To support native mapreduce job pig will support following syntax-
  {{{
  X = ... ;
- Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc AS schema [params, ... ];
+ Y = MAPREDUCE 'mymr.jar' [('other.jar', ...)] STORE X INTO 'inputLocation' 
USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `];
  }}}
  
- This stores '''X''' into the '''storeLocation''' using '''storeFunc''', which 
is then used by native mapreduce to read its data. After we run mymr.jar's 
mapreduce, we load back the data from '''loadLocation''' into alias '''Y''' 
using '''loadFunc'''.
+ This stores '''X''' into the '''inputLocation''' using '''storeFunc''', which 
is then used by native mapreduce to read its data. After we run mymr.jar's 
mapreduce, we load back the data from '''outputLocation''' into alias '''Y''' 
using '''loadFunc''' as '''schema'''.
  
  params are extra parameters required for native mapreduce job.
  
- '''mymr.jar is any mapreduce jar file which can be run through "hadoop -jar 
mymr.jar params" command.'''
+ mymr.jar is any mapreduce jar file which can be run through the '''"hadoop jar mymr.jar params"''' command. Thus, the contract for ''inputLocation'' and ''outputLocation'' is typically managed through ''params''. 
  
  For Example, to run wordcount mapreduce program from Pig, we write
  {{{
  A = load 'WordcountInput.txt';
- B = MAPREDUCE wordcount.jar Store A into 'inputDir' Load 'outputDir' as 
(word:chararray, count: int) org.myorg.WordCount inputDir outputDir;
+ B = MAPREDUCE wordcount.jar Store A into 'inputDir' Load 'outputDir' as 
(word:chararray, count: int) `org.myorg.WordCount inputDir outputDir`;
  }}}
  
  == Comparison with similar features ==
@@ -45, +46 @@

  With native job support, pig can support native map reduce jobs written in the java language that can convert a data set into a different data set after applying custom map reduce functions of any complexity.
  
  == Implementation Details ==
+ 
  {{{
  X = ... ;
- Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
+ Y = MAPREDUCE 'mymr.jar' [('other.jar', ...)] STORE X INTO 'inputLocation' 
USING storeFunc LOAD 'outputLocation' USING loadFunc AS schema [`params, ... `];
  }}}
- Logical Plan- Logical Plan creates a LONative operator with an internal plan 
that consists of a store and a load operator. The store operator cannot be 
attached to X at this level as it would start storing X at storeLocation for 
every plan that includes X which is not intended. Although we can LOLoad 
operator for Y at this point, we delay this to physical plan and track this 
with LONative operator. Also, since Y has dependency on X, we add plan of Y 
whenever we see plan for X in ''registerQuery''.
  
- Physical Plan- Physical Plan adds the internal store to the physical plan and 
connects it to X and also adds the load to the plan with alias Y. Also, it 
creates a dependency between map reduce job for X and native map reduce job, 
and also between native map reduce job and plan having Y (which is a POLoad 
operator). We also create a MapReduceOper (customized) for the native map 
reduce job.
+ === Pig Plans ===
+ Logical Plan- Logical Plan creates a LONative operator with an internal plan 
that consists of a store and a load operator. The store operator cannot be 
attached to X at this level as it would start storing X at inputLocation for 
every plan that includes X, which is not intended. Although we can LOLoad 
operator for Y at this point, we delay this to physical plan and track this 
with LONative operator. Since Y has dataflow dependency on X, we make a 
connection between operators corresponding to these aliased at logical plan.
  
- MapReduce Plan- Inside the JobControlCompiler's compile method if we find the 
native mapreduce operator we can create a thread and run the Main method of 
native map reduce job with the specified parameters. Alternatively, we can call 
into native map reduce job's getJobConf method to get the job conf for the 
native job, then we can add pig specific parameters to this job and then add 
the job inside pig's jobcontrol.
+ {{{
+ X = ... ;
+ |
+ |
+ ||--- (LOStore) Store X into 
'inputLocation'
+ Y = MapReduce ... ;  |
+   (LONative)   --  innerPlan ---|
+ mymr.jar |
+ params   |--- (LOLoad) Load 'outputLocation'
+ |
+ |
+

[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by A niket Mokashi

2010-08-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/UDFsUsingScriptingLanguages?action=diff&rev1=3&rev2=4

--

   '''schemaFunction''' defines delegate function and is not registered to pig.
   
  When no decorator is specified, pig assumes the output datatype as bytearray 
and converts the output generated by script function to bytearray. This is 
consistent with pig's behavior in case of Java UDFs.
+ 
  ''Sample Schema String'' - y:{t:(word:chararray,num:long)}, variable names 
inside schema string are not used anywhere, they are used just to make syntax 
identifiable to the parser.
  
  == Inline Scripts ==
@@ -92, +93 @@

  def percent(num, total):
return num * 100 / total
  
- #CommaFormat-
+ 
+ # String Functions #
+ 
+ #commaFormat- format a number with commas, 12345-> 12,345
  @outputSchema("t:(numformat:chararray)")
  def commaFormat(num):
return '{:,}'.format(num)
  
- 
- # String Functions #
- 
- 
+ #concatMultiple- concat multiple words
+ @outputSchema("t:(numformat:chararray)")
+ def concatMult4(word1, word2, word3, word4):
+   return word1+word2+word3+word4
  
  ###
  # Data Type Functions #
  ###
+ #collectBag- collect elements of a bag into another bag
+ #This is a useful UDF after a group operation
+ @outputSchema("y:{t:(len:int,word:chararray)}")
+ def collectBag(bag):
+   outBag = []
+   for word in bag:
+     tup = (len(bag), word[1])
+     outBag.append(tup)
+   return outBag
  
+ # A few comments-
+ # pig mandates that a bag should be a bag of tuples; python UDFs should follow this pattern.
+ # tuples in python are immutable, so appending to a tuple is not possible.
  
  }}}
- 
  == Performance ==
  === Jython ===
  


[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by A niket Mokashi

2010-08-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/UDFsUsingScriptingLanguages?action=diff&rev1=2&rev2=3

--

  {{{
  Register 'test.py' using jython as myfuncs;
  }}}
- This uses org.apache.pig.scripting.jython.JythonScriptEngine to interpret the 
python script. Users can use custom script engines to support multiple 
languages and ways to interpret them. Currently, pig identifies jython as a 
keyword and ships the required scriptengine (jython) to interpret it.
+ This uses org.apache.pig.scripting.jython.JythonScriptEngine to interpret the 
python script. Users can develop and use custom script engines to support 
multiple programming languages and ways to interpret them. Currently, pig 
identifies jython as a keyword and ships the required scriptengine (jython) to 
interpret it.
  
  Following syntax is also supported -
  {{{
@@ -52, +52 @@

  }}}
  Registering test.py with pig under the myfuncs namespace makes the functions myfuncs.helloworld(), myfuncs.complex(2), and myfuncs.square(2.0) available as UDFs. These UDFs can be used with
  {{{
- b = foreach a generate myfuncs.helloworld, myfuncs.square(3);
+ b = foreach a generate myfuncs.helloworld(), myfuncs.square(3);
  }}}
  
  === Decorators and Schemas ===
- For annotating python script so that pig can identify their return types, we 
use decorators to define output schema for a script UDF. 
+ To annotate python functions so that pig can identify their return types, we use python decorators to define the output schema for a script UDF.
   '''outputSchema''' defines schema for a script udf in a format that pig 
understands and is able to parse. 
   
   '''outputFunctionSchema''' defines a script delegate function that defines 
schema for this function depending upon the input type. This is needed for 
functions that can accept generic types and perform generic operations on these 
types. A simple example is ''square'' which can accept multiple types. 
SchemaFunction for this type is a simple identity function (same schema as 
input).
   
   '''schemaFunction''' defines delegate function and is not registered to pig.
- 
   
- When no decorator is specified, pig assumes the output datatype as bytearray 
and converts the output generated by script function to bytearray. This is 
consistent with pig's behavior in other cases. 
+ When no decorator is specified, pig assumes the output datatype as bytearray 
and converts the output generated by script function to bytearray. This is 
consistent with pig's behavior in case of Java UDFs.
- 
- ''Sample Schema String'' - y:{t:(word:chararray,num:long)}, variable names 
are not used anywhere they are just to make syntax consistent.
+ ''Sample Schema String'' - y:{t:(word:chararray,num:long)}, variable names 
inside schema string are not used anywhere, they are used just to make syntax 
identifiable to the parser.
  
  == Inline Scripts ==
+ As of today, Pig doesn't support UDFs using inline scripts. This feature is 
being tracked at [[#ref4|PIG-1471]].
+ 
+ == Sample Script UDFs ==
+ Simple tasks like string manipulation, mathematical computation, and reorganizing data types can easily be done using python scripts, without having to develop long and complex UDFs in Java. The overall overhead of using a scripting language is much lower and the development cost is almost negligible. The following are a few examples of UDFs developed in python that can be used with Pig.
+ {{{
+ mySampleLib.py
+ -
+ #!/usr/bin/python
+ 
+ ##
+ # Math functions #
+ ##
+ #Square - Square of a number of any data type
+ @outputSchemaFunction("squareSchema")
+ def square(num):
+   return ((num)*(num))
+ @schemaFunction("squareSchema")
+ def squareSchema(input):
+   return input
+ 
+ #Percent- Percentage
+ @outputSchema("t:(percent:double)")
+ def percent(num, total):
+   return num * 100 / total
+ 
+ #CommaFormat-
+ @outputSchema("t:(numformat:chararray)")
+ def commaFormat(num):
+   return '{:,}'.format(num)
+ 
+ 
+ # String Functions #
+ 
+ 
+ 
+ ###
+ # Data Type Functions #
+ ###
+ 
+ 
+ }}}
  
  == Performance ==
  === Jython ===
@@ -78, +117 @@

   1. <> PIG-928, "UDFs in scripting languages", 
https://issues.apache.org/jira/browse/PIG-928
   2. <> Jython, "The jython project", http://www.jython.org/
   3. <> Jruby, "100% pure-java implementation of ruby 
programming language", http://jruby.org/
+  4. <> PIG-1471, "inline UDFs in scripting languages", 
https://issues.apache.org/jira/browse/PIG-1471
  


[Pig Wiki] Update of "PigJournal" by AlanGates

2010-08-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=8&rev2=9

--

  '''Estimated Development Effort:'''  medium
  
  === Agreed Work, Unknown Approach ===
+  Move Piggybank out of Contrib 
+ Currently Pig hosts Piggybank (our repository of user contributed UDFs) as part of our contrib.  This is not ideal for a couple of reasons.  One, it means those who
+ wish to share their UDFs have to go through the rigor of the patch process.  Two, since contrib is tied to releases of the main product, there is no way for users
+ to share functions for older versions or quickly disseminate their new functions.  If Piggybank were instead more similar to CPAN, then users could upload their own
+ packages with little assistance from Pig committers and specify which versions of Pig the function is for.  This could be done via a hosting site such as github.
+ 
+ '''Category:'''  Usability
+ 
+ '''Dependency:'''
+ 
+ '''References:'''
+ 
+ '''Estimated Development Effort:'''  small
+ 
+ 
   Clarify Pig Latin Semantics 
  There are areas of Pig Latin semantics that are not clear or not consistent.  
Take for example, a script like:
  


[Pig Wiki] Update of "PigJournal" by AlanGates

2010-08-11 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=7&rev2=8

--

  project is still open to input on whether and when such work should be done.
  
  == Completed Work ==
- The following table contains a list of features that have been completed, as 
of Pig 0.6
+ The following table contains a list of features that have been completed, as 
of Pig 0.7
  
  || Feature  || Available in 
Release || Comments ||
  || Describe Schema  || 0.1
  || ||
@@ -34, +34 @@

  || Outer join for default, fragment-replicate, skewed   || 0.6
  || ||
  || Make configuration available to UDFs || 0.6
  || ||
  || Load Store Redesign  || 0.7
  || ||
- || Add Owl as contrib project   || not yet released   
  || ||
  || Pig Mix 2.0  || not yet released   
  || ||
  
  == Work in Progress ==
  This covers work that is currently being done.  For each entry the main JIRA 
for the work is referenced.
  
- || Feature  || JIRA   
  || Comments ||
+ || Feature || JIRA
|| Comments ||
- || Boolean Type || 
[[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
+ || Boolean Type|| 
[[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
- || Query Optimizer  || 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]   || ||
+ || Query Optimizer || 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]  || ||
- || Cleanup of javadocs  || 
[[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
+ || Cleanup of javadocs || 
[[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
- || UDFs in scripting languages  || 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
+ || UDFs in scripting languages || 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
- || Ability to specify a custom partitioner  || 
[[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
+ || Ability to specify a custom partitioner || 
[[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
- || Pig usage stats collection   || 
[[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], 
[[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], 
[[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], 
[[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
+ || Pig usage stats collection  || 
[[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], 
[[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], 
[[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], 
[[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
- || Make Pig available via Maven || 
[[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
+ || Make Pig available via Maven|| 
[[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
- 
+ || Standard UDFs Pig Should Provide|| 
[[https://issues.apache.org/jira/browse/PIG-1405|PIG-1405]] || ||
+ || Add Scalars To Pig Latin|| 
[[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]] || ||
+ || Run Map Reduce Jobs Directly From Pig   || 
[[https://issues.apache.org/jira/browse/PIG-506|PIG-506]]   || ||
  
  == Proposed Future Work ==
  Work that the Pig project proposes to do in the future is further broken into 
three categories:
@@ -73, +74 @@

  Within each subsection order is alphabetical and does not imply priority.
  
  === Agreed Work, Agreed Approach ===
+  Make Illustrate Work 
+ Illustrate has become Pig's ignored step-child.  Users find it very useful, 
but developers have not kept it up to date with new features (e.g. it does not 
work with merge join).  Also, the way it is currently
+ implemented, it has code in many of Pig's physical operators.  This means the 
code is more complex and burdened with branches, making it harder to maintain.  
It also means that when doing new development it is
+ easy to forget about illustrate.  Illustrate needs to be redesigned in such a 
way that it does not add complexity to physical operators and that as new 
operators are developed it is necessary and easy to add
+ illustrate functionality to them.  Tests for illustrate also need to be added 
to th

[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by daijy

2010-08-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by daijy.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=148&rev2=149

--

  ||2216||Cannot get field schema||
  ||2217||Problem setFieldSchema||
  ||2218||Invalid resource schema: bag schema must have tuple as its field||
+ ||2219||Attempt to disconnect operators which are not connected||
+ ||2220||Plan in inconsistent state, connected in fromEdges but not toEdges||
+ ||2221||No more walkers to pop||
+ ||2222||Expected LogicalExpressionVisitor to visit expression node||
+ ||2223||Expected LogicalPlanVisitor to visit relational node||
+ ||2224||Found LogicalExpressionPlan with more than one root||
+ ||2225||Projection with nothing to reference||
+ ||2226||Cannot find reference for ProjectExpression||
+ ||2227||LogicalExpressionVisitor expects to visit expression plans||
+ ||2228||Could not find a related project Expression for Dereference||
+ ||2229||Couldn't find matching uid for project expression||
+ ||2230||Cannot get column from project||
+ ||2231||Unable to set index on newly created POLocalRearrange||
+ ||2232||Cannot get schema||
+ ||2233||Cannot get predecessor||
+ ||2234||Cannot get group key schema||
+ ||2235||Expected an ArrayList of Expression Plans||
+ ||2236||User defined load function should implement the LoadFunc interface||
+ ||2237||Unsupported operator in inner plan||
+ ||2238||Expected list of expression plans||
+ ||2239||Structure of schema change||
+ ||2240||LogicalPlanVisitor can only visit logical plan||
+ ||2241||UID is not found in the schema ||
+ ||2242||TypeCastInserter invoked with an invalid operator||
+ ||2243||Attempt to remove operator that is still connected to other 
operators||
  ||2998||Unexpected internal error.||
  ||2999||Unhandled internal error.||
  ||3000||IOException caught while compiling POMergeJoin||


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by PradeepKamath

2010-08-02 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by 
PradeepKamath.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=147&rev2=148

--

  ||1112||Unsupported query: You have a partition column () in a 
construction like: (pcond  and ...) or (pcond and ...) where pcond is a 
condition on a partition column.||
  ||1113||Unable to describe schema for nested expression ||
  ||1114||Unable to find schema for nested alias ||
+ ||1115||Place holder for Howl related errors||
  ||2000||Internal error. Mismatch in group by arities. Expected: . 
Found: ||
  ||2001||Unable to clone plan before compiling||
  ||2002||The output file(s):   already exists||


FrontPage reverted to revision 148 on Pig Wiki

2010-07-30 Thread Apache Wiki
Dear wiki user,

You have subscribed to a wiki page "Pig Wiki" for change notification.

The page FrontPage has been reverted to revision 148 by daijy.
The comment on this change is: remove spam.
http://wiki.apache.org/pig/FrontPage?action=diff&rev1=149&rev2=150

--

* PigDeveloperCookbook
   * Road map
* ProposedRoadMap (2007 document from Yahoo!)
-   * PigJournal (features currently being worked on, ideas for future 
[[http://www.essaybank.com|essay]] development)
+   * PigJournal (features currently being worked on, ideas for future 
development)
   * Specification Proposals
* PigTypesFunctionalSpec
* PigTypesDesign


[Pig Wiki] Update of "FrontPage" by SafiaYardley

2010-07-29 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "FrontPage" page has been changed by SafiaYardley.
http://wiki.apache.org/pig/FrontPage?action=diff&rev1=148&rev2=149

--

* PigDeveloperCookbook
   * Road map
* ProposedRoadMap (2007 document from Yahoo!)
-   * PigJournal (features currently being worked on, ideas for future 
development)
+   * PigJournal (features currently being worked on, ideas for future 
[[http://www.essaybank.com|essay]] development)
   * Specification Proposals
* PigTypesFunctionalSpec
* PigTypesDesign


[Pig Wiki] Update of "FAQ" by daijy

2010-07-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "FAQ" page has been changed by daijy.
http://wiki.apache.org/pig/FAQ?action=diff&rev1=6&rev2=7

--

  C = JOIN A by url, B by url PARALLEL 50. 
  }}}
  
- Even if you do not specify the parallel clause, the framework uses a default 
number of reducers, in the order of 0.9*(number of nodes allocated by user 
-1)*n where n is the number of maximum reduce slots, for running your M/R jobs 
which result from statements such as GROUP, COGROUP, JOIN, and ORDER BY. For 
example, when allocating 3 machines you get about 0.9*2*4 = 7 reducers for 
operating on your parallel jobs. 
+ Besides the PARALLEL clause, you can also use the "set default_parallel" statement in a Pig script, or set the "mapred.reduce.tasks" system property, to specify the default parallelism to use. If none of these values is set, Pig will use only 1 reducer. (In Pig 0.8, we changed the default from 1 reducer to a number calculated by a simple heuristic, to make this more foolproof.) 
  
  '''Q: Can I do a numerical comparison while filtering?'''
  


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-07-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=6&rev2=7

--

  == Introduction ==
  Pig needs to provide a way to natively run map reduce jobs written in java 
language.
  There are some advantages of this-
-  1. The advantages of the ''native'' keyword are that the user need not be 
worried about coordination between the jobs, pig will take care of it.
+  1. The advantage of the ''mapreduce'' keyword is that the user need not worry about coordination between the jobs; pig will take care of it.
   2. User can make use of existing java applications without being a java 
programmer.
  
  == Syntax ==
  To support native mapreduce job pig will support following syntax-
  {{{
  X = ... ;
- Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
+ Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
  }}}
  
  This stores '''X''' into the '''storeLocation''' using '''storeFunc''', which 
is then used by native mapreduce to read its data. After we run mymr.jar's 
mapreduce, we load back the data from '''loadLocation''' into alias '''Y''' 
using '''loadFunc'''.
  
- params are extra parameters required for native mapreduce job (TBD).
+ params are extra parameters required for native mapreduce job.
  
- mymr.jar is complaint with pig specification (see below).
+ mymr.jar is any mapreduce jar file which can be run through the '''"hadoop jar mymr.jar params"''' command.
  
  == Comparison with similar features ==
  === Pig Streaming ===
@@ -38, +38 @@

  
  With native job support, pig can support native map reduce jobs written in the java language that can convert a data set into a different data set after applying custom map reduce functions of any complexity.
  
- == Native Mapreduce job specification ==
- Native Mapreduce job needs to conform to some specification defined by Pig. 
This is required because Pig specifies the input and output directory in the 
script for this job and is responsible for managing the coordination of the 
native job with the remaining pig mapreduce jobs. Pig also might need to 
provide some extra configuration like job name, input/output formats, 
parallelism to the native job. For communicating such parameters to the native 
job, it should be according to specification provided by Pig.
- 
- Following are some of the approaches of achieving this-
-  1. '''Ordered inputLoc/outputLoc parameters'''- This is simplistic approach 
wherein native programs follow up a convention so that their first and second 
parameters are treated as input and output respectively. Pig ''native'' command 
takes the parameters required by the native mapreduce job and passes it to 
native job as command line arguments. It is upto the native program to use 
these parameters for operations it performs.
- Thus, only following lines of code are mandatory inside the native program.
- {{{
- FileInputFormat.setInputPaths(conf, new Path(args[0]));  
- FileOutputFormat.setOutputPath(conf, new Path(args[1]));
- }}}
-  1.#2 '''getJobConf Function'''- Native jobs implement '''getJobConf''' 
method which returns ''org.apache.hadoop.mapred.JobConf'' object so that pig 
can construct a ''job'' and schedule that inside pigs ''jobcontrol'' job. This 
also provides a way to add more pig specific parameters to this job before it 
is submitted. Most of the current native hadoop program create JobConf's and 
run hadoop jobs with ''JobClient.runJob(conf)''. These applications need to 
change their code to a getJobConf function so that pig can hook into them to 
get the conf. This will also allow pig to set the input and output directory 
for the native job.
- For example-
- {{{
- public JobConf getJobConf() {
- JobConf conf = new JobConf(WordCount.class);
- conf.setJobName("wordcount");
- 
- conf.setOutputKeyClass(Text.class);
- conf.setOutputValueClass(IntWritable.class);
- 
- conf.setMapperClass(Map.class);
- conf.setCombinerClass(Reduce.class);
- conf.setReducerClass(Reduce.class);
- 
- conf.setInputFormat(TextInputFormat.class);
- conf.setOutputFormat(TextOutputFormat.class);
- 
- FileInputFormat.setInputPaths(conf, new Path(args[0]));
- FileOutputFormat.setOutputPath(conf, new Path(args[1]));
- }
- public static void main(String[] args) throws Exception { 
- JobClient.runJob(getJobConf());
- }
- }}}
  == Implementation Details ==
  {{{
  X = ... ;
- Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
+ Y = MAPREDUCE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocat

[Pig Wiki] Update of "Conferences" by AlanGates

2010-07-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Conferences" page has been changed by AlanGates.
http://wiki.apache.org/pig/Conferences?action=diff&rev1=3&rev2=4

--

  || NoSQL Summer|| Summer 2010 || 
Multiple world wide   || http://nosqlsummer.org/ || 
   ||  ||
  || Bay Area Hadoop User Group  || Jul 21 2010 || 
Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || 
   ||  ||
  || Apache Asia Roadshow|| Aug 14-15 2010  || 
Shanghai, China   || http://roadshowasia.52ac.com/openconf.php   || 
   ||  ||
+ || Seattle Hadoop Day  || Aug 14-15 2010  || 
Seattle, WA USA   || http://hadoopday2010.eventbrite.com/|| 
   ||  ||
  || Open SQL Camp   || Aug 21-22 2010  || St. 
Augustin, Germany || http://bit.ly/9X21wr|| 
   ||  ||
  || VLDB|| Sep 13-17 2010  || 
Singapore || http://www.vldb2010.org/|| 
   ||  ||
  || Surge   || Sep 30 - Oct 1 2010 || 
Baltimore, MD USA || http://omniti.com/surge/2010|| 
   ||  ||
  || XLDB|| Oct 6 - 7 2010  || 
Menlo Park, CA USA|| http://www.xldb.org/4   || 
Alan Gates (Yahoo) ||  ||
+ || Hadoop World NYC|| Oct 12 2010 || New 
York City, NY USA || http://bit.ly/9WlnJZ|| 
   ||  ||
  || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || 
Indianapolis, IN USA  || http://bit.ly/aXCflu|| 
   ||  ||
  


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-07-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=5&rev2=6

--

- = Page under construction =
- 
  #format wiki
  #language en
  
@@ -18, +16 @@

  
  == Syntax ==
  To support native mapreduce job pig will support following syntax-
- 
  {{{
  X = ... ;
  Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
@@ -35, +32 @@

  Purpose of [[#ref2|pig streaming]] is to send data through an external script 
or program to transform a dataset into a different dataset based on a custom 
script written in any programming/scripting language. Pig streaming uses 
support of hadoop streaming to achieve this. Pig can register custom programs 
in a script, inline in the stream clause or using a define clause. Pig also 
provides a level of data guarantees on the data processing, provides feature 
for job management, provides ability to use distributed cache for the scripts 
(configurable). Streaming applications run locally on individual mapper and 
reducer nodes for transforming the data.
  
  === Hive Transforms ===
- With [[#ref3|hive transforms]], users can also plug in their own custom 
mappers and reducers in the data stream. Basically, it is also an application 
of custom streaming supported by hadoop. Thus, these mappers and reducers can 
be written in any scripting languages and can be registered to distributed 
cache to help performance. To support custom map reduce programs written in 
java ([[#ref4|bezo's blog]]), we can use our custom mappers and reducers as 
data streaming functions and use them to transform the data using 'java -cp 
mymr.jar'. This will not invoke a map reduce task but will attempt to transform 
the data during the map or the reduce task (locally).
+ With [[#ref3|hive transforms]], users can also plug in their own custom 
mappers and reducers in the data stream. Basically, it is also an application 
of custom streaming supported by hadoop. Thus, these mappers and reducers can 
be written in any scripting languages and can be registered to distributed 
cache to help performance. To support custom map reduce programs written in 
java ([[#ref4|bizo's blog]]), we can use our custom mappers and reducers as 
data streaming functions and use them to transform the data using 'java -cp 
mymr.jar'. This will not invoke a map reduce task but will attempt to transform 
the data during the map or the reduce task (locally).
  
  Thus, both these features can transform data submitted to a map reduce job (mapper) into a different data set and/or transform data produced by a mapreduce job (reducer) into a different data set. But we should note that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these blocks only allow custom transformations inside the data pipeline and do not break the pipeline.
  
@@ -45, +42 @@

  Native Mapreduce job needs to conform to some specification defined by Pig. 
This is required because Pig specifies the input and output directory in the 
script for this job and is responsible for managing the coordination of the 
native job with the remaining pig mapreduce jobs. Pig also might need to 
provide some extra configuration like job name, input/output formats, 
parallelism to the native job. For communicating such parameters to the native 
job, it should be according to specification provided by Pig.
  
  Following are some of the approaches of achieving this-
-  1. Ordered inputLoc/outputLoc parameters- This is simplistic approach 
wherein native programs follow up a convention so that their first and second 
parameters are treated as input and output respectively. Pig ''native'' command 
takes the parameters required by the native mapreduce job and passes it to 
native job as command line arguments. It is upto the native program to use 
these parameters for operations it performs.
+  1. '''Ordered inputLoc/outputLoc parameters'''- This is a simplistic approach wherein native programs follow a convention so that their first and second parameters are treated as input and output respectively. The Pig ''native'' command takes the parameters required by the native mapreduce job and passes them to the native job as command line arguments. It is up to the native program to use these parameters for the operations it performs.
  Thus, only following lines of code are mandatory inside the native program.
  {{{
  FileInputFormat.setInputPaths(conf, new Path(args[0]));  
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  }}}
-  2. getJobConf Function- Native jobs implement '''getJobConf''' method which 
returns org.apache.hadoop.mapred.JobConf object so that pig can schedule the 
job. This also provides a wa

[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-07-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=4&rev2=5

--

  With native job support, pig can support native map reduce jobs written in 
java language that can convert a data set into a different data set after 
applying a custom map reduce functions of any complexity.
  
  == Native Mapreduce job specification ==
- Native Mapreduce job needs to conform to some specification defined by Pig. 
This is required as Pig specifies the input and output directory in the script 
for this job and is responsible for managing the coordination of the native job 
with the remaining pig mapreduce jobs. Pig also might need to provide some 
extra configuration like job name, input/output formats, parallelism to the 
native job. For communicating such parameters to the native job, it should 
provide some way of communication.
+ Native Mapreduce job needs to conform to some specification defined by Pig. 
This is required because Pig specifies the input and output directory in the 
script for this job and is responsible for managing the coordination of the 
native job with the remaining pig mapreduce jobs. Pig also might need to 
provide some extra configuration like job name, input/output formats, 
parallelism to the native job. For communicating such parameters to the native 
job, it should be according to specification provided by Pig.
  
  Following are some of the approaches of achieving this-
  1. Ordered inputLoc/outputLoc parameters- This is a simplistic approach wherein native programs follow a convention so that their first and second parameters are treated as input and output respectively. The Pig ''native'' command takes the parameters required by the native mapreduce job and passes them to the native job as command line arguments. It is up to the native program to use these parameters for the operations it performs.
@@ -51, +51 @@

  FileInputFormat.setInputPaths(conf, new Path(args[0]));  
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  }}}
+  2. getJobConf Function- Native jobs implement '''getJobConf''' method which 
returns org.apache.hadoop.mapred.JobConf object so that pig can schedule the 
job. This also provides a way to add more pig specific parame
- 
-  2. getJobConf Function-
  
  
  


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-07-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=3&rev2=4

--

  
  == Comparison with similar features ==
  === Pig Streaming ===
- Purpose of [[#ref2|pig streaming]] is to send data through an external script 
or program to transform a dataset into a different dataset based on a custom 
script written in any programming/scripting language. Pig streaming uses 
support of hadoop streaming to achieve this. Pig can register custom programs 
in a script, inline in the stream clause or using a define clause. Pig also 
provides a level of data guarantees on the data processing, provides feature 
for job management, provides ability to use distributed cache for the scripts 
(configurable). Streaming application run locally on individual mapper and 
reducer nodes.
+ Purpose of [[#ref2|pig streaming]] is to send data through an external script 
or program to transform a dataset into a different dataset based on a custom 
script written in any programming/scripting language. Pig streaming uses 
support of hadoop streaming to achieve this. Pig can register custom programs 
in a script, inline in the stream clause or using a define clause. Pig also 
provides a level of data guarantees on the data processing, provides feature 
for job management, provides ability to use distributed cache for the scripts 
(configurable). Streaming applications run locally on individual mapper and 
reducer nodes for transforming the data.
  
  === Hive Transforms ===
  With [[#ref3|hive transforms]], users can also plug in their own custom 
mappers and reducers in the data stream. Basically, it is also an application 
of custom streaming supported by hadoop. Thus, these mappers and reducers can 
be written in any scripting languages and can be registered to distributed 
cache to help performance. To support custom map reduce programs written in 
java ([[#ref4|bezo's blog]]), we can use our custom mappers and reducers as 
data streaming functions and use them to transform the data using 'java -cp 
mymr.jar'. This will not invoke a map reduce task but will attempt to transform 
the data during the map or the reduce task (locally).
  
  Thus, both these features can transform data submitted to a map reduce job (mapper) into a different data set and/or transform data produced by a mapreduce job (reducer) into a different data set. But we should note that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these blocks only allow custom transformations inside the data pipeline and do not break the pipeline.
  
- With native job support, pig can support native map reduce jobs written in 
java language that can convert a data set into a different data set after 
applying a custom map reduce function of any complexity.
+ With native job support, pig can support native map reduce jobs written in the java language that can convert a data set into a different data set after applying custom map reduce functions of any complexity.
  
  == Native Mapreduce job specification ==
+ Native Mapreduce job needs to conform to some specification defined by Pig. 
This is required as Pig specifies the input and output directory in the script 
for this job and is responsible for managing the coordination of the native job 
with the remaining pig mapreduce jobs. Pig also might need to provide some 
extra configuration like job name, input/output formats, parallelism to the 
native job. For communicating such parameters to the native job, it should 
provide some way of communication.
- Native Mapreduce job needs to conform to some specification defined by Pig. 
Pig specifies the input and output directory in the script for this job and is 
responsible for managing the coordination of the native job with the remaining 
pig mapreduce jobs. To allow pig to communicate with native map reduce job
- 1. Ordered inputLoc/outputLoc parameters- 
  
+ Following are some of the approaches of achieving this-
+  1. Ordered inputLoc/outputLoc parameters- This is a simplistic approach wherein native programs follow a convention so that their first and second parameters are treated as input and output respectively. The Pig ''native'' command takes the parameters required by the native mapreduce job and passes them to the native job as command line arguments. It is up to the native program to use these parameters for the operations it performs.
+ Thus, only following lines of code are mandatory inside the native program.
+ {{{
+ FileInputFormat.setInputPaths(conf, new Path(args[0]));  
+ FileOutputFormat.setOutputPath(conf, new Path(args[1]));
+ }}}
+ 
- 2. getJobConf Function-
+  2. getJobConf Function-
+ 
+ 
  
  == Implementation Details ==
- 
+ Logical Plan- 
  
  == References ==
   1. <> PI

[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-07-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=5&rev2=6

--

  }
  }}}
  
+ == Approach 3 ==
+ At the Pig contributor workshop in June 2010 Dmitriy Ryaboy proposed that we 
go the DSL route in Java.  Thus the example given above becomes something like:
+ 
+ {{{
+ 
+ public class Main {
+ 
+ public static void main(String[] args) {
+ float error = 100.0f;
+ String infile = "original.data";
+ PigBuilder pig = new PigBuilder();
+ while (error > 1.0) {
+ PigRelation A = pig.load(infile, "piggybank.MyLoader");
+ PigRelation B = A.group(pig.ALL);
+ // It's not entirely clear to me how nested foreach works in this 
scenario
+ PigRelation C = B.foreach(new MyFunc("A"));
+
+ PigIterator pi = pig.openIterator(C, "outfile");
+ Tuple t = pi.next();
+ error = (Float) t.get(1);
+ if (error >= 1.0) {
+ pig.fs.mv("outfile", "infile");
+ }
+ }
+ }
+ }
+ }}}
+ 
+ This would be accomplished by creating a public interface for Pig operators 
(here called !PigBuilder, but I'm not proposing that as the actual name) that 
would
+ construct a logical plan and execute it when openIterator is called, much as 
!PigServer does today.  Another way to look at this is !PigServer could be 
changed to
+ expose Pig operators instead of just strings as it does today.  
+ 
+ The beauty of doing this in Java is it facilitates it being used in scripting 
languages as well.  Since Java packages can be directly imported into Jython, 
JRuby,
+ Groovy, and other languages, this immediately provides a scripting interface in the language of the user's choice.
+ 
+ This does violate requirement 10 above (that Pig Latin should appear the same 
in embedded and non-embedded form), but the cross language functionality may be 
worth
+ it.
+ 


[Pig Wiki] Update of "PigTalksPapers" by AlanGates

2010-07-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigTalksPapers" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigTalksPapers?action=diff&rev1=10&rev2=11

--

   * Pig poster at USENIX 2008: 
[[http://www.cs.cmu.edu/~olston/usenix08-poster.ppt|ppt]]
   * An interview with one of Yahoo's most prominent Pig users, including his 
take on Pig Latin vs. SQL: 
[[http://developer.yahoo.net/blogs/theater/archives/2008/04/_size75.html|video]]
  
+ == Contributor Workshops ==
+  * June 2010 [[attachment:PigContributorWorkshop.pptx|slides]]
+ 


New attachment added to page PigTalksPapers on Pig Wiki

2010-07-13 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page "PigTalksPapers" for change notification. An 
attachment has been added to that page by AlanGates. Following detailed 
information is available:

Attachment name: PigContributorWorkshop.pptx
Attachment size: 165564
Attachment link: 
http://wiki.apache.org/pig/PigTalksPapers?action=AttachFile&do=get&target=PigContributorWorkshop.pptx
Page link: http://wiki.apache.org/pig/PigTalksPapers


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-07-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=2&rev2=3

--

  <>
  <>
  
- This document captures the specification for native map reduce jobs and 
proposal for executing native mapreduce jobs inside pig script. This is tracked 
at [[#ref1|Jira]].
+ This document captures the specification for native map reduce jobs and 
proposal for executing native mapreduce jobs inside pig script. This is tracked 
at [[#ref1|PIG-506]].
  
  == Introduction ==
- Pig needs to provide a way to natively run map reduce jobs written in java 
language. 
+ Pig needs to provide a way to natively run map reduce jobs written in java 
language.
  There are some advantages of this-
   1. The advantages of the ''native'' keyword are that the user need not be 
worried about coordination between the jobs, pig will take care of it.
   2. User can make use of existing java applications without being a java 
programmer.
@@ -24, +24 @@

  Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' 
USING storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
  }}}
  
- This stores '''X''' into the '''storeLocation''' which is used by native 
mapreduce to read its data. After we run mymr.jar's mapreduce we load back the 
data from '''loadLocation''' into alias '''Y'''.
+ This stores '''X''' into the '''storeLocation''' using '''storeFunc''', which 
is then used by native mapreduce to read its data. After we run mymr.jar's 
mapreduce, we load back the data from '''loadLocation''' into alias '''Y''' 
using '''loadFunc'''.
+ 
+ params are extra parameters required for native mapreduce job (TBD).
+ 
+ mymr.jar is compliant with the pig specification (see below).
  
  == Comparison with similar features ==
  === Pig Streaming ===
+ Purpose of [[#ref2|pig streaming]] is to send data through an external script 
or program to transform a dataset into a different dataset based on a custom 
script written in any programming/scripting language. Pig streaming uses 
support of hadoop streaming to achieve this. Pig can register custom programs 
in a script, inline in the stream clause or using a define clause. Pig also 
provides a level of data guarantees on the data processing, provides feature 
for job management, provides ability to use distributed cache for the scripts 
(configurable). Streaming applications run locally on individual mapper and 
reducer nodes.
  
- === Hive Transform ===
+ === Hive Transforms ===
+ With [[#ref3|hive transforms]], users can also plug in their own custom 
mappers and reducers in the data stream. Basically, it is also an application 
of custom streaming supported by hadoop. Thus, these mappers and reducers can 
be written in any scripting languages and can be registered to distributed 
cache to help performance. To support custom map reduce programs written in 
java ([[#ref4|bezo's blog]]), we can use our custom mappers and reducers as 
data streaming functions and use them to transform the data using 'java -cp 
mymr.jar'. This will not invoke a map reduce task but will attempt to transform 
the data during the map or the reduce task (locally).
+ 
+ Thus, both these features can transform data submitted to a map reduce job (mapper) into a different data set and/or transform data produced by a mapreduce job (reducer) into a different data set. But we should note that the data transformation takes place on a single machine and does not take advantage of the map reduce framework itself. Also, these blocks only allow custom transformations inside the data pipeline and do not break the pipeline.
+ 
+ With native job support, pig can support native map reduce jobs written in 
java language that can convert a data set into a different data set after 
applying a custom map reduce function of any complexity.
  
  == Native Mapreduce job specification ==
- Native Mapreduce job needs to conform to some specification defined by Pig. 
Pig specifies the input and output directory for this job and is responsible 
for 
+ Native Mapreduce job needs to conform to some specification defined by Pig. 
Pig specifies the input and output directory in the script for this job and is 
responsible for managing the coordination of the native job with the remaining 
pig mapreduce jobs. To allow pig to communicate with native map reduce job
+ 1. Ordered inputLoc/outputLoc parameters- 
  
+ 2. getJobConf Function-
  
  == Implementation Details ==
  
@@ -42, +54 @@

   1. <> PIG-506, "Does pig need a NATIVE keyword?", 
https://issues.apache.org/jira/browse/PIG-506
   2. <> Pig Wiki, "Pig Streaming Functional Specification", 
http://wiki.apache.org/pig/PigStreamingFunctionalSpec
   3. <> Hive Wiki, "Transform/Map-Reduce Syntax", 
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform
+  4. <> Bizos blog, "hi

[Pig Wiki] Update of "Conferences" by AlanGates

2010-07-12 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Conferences" page has been changed by AlanGates.
http://wiki.apache.org/pig/Conferences?action=diff&rev1=2&rev2=3

--

  scheduled to present at one, please note that here.  If you are aware of 
conferences, user groups, meetups, etc. that are of
  interest to the Pig community that are not listed here please add them to the 
list.
  
- || '''Title''' || '''Date'''  || 
'''Location'''|| '''More Information'''  || 
'''Attending''' || '''Presenting''' ||
+ || '''Title''' || '''Date'''  || 
'''Location'''|| '''More Information'''  || 
'''Attending'''|| '''Presenting''' ||
- || NoSQL Summer|| Summer 2010 || 
Multiple world wide   || http://nosqlsummer.org/ || 
||  ||
+ || NoSQL Summer|| Summer 2010 || 
Multiple world wide   || http://nosqlsummer.org/ || 
   ||  ||
- || Bay Area Hadoop User Group  || Jul 21 2010 || 
Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || 
||  ||
+ || Bay Area Hadoop User Group  || Jul 21 2010 || 
Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || 
   ||  ||
- || Apache Asia Roadshow|| Aug 14-15 2010  || 
Shanghai, China   || http://roadshowasia.52ac.com/openconf.php   || 
||  ||
+ || Apache Asia Roadshow|| Aug 14-15 2010  || 
Shanghai, China   || http://roadshowasia.52ac.com/openconf.php   || 
   ||  ||
- || Open SQL Camp   || Aug 21-22 2010  || St. 
Augustin, Germany || http://bit.ly/9X21wr|| 
||  ||
+ || Open SQL Camp   || Aug 21-22 2010  || St. 
Augustin, Germany || http://bit.ly/9X21wr|| 
   ||  ||
- || VLDB|| Sep 13-17 2010  || 
Singapore || http://www.vldb2010.org/|| 
||  ||
+ || VLDB|| Sep 13-17 2010  || 
Singapore || http://www.vldb2010.org/|| 
   ||  ||
- || Surge   || Sep 30 - Oct 1 2010 || 
Baltimore, MD USA || http://omniti.com/surge/2010|| 
||  ||
+ || Surge   || Sep 30 - Oct 1 2010 || 
Baltimore, MD USA || http://omniti.com/surge/2010|| 
   ||  ||
+ || XLDB|| Oct 6 - 7 2010  || 
Menlo Park, CA USA|| http://www.xldb.org/4   || 
Alan Gates (Yahoo) ||  ||
- || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || 
Indianapolis, IN USA  || http://bit.ly/aXCflu|| 
||  ||
+ || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || 
Indianapolis, IN USA  || http://bit.ly/aXCflu|| 
   ||  ||
  


[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by A niket Mokashi

2010-07-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/UDFsUsingScriptingLanguages?action=diff&rev1=1&rev2=2

--

  @schemaFunction("squareSchema")
  def squareSchema(input):
return input
+ 
+ # No decorator - bytearray
+ def concat(str):
+   return str+str
  }}}
  Registering test.py with pig under the myfuncs namespace makes the functions 
myfuncs.helloworld(), myfuncs.complex(2), and myfuncs.square(2.0) available as 
UDFs. These UDFs can be used with
  {{{


[Pig Wiki] Update of "UDFsUsingScriptingLanguages" by A niket Mokashi

2010-07-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "UDFsUsingScriptingLanguages" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/UDFsUsingScriptingLanguages

--

New page:
#format wiki
#language en

<>
<>

This document captures the specification for UDFs written in scripting 
languages; it covers the syntax, usage details, and performance numbers for 
this feature. This is tracked at [[#ref1|PIG-928]].

== UDFs Using Scripting Languages ==
Pig needs to support user defined functions written in different scripting 
languages such as Python, Ruby, and Groovy. Pig can make use of modules such as 
[[#ref2|jython]] and [[#ref3|jruby]], which make these scripts available for 
java to use. Pig needs to support ways to register functions from script files 
written in these languages, as well as inline functions defined directly in the 
pig script.

== Syntax ==

=== Registering scripts ===
{{{
Register 'test.py' using jython as myfuncs;
}}}
This uses org.apache.pig.scripting.jython.JythonScriptEngine to interpret the 
python script. Users can use custom script engines to support multiple 
languages and ways to interpret them. Currently, pig identifies jython as a 
keyword and ships the required scriptengine (jython) to interpret it.

The following syntax is also supported -
{{{
Register 'test.py' using org.apache.pig.scripting.jython.JythonScriptEngine as 
myfuncs;
}}}

myfuncs is the namespace created for all the functions inside test.py.

A typical test.py looks as follows -
{{{
#!/usr/bin/python

@outputSchema("x:{t:(word:chararray)}")
def helloworld():  
  return ('Hello, World')

@outputSchema("y:{t:(word:chararray,num:long)}")
def complex(word):  
  return (str(word),long(word)*long(word))

@outputSchemaFunction("squareSchema")
def square(num):
  return ((num)*(num))

@schemaFunction("squareSchema")
def squareSchema(input):
  return input
}}}
Registering test.py with pig under the myfuncs namespace makes the functions 
myfuncs.helloworld(), myfuncs.complex(2), and myfuncs.square(2.0) available as 
UDFs. These UDFs can be used with
{{{
b = foreach a generate myfuncs.helloworld(), myfuncs.square(3);
}}}
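
For completeness, a minimal end-to-end sketch; the input file 'numbers.txt' and 
its single-column schema are assumptions for illustration:

{{{
Register 'test.py' using jython as myfuncs;
a = load 'numbers.txt' as (num:long);
-- helloworld takes no arguments; square and complex operate on the loaded field
b = foreach a generate myfuncs.helloworld(), myfuncs.square(num), myfuncs.complex(num);
dump b;
}}}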

=== Decorators and Schemas ===
To annotate a python script so that pig can identify its return types, we use 
decorators to define the output schema for a script UDF. 
 '''outputSchema''' defines the schema for a script udf in a format that pig 
understands and is able to parse. 
 
 '''outputSchemaFunction''' defines a script delegate function that determines 
the schema for this function depending upon the input type. This is needed for 
functions that can accept generic types and perform generic operations on them. 
A simple example is ''square'', which can accept multiple types. The 
schemaFunction for this case is a simple identity function (same schema as 
input).
 
 '''schemaFunction''' defines the delegate function and is not registered with pig.

 
When no decorator is specified, pig assumes the output datatype is bytearray 
and converts the output generated by the script function to bytearray. This is 
consistent with pig's behavior in other cases. 
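
As a small illustration of this default, an undecorated function (such as the 
concat function added in a later revision, shown in the diff above) returns 
bytearray, which can be cast explicitly where a typed value is needed; the 
input file and schema here are assumptions:

{{{
Register 'test.py' using jython as myfuncs;
a = load 'words.txt' as (w:chararray);
-- concat has no decorator, so its result is treated as bytearray; cast if needed
b = foreach a generate (chararray)myfuncs.concat(w);
dump b;
}}}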

''Sample Schema String'' - y:{t:(word:chararray,num:long)}; the variable names 
are not used anywhere, they are just there to keep the syntax consistent.

== Inline Scripts ==

== Performance ==
=== Jython ===


== References ==
 1. <> PIG-928, "UDFs in scripting languages", 
https://issues.apache.org/jira/browse/PIG-928
 2. <> Jython, "The jython project", http://www.jython.org/
 3. <> Jruby, "100% pure-java implementation of ruby programming 
language", http://jruby.org/


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-07-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
The comment on this change is: Page under construction.
http://wiki.apache.org/pig/NativeMapReduce?action=diff&rev1=1&rev2=2

--

+ = Page under construction =
+ 
  #format wiki
  #language en
  
  <>
  <>
  
- This document captures the specification for native map reduce jobs and 
proposal for executing native mapreduce jobs inside pig script. This is tracked 
at *https://issues.apache.org/jira/browse/PIG-506.
+ This document captures the specification for native map reduce jobs and 
proposal for executing native mapreduce jobs inside pig script. This is tracked 
at [[#ref1|Jira]].
  
  == Introduction ==
  Pig needs to provide a way to natively run map reduce jobs written in java 
language. 
@@ -37, +39 @@

  
  
  == References ==
- 
   1. <> PIG-506, "Does pig need a NATIVE keyword?", 
https://issues.apache.org/jira/browse/PIG-506
   2. <> Pig Wiki, "Pig Streaming Functional Specification", 
http://wiki.apache.org/pig/PigStreamingFunctionalSpec
   3. <> Hive Wiki, "Transform/Map-Reduce Syntax", 
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform


[Pig Wiki] Update of "NativeMapReduce" by Aniket Mokash i

2010-07-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "NativeMapReduce" page has been changed by Aniket Mokashi.
http://wiki.apache.org/pig/NativeMapReduce

--

New page:
#format wiki
#language en

<>
<>

This document captures the specification for native map reduce jobs and 
proposal for executing native mapreduce jobs inside pig script. This is tracked 
at *https://issues.apache.org/jira/browse/PIG-506.

== Introduction ==
Pig needs to provide a way to natively run map reduce jobs written in Java. 
There are some advantages to this -
 1. The ''native'' keyword means the user need not worry about coordination 
between the jobs; pig will take care of it.
 2. Users can make use of existing java applications without being java 
programmers.

== Syntax ==
To support native mapreduce jobs, pig will support the following syntax -

{{{
X = ... ;
Y = NATIVE ('mymr.jar' [, 'other.jar' ...]) STORE X INTO 'storeLocation' USING 
storeFunc LOAD 'loadLocation' USING loadFunc [params, ... ];
}}}

This stores '''X''' into '''storeLocation''', which the native mapreduce job 
uses to read its data. After mymr.jar's mapreduce job runs, the data is loaded 
back from '''loadLocation''' into the alias '''Y'''.
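
For example, a sketch with a hypothetical word-count jar; the jar name, the 
intermediate directories, and the assumption that the jar's main method reads 
from '/tmp/wc_input' and writes to '/tmp/wc_output' are for illustration only:

{{{
A = load 'pages' using PigStorage() as (line:chararray);
B = NATIVE ('wordcount.jar') STORE A INTO '/tmp/wc_input' USING PigStorage()
    LOAD '/tmp/wc_output' USING PigStorage();
store B into 'wordcounts';
}}}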

== Comparison with similar features ==
=== Pig Streaming ===

=== Hive Transform ===

== Native Mapreduce job specification ==
Native Mapreduce job needs to conform to some specification defined by Pig. Pig 
specifies the input and output directory for this job and is responsible for 


== Implementation Details ==


== References ==

 1. <> PIG-506, "Does pig need a NATIVE keyword?", 
https://issues.apache.org/jira/browse/PIG-506
 2. <> Pig Wiki, "Pig Streaming Functional Specification", 
http://wiki.apache.org/pig/PigStreamingFunctionalSpec
 3. <> Hive Wiki, "Transform/Map-Reduce Syntax", 
http://wiki.apache.org/hadoop/Hive/LanguageManual/Transform


[Pig Wiki] Update of "PoweredBy" by SeanTimm

2010-07-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PoweredBy" page has been changed by SeanTimm.
http://wiki.apache.org/pig/PoweredBy?action=diff&rev1=2&rev2=3

--

  Applications and organizations using Pig include (alphabetically):
+ 
+  * [[http://www.aol.com/|AOL]]
+   * AOL has multiple clusters from a few nodes to several hundred nodes.
+   * We use Hadoop for analytics and batch data processing for various 
applications.
+   * Hadoop is used by MapQuest, Ad, Search, Truveo, and Media groups.
+   * All of our jobs are written in Pig or native map reduce.
  
   * [[http://www.cooliris.com/|Cooliris]] - Cooliris transforms your browser 
into a lightning fast, cinematic way to browse photos and videos, both online 
and on your hard drive.
* We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB 
ram, and 3-4 TB of storage.


Page 0102 deleted from Pig Wiki

2010-07-05 Thread Apache Wiki
Dear wiki user,

You have subscribed to a wiki page "Pig Wiki" for change notification.

The page "0102" has been deleted by daijy.
The comment on this change is: delete spam.
http://wiki.apache.org/pig/0102


[Pig Wiki] Update of "0102" by 0102

2010-07-02 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "0102" page has been changed by 0102.
http://wiki.apache.org/pig/0102

--

New page:
Third, isolation of the local government departments and the administrative 
utility of the SAR and the constraints of economic development zones. On a 
regional reform, opening up and economic development constraints, in addition 
to the central and provincial government departments from the outside, but also 
from the management of the development of regional government departments. The 
special zone administrative system advantages: First, improve administrative 
efficiency. In the administrative examination and approval of a unified 
administration, prevented a project Touzi, a Qi Ye registration, etc., needed 
to many a department for approval and the time very long phenomenon. Second, to 
prevent government departments and utilities on the enterprise administrative 
fees and fines. Even some zones, the protection of businesses in the region, 
does not allow government departments and the administrative utilities to the 
development zone to the charges and fines. This is why the SAR and the 
operation and development zone enterprises to invest in an important reason for 
lower cost. 
Fourth, the structure and experience, including economic development, 
industrial growth and development zones for non-SAR, as well as the formation 
of the national pilot, demonstration, diffusion, lead, and other associated 
effects. From the SAR, to the Free Trade Zone, to economic and technological 
development zones, from the national economic and technological development 
zones, to the provincial and municipal economic and technological development 
zones, the government in a special area and the park systems and policies, and 
gradually from point to surface, from the coast to the interior, from the 
central zone to test and promote local level development zones. This pattern of 
reform and opening up has greatly liberated the productive forces, increasing 
the spread of industry and association, due to division of labor, industrial 
extension, production and supporting, etc., plus the logistics distribution, 
development led the Pearl River Delta, Yangtze River Delta, Bohai Bay economic 
development, but to the Midwest industrial and transport development. Fan Gang, 
On the role of the SAR model in the system when that began to reform a big 
issue is the lack of information, lack of knowledge and, as the reform and 
opening of the SAR, the responsibility and act as a rapid absorbing 
introduction of various relations, systems and information an important 
mechanism. To clarify relations between various systems, to promote the smooth 
implementation of reforms, which require a region in all aspects of the reform 
to get this information. For the pilot reform of the country was full of 
knowledge, information, experiences and lessons learned, and then used to guide 
the country's reforms, the country to show the way to do model. This is the 
significance of the special economic zones and important role in the host [9]. 
In conclusion, Comrade Deng Xiaoping, the region is to land in China to learn 
the advanced systems and mechanisms, new a new kind of modern enterprises and 
government institutions; is the use of foreign capital, technology and advanced 
management, the formation of a new industrial system, to boost the national 
economy, greatly emancipated productivity. Opening up of the SAR, bonded, large 
coastal open economic and technological development, and the subsequent opening 
up of inland areas and border owe a great deal! [http://www.mbt6shoes.com]  
  Wholesale mbt shoes


[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=4&rev2=5

--

  }
  }}}
  
+ === Other Thoughts ===
+ Whichever way we do it, we need to consider what built in variables we need 
in the system.  For example, it would be really nice to have a
+ status variable so that you could do something like:
+ 
+ {{{
+ ...
+ store X into 'foo';
+ if ($status == 0) { -- or "success" or whatever
+ ...
+ } else {
+ ...
+ }
+ }}}
+ 


[Pig Wiki] Update of "AvoidingSedes" by ThejasNair

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=4&rev2=5

--

  == Delaying/Avoiding deserialization at runtime ==
  These approaches (except 5) do not involve major changes to core pig code. 
Load functions, or the serialization between map and reduce, can be changed 
separately to improve performance.
   1. '''!LoadFunctions make use of public interface 
!LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not in 
the required field list. This should always improve performance. !PigStorage 
indirectly works this way: if a column is not used, the optimizer removes the 
casting (i.e. deserialization) of the column from the type-casting foreach 
statement which comes after the load.
-  1. '''!LoadFunction return a custom tuple, which deserializes fields only 
when tuple.get(i) is called.'''  This can be useful if the first operator 
after load is a filter operator - the whole filter expression might not have to 
be evaluated and that deserialization of all columns might not have to be done. 
Assuming the first approach is already implemented, then this approach is 
likely to have some overhead with queries where all tuple.get(i) is called on 
all columns/rows.
+  1. '''!LoadFunction returns a custom tuple, which deserializes fields only 
when tuple.get(i) is called.'''  This can be useful if the first operator after 
load is a filter operator - the whole filter expression might not have to be 
evaluated and that deserialization of all columns might not have to be done. 
Assuming the first approach is already implemented, then this approach is 
likely to have some overhead with queries where all tuple.get(i) is called on 
all columns/rows.
   1. '''!LoadFunction delays deserialization of map and bag types until a 
member function of java.util.Map or !DataBag is called. ''' The load function 
uses subclass of Map and DataBag which holds the serialized copy. This will 
help in delaying the deserialization further. This can't be done for scalar 
types because the classes pig uses for them are final; even if that were not 
the case we might not see much of performance gain because of the cost of 
creating an copy of the serialized data might be high compared to the cost of 
deserialization. This will only delay serialization up to the MR boundaries. 
  {{{
  Example of query where this will help -


[Pig Wiki] Update of "AvoidingSedes" by ThejasNair

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=3&rev2=4

--

  
  
  == Delaying/Avoiding deserialization at runtime ==
- These approaches does not involve any changes to core pig code. Load 
functions, or serialization between map and reduce can be separately changed to 
improve performance.
+ These approaches (except 5) do not involve major changes to core pig code. 
Load functions, or the serialization between map and reduce, can be changed 
separately to improve performance.
   1. '''!LoadFunctions make use of public interface 
!LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not in 
the required field list. This should always improve performance. !PigStorage 
indirectly works this way: if a column is not used, the optimizer removes the 
casting (i.e. deserialization) of the column from the type-casting foreach 
statement which comes after the load.
   1. '''!LoadFunction return a custom tuple, which deserializes fields only 
when tuple.get(i) is called.'''  This can be useful if the first operator 
after load is a filter operator - the whole filter expression might not have to 
be evaluated and that deserialization of all columns might not have to be done. 
Assuming the first approach is already implemented, then this approach is 
likely to have some overhead with queries where all tuple.get(i) is called on 
all columns/rows.
   1. '''!LoadFunction delays deserialization of map and bag types until a 
member function of java.util.Map or !DataBag is called. ''' The load function 
uses subclass of Map and DataBag which holds the serialized copy. This will 
help in delaying the deserialization further. This can't be done for scalar 
types because the classes pig uses for them are final; even if that were not 
the case we might not see much of performance gain because of the cost of 
creating an copy of the serialized data might be high compared to the cost of 
deserialization. This will only delay serialization up to the MR boundaries. 


[Pig Wiki] Update of "AvoidingSedes" by ThejasNair

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=2&rev2=3

--

  = Avoiding Serialization/De-serialization in pig =
- Serialization/De-serialization is expensive and avoiding it will improve 
performance.
+ Serialization/De-serialization is expensive and avoiding it will improve 
performance. This wiki discusses ideas that can help with that.
  
  
  == Delaying/Avoiding deserialization at runtime ==


[Pig Wiki] Update of "AvoidingSedes" by ThejasNair

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes?action=diff&rev1=1&rev2=2

--

- = Avoiding Serialization/De-serialization in pig
+ = Avoiding Serialization/De-serialization in pig =
  Serialization/De-serialization is expensive and avoiding it will improve 
performance.
  
  
- = Delaying/Avoiding deserialization at runtime
+ == Delaying/Avoiding deserialization at runtime ==
  These approaches does not involve any changes to core pig code. Load 
functions, or serialization between map and reduce can be separately changed to 
improve performance.
   1. '''!LoadFunctions make use of public interface 
!LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not in 
the required field list. This should always improve performance. !PigStorage 
indirectly works this way: if a column is not used, the optimizer removes the 
casting (i.e. deserialization) of the column from the type-casting foreach 
statement which comes after the load.
   1. '''!LoadFunction return a custom tuple, which deserializes fields only 
when tuple.get(i) is called.'''  This can be useful if the first operator 
after load is a filter operator - the whole filter expression might not have to 
be evaluated and that deserialization of all columns might not have to be done. 
Assuming the first approach is already implemented, then this approach is 
likely to have some overhead with queries where all tuple.get(i) is called on 
all columns/rows.


[Pig Wiki] Update of "AvoidingSedes" by ThejasNair

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "AvoidingSedes" page has been changed by ThejasNair.
http://wiki.apache.org/pig/AvoidingSedes

--

New page:
= Avoiding Serialization/De-serialization in pig
Serialization/De-serialization is expensive and avoiding it will improve 
performance.


= Delaying/Avoiding deserialization at runtime
These approaches do not involve any changes to core pig code. Load functions, 
or the serialization between map and reduce, can be changed separately to 
improve performance.
 1. '''!LoadFunctions make use of public interface 
!LoadPushDown.pushDownProjection.''' Don't deserialize columns that are not in 
the required field list (see the sketch after this list). This should always 
improve performance. !PigStorage indirectly works this way: if a column is not 
used, the optimizer removes the casting (i.e. deserialization) of the column 
from the type-casting foreach statement which comes after the load.
 1. '''!LoadFunction returns a custom tuple, which deserializes fields only when 
tuple.get(i) is called.'''  This can be useful if the first operator after 
load is a filter operator - the whole filter expression might not have to be 
evaluated and the deserialization of all columns might not have to be done. 
Assuming the first approach is already implemented, this approach is likely to 
have some overhead with queries where tuple.get(i) is called on all 
columns/rows.
 1. '''!LoadFunction delays deserialization of map and bag types until a member 
function of java.util.Map or !DataBag is called. ''' The load function uses 
subclasses of Map and DataBag which hold the serialized copy. This helps delay 
the deserialization further. This can't be done for scalar types because the 
classes pig uses for them are final; even if that were not the case we might 
not see much of a performance gain, because the cost of creating a copy of the 
serialized data might be high compared to the cost of deserialization. This 
will only delay serialization up to the MR boundaries. 
{{{
Example of query where this will help -
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;   -- Approach 2 will not help in 
delaying deserialization beyond this point.
fil = FILTER f BY $0 > 5;
dump fil; -- Serialization of column b can be delayed until here using this 
approach .
}}}
 1.#4 '''Set the property "pig.data.tuple.factory.name" to use a tuple that 
understands the serialization format used for bags and maps in approach 3, so 
that serialized data can be passed from the loader across MR boundaries in the 
serialization format of the load function. ''' The write() and readFields() 
functions of the tuple returned by the TupleFactory are used to serialize data 
between Map and Reduce. To use a new custom tuple, you need to use a custom 
TupleFactory that returns tuples of this type. But this approach will only work 
if the load functions in the query share the same serialization format for 
maps and bags.
 1. ''' Expose load function's sedes functionality in a new interface and track 
lineage of columns.''' This would be the elegant and extensible way of doing 
what is proposed in approach 4. For each serialized column, if we know the 
deserialization function, we can delay deserialization across MR boundaries.
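
A small Pig Latin sketch of the kind of query approach 1 targets (the file 
name, schema, and udf1 are illustrative only); since only column a is 
referenced, a load function implementing !LoadPushDown can avoid deserializing 
b and c entirely:

{{{
l = LOAD 'file1' AS (a : int, b : chararray, c : map [ ]);
f = FOREACH l GENERATE udf1(a);  -- only column a is in the required field list
dump f;
}}}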


[Pig Wiki] Update of "PigErrorHandlingFunctionalSpecifica tion" by Aniket Mokashi

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigErrorHandlingFunctionalSpecification" page has been changed by Aniket 
Mokashi.
http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification?action=diff&rev1=146&rev2=147

--

  ||1110||Unsupported query: You have an partition column () inside a 
 in the 
filter condition.||
  ||||Use of partition column/condition with non partition column/condition 
in filter expression is not supported.||
  ||1112||Unsupported query: You have an partition column () in a 
construction like: (pcond  and ...) or (pcond and ...) where pcond is a 
condition on a partition column.||
+ ||1113||Unable to describe schema for nested expression ||
+ ||1114||Unable to find schema for nested alias ||
  ||2000||Internal error. Mismatch in group by arities. Expected: . 
Found: ||
  ||2001||Unable to clone plan before compiling||
  ||2002||The output file(s):   already exists||


[Pig Wiki] Update of "Conferences" by AlanGates

2010-06-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Conferences" page has been changed by AlanGates.
http://wiki.apache.org/pig/Conferences?action=diff&rev1=1&rev2=2

--

  interest to the Pig community that are not listed here please add them to the 
list.
  
  || '''Title''' || '''Date'''  || 
'''Location'''|| '''More Information'''  || 
'''Attending''' || '''Presenting''' ||
- || NoSQL Summer   || Summer 2010 || 
Multiple world wide   || http://nosqlsummer.org/ || 
||  ||
+ || NoSQL Summer|| Summer 2010 || 
Multiple world wide   || http://nosqlsummer.org/ || 
||  ||
- || Chicago Hadoop User Group   || Jun 22 2010 || 
Chicago, IL USA   || http://bit.ly/b6Ncl3|| 
||  ||
  || Bay Area Hadoop User Group  || Jul 21 2010 || 
Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || 
||  ||
+ || Apache Asia Roadshow|| Aug 14-15 2010  || 
Shanghai, China   || http://roadshowasia.52ac.com/openconf.php   || 
||  ||
  || Open SQL Camp   || Aug 21-22 2010  || St. 
Augustin, Germany || http://bit.ly/9X21wr|| 
||  ||
  || VLDB|| Sep 13-17 2010  || 
Singapore || http://www.vldb2010.org/|| 
||  ||
+ || Surge   || Sep 30 - Oct 1 2010 || 
Baltimore, MD USA || http://omniti.com/surge/2010|| 
||  ||
  || First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || 
Indianapolis, IN USA  || http://bit.ly/aXCflu|| 
||  ||
  


[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=3&rev2=4

--

  Object outfile = new String("result.data");
  while (error != null && (Double)error > 1.0) {
  PigServer ps = new PigServer();
- ps.registerQuery("A = load infile;");
+ ps.registerQuery("A = load " + infile + ";");
  ps.registerQuery("B = group A all;");
  ps.registerQuery("C = foreach B generate 
flatten(doSomeCalculation(A)) as (result, error);");
  ps.registerQuery("error = foreach C generate error;");


[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=2&rev2=3

--

  
  Thoughts?  Preferences for one of the options I did not like?  Comments 
welcome.
  
+ == Approach 2 ==
+ And now for something completely different.
+ 
+ After thinking on the above for a week or so it occurs to me that in 
dismissing making Pig Latin itself Turing complete I am conflating two tasks
+ that could be decoupled.  The first is defining a grammar for the language 
and extending the parser.  The second is building an execution engine to execute
+ Pig Latin scripts.  It is the second that I am concerned is too much work.  
Defining the grammar and building the parser is relatively easy (as
+ we say in the Pig team at Yahoo, "parsers are easy").
+ 
+ So what if we did extend Pig Latin itself to be Turing complete, but the 
first pass over the language was to compile it down to Java code that made
+ use of the existing !PigServer class to execute the code?  This meets all ten 
requirements given above (some extra work will need to be done to meet
+ requirement 8 on up front semantic checking, but it is possible).  It deals 
with my initial concern that supporting Turing completeness in Pig Latin
+ is too much work.  It also has the exceedingly nice feature that we do not 
have to pick any one scripting language.  The more I talked to people the
+ more I discovered some wanted Python, some Ruby, some Perl, some Groovy, etc. 
 This avoids that problem.  And the extensions to Pig Latin themselves
+ will be simple enough that it should not be onerous for people to learn it.  
It also means that at some future time if we decide that we want more
+ control over how the language is executed we can make changes without people 
needing to switch from whatever scripting language we embed it in.
+ 
+ A significant downside to this proposal is now users have to have a Java 
compiler along to run their Pig Latin scripts.
+ 
+ The other concerns I gave above about making Pig Latin Turing complete are 
somewhat addressed, but not totally.  It would be possible, though
+ painful, to use a Java debugger on the generated Java code.  Syntax 
highlighting and completion files could be created for Vim, Emacs, Eclipse, and
+ whatever other favorite editors people have.
+ 
+ === Specifics ===
+ The grammar of the language should be kept as simple as possible.  The goal 
is not to create a general purpose programming language.
+ Tasks requiring these features should still be written in UDFs in Java or a 
scripting language.
+ 
+ Each Pig Latin file would be considered as a module.  All functions would 
have global scope within that module and would be visible once the module is
+ imported.
+ 
+ The type system would be existing Pig Latin types (we may need to add a list 
type).  Types would be bound at run time (this is necessary to support
+ existing PL grammar where A = load ... is a declaration of A).
+ 
+ The grammar would look something like:
+ 
+ {{{
+ program:
+   import
+ | register
+ | define
+ | func_definition
+ | block
+ 
+ import:
+   IMPORT _modulename_ namespace_clause
+ 
+ namespace_clause:
+   (empty)
+ | AS _namespacename_
+ 
+ register:
+   ... // as now
+ 
+ define:
+   ... // as now
+ 
+ func_definition:
+   DEF _functionname_ ( arg_list ) { block }
+   // not sure about this, having DEF and DEFINE different keywords.
+   // May want to reuse DEFINE here or DEFINE FUNCTION
+ 
+ arg_list:
+   expr
+ | arg_list , expr
+ 
+ block:
+   statement
+ | block statement
+ 
+ statement:
+   ;
+ | assignment
+ | if
+ | while
+ | for
+ | return // only valid inside functions
+ | CONTINUE ; // only valid inside loops
+ | BREAK ; // only valid inside loops
+ | split
+ | store
+ | dump
+ | fs
+ 
+ assignment:
+   _var_ = expr ;
+ | _var_ = LOAD _inputsrc_ ;
+ ... // GROUP, FILTER, etc. as now
+ 
+ statement_or_block:
+   statement
+ | { block }
+ 
+ if:
+   IF ( expr ) statement_or_block else
+ 
+ else:
+   (empty)
+ | ELSE statement_or_block
+ 
+ while:
+   WHILE ( expr ) statement_or_block
+ 
+ for:
+   FOR ( assignment ; expr ; expr ) statement_or_block
+ 
+ return:
+   RETURN ;
+ | RETURN expr ;
+ 
+ // split, dump, store, fs as now
+ }}}
+ 
+ So the example given initially would look like:
+ {{{
+ error = 100.0;
+ infile = 'original.data';
+ outfile = 'result.data';
+ while (error > 1.0) {
+ A = load infile;
+ B = group A all;
+ C = foreach B generate flatten(doSomeCalculation(A)) as (result, 
error);
+ error = foreach C generate error;
+ store C into outfile;
+ if (error > 1.0) fs mv outfile 

[Pig Wiki] Update of "Conferences" by AlanGates

2010-06-18 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Conferences" page has been changed by AlanGates.
http://wiki.apache.org/pig/Conferences

--

New page:
= Conferences and User Groups =

This page lists upcoming conferences, user groups, meetups, etc. that the Pig 
team is aware of.  The goal is for Pig users around the world to have a way to
identify conferences and other meetings that might be of interest to them.  
Also, it can help Pig users find each other at these
meetings.

If you are going to any of these, and especially if you are
scheduled to present at one, please note that here.  If you are aware of 
conferences, user groups, meetups, etc. that are of
interest to the Pig community that are not listed here please add them to the 
list.

|| '''Title''' || '''Date'''  || 
'''Location'''|| '''More Information'''  || 
'''Attending''' || '''Presenting''' ||
|| NoSQL Summer   || Summer 2010 || 
Multiple world wide   || http://nosqlsummer.org/ || 
||  ||
|| Chicago Hadoop User Group   || Jun 22 2010 || 
Chicago, IL USA   || http://bit.ly/b6Ncl3|| 
||  ||
|| Bay Area Hadoop User Group  || Jul 21 2010 || 
Sunnyvale, CA USA || http://www.meetup.com/hadoop/calendar/13546804/ || 
||  ||
|| Open SQL Camp   || Aug 21-22 2010  || St. 
Augustin, Germany || http://bit.ly/9X21wr|| 
||  ||
|| VLDB|| Sep 13-17 2010  || 
Singapore || http://www.vldb2010.org/|| 
||  ||
|| First International Mapreduce Workshop 2010 || Nov 30 - Dec 3 2010 || 
Indianapolis, IN USA  || http://bit.ly/aXCflu|| 
||  ||


[Pig Wiki] Update of "PigMix" by daijy

2010-06-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigMix" page has been changed by daijy.
http://wiki.apache.org/pig/PigMix?action=diff&rev1=16&rev2=17

--

  || PigMix_16 || 82.33|| 69.33 || 1.19   ||
  || PigMix_17 || 286  || 229.33|| 1.25   ||
  || Total || 2121.67  || 1929.67   || 1.10   ||
- ||Weighted Avg ||  1.14544   ||
+ || Weighted Avg ||||   || 1.15   ||
  
  
  == Features Tested ==


[Pig Wiki] Update of "PigMix" by daijy

2010-06-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigMix" page has been changed by daijy.
http://wiki.apache.org/pig/PigMix?action=diff&rev1=15&rev2=16

--

  {{{
  A = load 'page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
- B = order A by user parallel $mappers;
+ B = order A by user $parallelfactor;
  store B into 'page_views_sorted' using PigStorage('\u0001');
  
  alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, 
city, state, zip);
- a1 = order alpha by name parallel $mappers;
+ a1 = order alpha by name $parallelfactor;
  store a1 into 'users_sorted' using PigStorage('\u0001');
  
  a = load 'power_users' using PigStorage('\u0001') as (name, phone, address, 
city, state, zip);
@@ -287, +287 @@

  This script tests reading from a map, flattening a bag of maps, and use of 
bincond (features 2, 3, and 4).
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp,
  estimated_revenue, page_info, page_links);
  B = foreach A generate user, (int)action as action, (map[])page_info as 
page_info,
@@ -304, +304 @@

  This script tests using a join small enough to do in fragment and replicate 
(feature 7). 
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp,
  estimated_revenue, page_info, page_links);
  B = foreach A generate user, estimated_revenue;
@@ -321, +321 @@

  something that pig could potentially optimize by not regrouping.
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp,
  estimated_revenue, page_info, page_links);
  B = foreach A generate user, (double)estimated_revenue;
@@ -340, +340 @@

  This script covers foreach generate with a nested distinct (feature 10).
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp,
  estimated_revenue, page_info, page_links);
  B = foreach A generate user, action;
@@ -359, +359 @@

  This script does an anti-join.  This is useful because it is a use of cogroup 
that is not a regular join (feature 9).
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp,
  estimated_revenue, page_info, page_links);
  B = foreach A generate user;
@@ -377, +377 @@

  This script covers the case where the group by key is a significant 
percentage of the row (feature 12).
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp,
  estimated_revenue, page_info, page_links);
  B = foreach A generate user, action, (int)timespent as timespent, query_term, 
ip_addr, timestamp;
@@ -392, +392 @@

  This script covers having a nested plan with splits (feature 11).
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (user, action, 
timespent, query_term,
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader() as (user, action, 
timespent, query_term,
  ip_addr, timestamp, estimated_revenue, page_info, page_links);
  B = foreach A generate user, timestamp;
  C = group B by user $parallelfactor;
@@ -409, +409 @@

  This script covers group all (feature 13).
  {{{
  register pigperf.jar;
- A = load '$page_views' using 
org.apache.pig.test.utils.datagen.PigPerformanceLoader()
+ A = load '$page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
  as (user, action, timespent, query_term, ip_addr, timestamp,
  estimated_revenue, page_info, page_links);
  B = foreach A generate user, (int)timespent as timespent, 
(double)estimated_revenue as e

[Pig Wiki] Update of "PigMix" by daijy

2010-06-16 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigMix" page has been changed by daijy.
http://wiki.apache.org/pig/PigMix?action=diff&rev1=14&rev2=15

--

  PigMix is a set of queries used to test pig performance from release to release. 
 There are queries that test latency (how long does it
  take to run this query?), and queries that test scalability (how many fields 
or records can pig handle before it fails?).  In addition
  it includes a set of map reduce java programs to run equivalent map reduce 
jobs directly.  These will be used to test the performance
- gap between direct use of map reduce and using pig.
+ gap between direct use of map reduce and using pig. In June 2010, we released 
PigMix2, which includes 5 more queries in addition to
+ the original 12 queries in PigMix, to measure the performance of new Pig 
features. We will publish the results of both PigMix and PigMix2.
  
  == Runs ==
+ === PigMix ===
  
  The following table includes runs done of the pig mix.  All of these runs 
have been done on a cluster with 26 slaves plus one machine acting as the name 
node and job tracker.  The cluster was running 
  hadoop version 0.18.1.  (TODO:  Need to get specific hardware info on those 
machines).  
@@ -140, +142 @@

  || Total || 1407 || 1362.33   || 1.03   ||
  || Weighted Avg ||   ||   || 1.09   ||
  
+ === PigMix2 ===
+ Run date:  May 29, 2010, run against top of trunk as of that day.
+ || Test  || Pig run time || Java run time || Multiplier ||
+ || PigMix_1  || 122.33   || 117   || 1.05   ||
+ || PigMix_2  || 50.33|| 42.67 || 1.18   ||
+ || PigMix_3  || 189  || 100.33|| 1.88   ||
+ || PigMix_4  || 75.67|| 61|| 1.24   ||
+ || PigMix_5  || 64   || 138.67|| 0.46   ||
+ || PigMix_6  || 65.67|| 69.33 || 0.95   ||
+ || PigMix_7  || 88.33|| 84.33 || 1.05   ||
+ || PigMix_8  || 39   || 47.67 || 0.82   ||
+ || PigMix_9  || 274.33   || 215.33|| 1.27   ||
+ || PigMix_10 || 333.33   || 311.33|| 1.07   ||
+ || PigMix_11 || 151.33   || 157   || 0.96   ||
+ || PigMix_12 || 70.67|| 97.67 || 0.72   ||
+ || PigMix_13 || 80   || 33|| 2.42   ||
+ || PigMix_14 || 69   || 86.33 || 0.80   ||
+ || PigMix_15 || 80.33|| 69.33 || 1.16   ||
+ || PigMix_16 || 82.33|| 69.33 || 1.19   ||
+ || PigMix_17 || 286  || 229.33|| 1.25   ||
+ || Total || 2121.67  || 1929.67   || 1.10   ||
+ ||Weighted Avg ||  1.14544   ||
  
  
  == Features Tested ==
@@ -160, +184 @@

   1. union plus distinct
   1. order by
   1. multi-store query (that is, a query where data is scanned once, then 
split and grouped different ways).
+  1. outer join
+  1. merge join
+  1. multiple distinct aggregates
+  1. accumulative mode
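
For reference, a sketch of the kind of statements the newly added features 
exercise (illustrative only, not the actual PigMix2 queries; the field names 
follow the page_views and users schemas used elsewhere on this page):

{{{
-- outer join
C = join A by user left outer, B by name;
-- merge join (both inputs must be sorted on the join key)
D = join A by user, B by name using 'merge';
-- multiple distinct aggregates in one nested foreach
G = group A by user;
E = foreach G {
    actions = distinct A.action;
    terms = distinct A.query_term;
    generate group, COUNT(actions), COUNT(terms);
}
}}}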
  
  The data is generated so that it has a zipf type distribution for the group 
by and join keys, as this models most human generated
  data.
@@ -207, +235 @@

  between key value pairs and Ctrl-D between keys and values.  Bags in the file 
are delimited by Ctrl-B between tuples in the bag.
  A special loader, !PigPerformance loader has been written to read this 
format. 
  
+ PigMix2 include 4 more data set, which can be derived from the original 
dataset:
+ {{{
+ A = load 'page_views' using 
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
+ as (user, action, timespent, query_term, ip_addr, timestamp, 
estimated_revenue, page_info, page_links);
+ B = order A by user parallel $mappers;
+ store B into 'page_views_sorted' using PigStorage('\u0001');
+ 
+ alpha = load 'users' using PigStorage('\u0001') as (name, phone, address, 
city, state, zip);
+ a1 = order alpha by name parallel $mappers;
+ store a1 into 'users_sorted' using PigStorage('\u0001');
+ 
+ a = load 'power_users' using PigStorage('\u0001') as (name, phone, address, 
city, state, zip);
+ b = sample a 0.5;
+ store b into 'power_users_samples' using PigStorage('\u0001');
+ 
+ A = load 'page_views' as (user, action, timespent, query_term, ip_addr, 
timestamp,
+ estimated_revenue, page_info, page_links);
+ B = foreach A generate user, action, timespent, query_term, ip_addr, 
timestamp, estimated_revenue, page_info, page_links,
+ user as user1, action as action1, timespent as timespent1, query_term as 
query_term1, ip_addr as ip_addr1, timestamp as timestamp1, estimated_revenue as 
estimated_revenue1, page_info as page_info1, page_links as page_links1,
+ user as user2, action as action2, timespent as timespent2, query_term as 
query_term2, ip_addr as ip_addr2, timestamp as timestamp2, estimated_revenue as 
estimated_revenue2, page_info as page_

[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig?action=diff&rev1=1&rev2=2

--

  = Making Pig Latin Turing Complete =
  == Introduction ==
  As more users adopt Pig and begin writing their data processing in Pig Latin 
and as they use Pig to process more and more complex
- tasks, a consistent request from these users is to add branches, loops, and 
functions to Pig Latin.  This will enable Pig Latin to
+ tasks, a consistent request from these users has been to add branches, loops, 
and functions to Pig Latin.  This will enable Pig Latin to
  process a whole new class of problems.  Consider, for example, an algorithm 
that needs to iterate until an error estimate is less
  than a given threshold.  This might look like (this just suggests logic, not 
syntax):
  
@@ -22, +22 @@

  
  == Requirements ==
  The following should be provided by this Turing complete Pig Latin:
-  1. Branching.  This will be satisfied by a standard `if` `else if` `else` 
functionality
+  1. Branching.  This will be satisfied by a standard `if / else if / else` 
functionality
   1. Looping.  This should include standard `while` and some form of `for`.  
for could be C style or Python style (foreach).  Care needs to be taken to 
select syntax that does not cause confusion with the existing `foreach` 
operator in Pig Latin.
   1. Functions.  
   1. Modules.
@@ -49, +49 @@

   * Which scripting language to choose?  Perl, Python, and Ruby all have 
significant adoption and could make a claim to be the best choice.
   * Syntactic and semantic checking is usually delayed until an embedded bit 
of code is reached in the outer control flow.  Given that Pig jobs can run for 
hours this can mean spending hours to discover a simple typo.
  
- Consider for example if built a python class that wrapped !PigServer and then 
translated the above code snippet.
+ Consider for example if Pig provided a Jython class that wrapped !PigServer 
and then we translated the above code snippet.
  
  {{{
  error = 100.0
@@ -68, +68 @@

  grunt.exec("fs mv 'outfile' 'infile'")
  }}}
  
- All of these references to `pig` and `grunt` as objects with command strings 
is undesirable.
+ All of these references to `pig` and `grunt` as objects with command strings 
are undesirable.
  So while I believe that embedding is a much better approach due to the lower 
work load and the plethora of tools available for other
  languages, I do not believe the above is an acceptable way to do it.  Thus I 
would like to place three additional requirements on
  embedded Pig Latin beyond those given above for Turing complete Pig Latin:
@@ -79, +79 @@

  This overcomes two of the three drawbacks noted above.  It does not provide 
for a way to do certain optimizations such as loop
  unrolling, but I think that is acceptable.
  
+ Having rejected the quote style of programming we could choose the Domain 
Specific Language (DSL) option, where we define Pig operators in the
+ target language.  Again using Python as an example:
+ 
+ {{{
+error = 100.0
+infile = 'original.data'
+pig = PigServer()
+grunt = Grunt()
+while error > 1.0:
+A = pig.load(infile, { 'loader' => 'piggybank.MyLoader'});
+B = A.group(pig.ALL);
+C = B.foreach { 
+   innerBag = doSomeCalculation(:A);
+   generate innerBag.flatten().as(:result,  :error)
+}
+
+PigIterator pi = pig.openIterator(C, 'outfile');
+output = grunt.fs.cat('outfile');
+bla = output.partition("\t");
+error = bla(2)
+if error >= 1.0:
+grunt.fs.mv('outfile', 'infile');
+ }}}
+ 
+ This meets requirements 7 and 9 above.  It can partially but not fully meet 
8.  It can check that we use the right operators and pass
+ them the right types.  It cannot check the semantics of the operators, for 
example that `infile` exists and is readable.  This might be ok,
+ because it might turn out that things that cannot be checked at script 
compile time should not be checked up front anyway.  As an example, it should 
not 
+ check for `infile` up front because the script may not have created it yet.
+ 
+ This approach has the advantage that it will integrate very nicely with tools 
from the target language.  Debuggers, IDE, etc. will all now
+ view some form of Pig Latin as native to their language.
+ 
+ It does however have a drawback, which is that we would be creating a new 
dialect of Pig Latin.  There would be a Pig Latin dialect used when writing it
+ directly, and a different dialect for embedding.  This leads to confusion and 
duplication of effort.  So I would like to suggest another
+ requirement:
+ 
+   1.#10 Pig Latin should appear the same in the embedded form as in the 
non-embedded form.
+ 

[Pig Wiki] Update of "TuringCompletePig" by AlanGates

2010-06-07 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "TuringCompletePig" page has been changed by AlanGates.
http://wiki.apache.org/pig/TuringCompletePig

--

New page:
= Making Pig Latin Turing Complete =
== Introduction ==
As more users adopt Pig and begin writing their data processing in Pig Latin 
and as they use Pig to process more and more complex
tasks, a consistent request from these users is to add branches, loops, and 
functions to Pig Latin.  This will enable Pig Latin to
process a whole new class of problems.  Consider, for example, an algorithm 
that needs to iterate until an error estimate is less
than a given threshold.  This might look like (this just suggests logic, not 
syntax):

{{{
error = 100.0;
infile = 'original.data';
while (error > 1.0) {
A = load 'infile';
B = group A all;
C = foreach B generate flatten(doSomeCalculation(A)) as (result, error);
error = foreach C generate error;
store C into 'outfile';
if (error > 1.0) mv 'outfile' 'infile';
}
}}}

== Requirements ==
The following should be provided by this Turing complete Pig Latin:
 1. Branching.  This will be satisfied by a standard `if` `else if` `else` 
functionality
 1. Looping.  This should include standard `while` and some form of `for`.  for 
could be C style or Python style (foreach).  Care needs to be taken to select 
syntax that does not cause confusion with the existing `foreach` operator in 
Pig Latin.
 1. Functions.  
 1. Modules.
 1. The ability to use local in memory variables in the Pig Latin script.  For 
example, in the snippet given above the way `infile` is defined above the 
`while` and then used in the `load`.
 1. The ability to "store" results into local in memory variables.  For 
example, in the snippet given above the way the error calculation from the data 
processing is stored into `error` in the line `error = foreach C generate 
error;`.

== Approach ==
There are two possible approaches to this.  One is to add all of these features 
to Pig Latin itself.  This has several advantages:
 * All Pig Latin operations will be first class objects in the language.  There 
will not be a need to do quoted programming, like what happens when JDBC is 
used to write SQL inside a Java program.
 * There will be opportunities to do optimizations that are not available in 
embedded programming, such as loop unrolling, etc.

However, the cost of this approach is incredible.  It means turning Pig Latin 
into a full scripting language.  And it means
all kinds of tools like debuggers, etc. will never be available for Pig Latin 
users because the Pig team will not have the resources
or expertise to develop and maintain such tools.  And finally, does the world 
need another scripting language that starts with P?

The second possible approach to this is to embed Pig Latin into an existing 
scripting language, such as Perl, Python, Ruby, etc.  The
advantages of this are:
 * Most of the requirements noted above (branching, looping, functions, and 
modules) are present in these languages.
 * For any of these languages whole hosts of tools such as debuggers, IDEs, 
etc. exist and could be used.
 * Users do not have to learn a new language.

There are a few significant drawbacks to this approach:
 * It leads to a quoted programming style which is unnatural and irritating for 
developers.
 * Which scripting language to choose?  Perl, Python, and Ruby all have 
significant adoption and could make a claim to be the best choice.
 * Syntactic and semantic checking is usually delayed until an embedded bit of 
code is reached in the outer control flow.  Given that Pig jobs can run for 
hours this can mean spending hours to discover a simple typo.

Consider for example if built a python class that wrapped !PigServer and then 
translated the above code snippet.

{{{
error = 100.0
infile = 'original.data'
pig = PigServer()
grunt = Grunt()
while error > 1.0:
pig.registerQuery("A = load 'infile'; \
   B = group A all; \
   C = foreach B generate flatten(doSomeCalculation(A)) 
as (result, error); \
PigIterator pi = pig.openIterator("C", 'outfile');
output = grunt.exec("fs cat 'outfile'");
bla = output.partition("\t");
error = bla(2)
if error >= 1.0:
grunt.exec("fs mv 'outfile' 'infile'")
}}}

All of these references to `pig` and `grunt` as objects with command strings is 
undesirable.
So while I believe that embedding is a much better approach due to the lower 
work load and the plethora of tools available for other
languages, I do not believe the above is an acceptable way to do it.  Thus I 
would like to place three additional requirements on
embedded Pig Latin beyond those given above for Turing complete Pig Latin:
 1.#7 Pig Latin should appear as

[Pig Wiki] Update of "PigJournal" by AlanGates

2010-06-04 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=6&rev2=7

--

  
  '''Dependency:'''
  
- '''References:'''
+ '''References:'''  [[https://issues.apache.org/jira/browse/PIG-1434|PIG-1434]]
  
  '''Estimated Development Effort:'''  Small
  


[Pig Wiki] Update of "PigJournal" by AlanGates

2010-06-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigJournal" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigJournal?action=diff&rev1=5&rev2=6

--

  || Multiquery support   || 0.3
  || ||
  || Add skewed join  || 0.4
  || ||
  || Add merge join   || 0.4
  || ||
+ || Add Zebra as contrib project || 0.4
  || ||
  || Support Hadoop 0.20  || 0.5
  || ||
  || Improved Sampling|| 0.6
  || There is still room for improvement for order by sampling ||
  || Change bags to spill after reaching fixed size   || 0.6
  || Also created bag backed by Hadoop iterator for single UDF cases ||
@@ -32, +33 @@

  || Switch local mode to Hadoop local mode   || 0.6
  || ||
  || Outer join for default, fragment-replicate, skewed   || 0.6
  || ||
  || Make configuration available to UDFs || 0.6
  || ||
+ || Load Store Redesign  || 0.7
  || ||
+ || Add Owl as contrib project   || not yet released   
  || ||
+ || Pig Mix 2.0  || not yet released   
  || ||
  
  == Work in Progress ==
  This covers work that is currently being done.  For each entry the main JIRA 
for the work is referenced.
  
- || Feature  || JIRA   
|| Comments ||
+ || Feature  || JIRA   
  || Comments ||
- || Metadata || 
[[http://issues.apache.org/jira/browse/PIG-823|PIG-823]]   || ||
+ || Boolean Type || 
[[https://issues.apache.org/jira/browse/PIG-1429|PIG-1429]] || ||
- || Query Optimizer  || 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]] || ||
+ || Query Optimizer  || 
[[http://issues.apache.org/jira/browse/PIG-1178|PIG-1178]]   || ||
- || Load Store Redesign  || 
[[http://issues.apache.org/jira/browse/PIG-966|PIG-966]]   || ||
- || Add SQL Support  || 
[[http://issues.apache.org/jira/browse/PIG-824|PIG-824]]   || ||
- || Change Pig internal representation of charrarry to Text  || 
[[http://issues.apache.org/jira/browse/PIG-1017|PIG-1017]] || Patch ready, 
unclear when to commit to minimize disruption to users and destabilization to 
code base. ||
- || Integration with Zebra   || 
[[http://issues.apache.org/jira/browse/PIG-833|PIG-833]]   || ||
+ || Cleanup of javadocs  || 
[[https://issues.apache.org/jira/browse/PIG-1311|PIG-1311]] || ||
+ || UDFs in scripting languages  || 
[[https://issues.apache.org/jira/browse/PIG-928|PIG-928]]   || ||
+ || Ability to specify a custom partitioner  || 
[[https://issues.apache.org/jira/browse/PIG-282|PIG-282]]   || ||
+ || Pig usage stats collection   || 
[[https://issues.apache.org/jira/browse/PIG-1389|PIG-1389]], 
[[https://issues.apache.org/jira/browse/PIG-908|PIG-908]], 
[[https://issues.apache.org/jira/browse/PIG-864|PIG-864]], 
[[https://issues.apache.org/jira/browse/PIG-809|PIG-809]] || ||
+ || Make Pig available via Maven || 
[[https://issues.apache.org/jira/browse/PIG-1334|PIG-1334]] || ||
  
  
  == Proposed Future Work ==
@@ -68, +73 @@

  Within each subsection order is alphabetical and does not imply priority.
  
  === Agreed Work, Agreed Approach ===
-  Boolean Type 
- Boolean is currently supported internally as a type in Pig, but it is not 
exposed to users.  Data cannot be of type boolean, nor can UDFs (other than
- !FilterFuncs) return boolean.  Users have repeatedly requested that boolean 
be made a full type.
- 
- '''Category:'''  New Functionality
- 
- '''Dependency:'''  Will affect all !LoadCasters, as they will have to provide 
byteToBoolean methods.
- 
- '''References:'''
- 
- '''Estimated Development Effort:'''  small
- 
   Combiner Not Used with Limit or Filter 
  Pig Scripts that have a foreach with a nested limit or filter do not use the 
combiner even when they could.  Not all filters can use the combiner, but in 
some cases
  they can.  I think all limits could at least apply the limit in the combiner, 
though the UDF itself may only be executed in the reducer. 
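 For illustration, a script of roughly this shape (relation and file names are 
 hypothetical) is the kind of case described here: the nested limit could be 
 applied on the map side, yet the combiner is currently skipped. {{{
 -- Illustrative sketch only; relation and file names are made up.
 A = load 'clicks' as (user, url);
 B = group A by user;
 C = foreach B {
     firstTen = limit A 10;
     generate group, COUNT(firstTen);
 };
 store C into 'counts';
 }}}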
@@ -226, +219 @@

  
  '''Estimated Development Effort:'''  small
  
-  Pig Mix 2.0 
- Pig Mix has 

[Pig Wiki] Update of "PigInteroperability" by jeff zhan g

2010-05-21 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigInteroperability" page has been changed by jeff zhang.
http://wiki.apache.org/pig/PigInteroperability?action=diff&rev1=1&rev2=2

--

  == Pig and Hive RCFiles ==
  The !HiveColumnarLoader, available as part of PiggyBank in Pig 0.7.0.
  
+ == Pig and Voldemort ==
+ The Pig LoadFunc for Voldemort. 
+ See 
http://github.com/rsumbaly/voldemort/blob/hadoop/contrib/hadoop/src/java/voldemort/hadoop/pig/VoldemortStore.java
+ 


[Pig Wiki] Update of "HowToRelease" by daijy

2010-05-17 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowToRelease" page has been changed by daijy.
http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=21&rev2=22

--

  ant clean
  ant test
  ant clean
+ ant jar
+ cd contrib/zebra
+ ant
+ cd ../..
+ cd contrib/owl
+ ant
+ cd ../..
+ cd contrib/piggybank/java
+ ant
+ cd ../../..
  ant -Dversion=X.Y.Z  -Djava5.home= 
-Dforrest.home=  tar
  }}}
   2. Test the tar file by unpacking the release and


[Pig Wiki] Trivial Update of "LoadStoreMigrationGuide" by newacct

2010-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "LoadStoreMigrationGuide" page has been changed by newacct.
http://wiki.apache.org/pig/LoadStoreMigrationGuide?action=diff&rev1=38&rev2=39

--

  long end = Long.MAX_VALUE;
  private byte recordDel = '\n';
  private byte fieldDel = '\t';
- private ByteArrayOutputStream mBuf = null;
+ private ByteArrayOutputStream mBuf;
- private ArrayList mProtoTuple = null;
+ private ArrayList mProtoTuple;
  private static final String UTF8 = "UTF-8";
  
  public SimpleTextLoader() {
@@ -96, +96 @@

  case 'x':
  case 'u':
  this.fieldDel =
- Integer.valueOf(delimiter.substring(2)).byteValue();
+ (byte)Integer.parseInt(delimiter.substring(2));
  break;
  
  default:
  throw new RuntimeException("Unknown delimiter " + delimiter);
  }
  } else {
- throw new RuntimeException("PigStorage delimeter must be a single 
character");
+ throw new RuntimeException("PigStorage delimiter must be a single 
character");
  }
  }
  
@@ -141, +141 @@

  this.end = end;
  
  // Since we are not block aligned we throw away the first
- // record and cound on a different instance to read it
+ // record and count on a different instance to read it
  if (offset != 0) {
  getNext();
  }
@@ -179, +179 @@

  === New Implementation ===
  {{{
  public class SimpleTextLoader extends LoadFunc {
- protected RecordReader in = null;
+ protected RecordReader in;
  private byte fieldDel = '\t';
- private ArrayList mProtoTuple = null;
+ private ArrayList mProtoTuple;
  private TupleFactory mTupleFactory = TupleFactory.getInstance();
  private static final int BUFFER_SIZE = 1024;
  
@@ -207, +207 @@

  
  case 'x':
 fieldDel =
- Integer.valueOf(delimiter.substring(2), 16).byteValue();
+ (byte)Integer.parseInt(delimiter.substring(2), 16);
 break;
  
  case 'u':
  this.fieldDel =
- Integer.valueOf(delimiter.substring(2)).byteValue();
+ (byte)Integer.parseInt(delimiter.substring(2));
  break;
  
  default:
  throw new RuntimeException("Unknown delimiter " + delimiter);
  }
  } else {
- throw new RuntimeException("PigStorage delimeter must be a single 
character");
+ throw new RuntimeException("PigStorage delimiter must be a single 
character");
  }
  }
  
@@ -313, +313 @@

  case 'x':
  case 'u':
  this.fieldDel =
- Integer.valueOf(delimiter.substring(2)).byteValue();
+ (byte)Integer.parseInt(delimiter.substring(2));
  break;
  
  default:
  throw new RuntimeException("Unknown delimiter " + delimiter);
  }
  } else {
- throw new RuntimeException("PigStorage delimeter must be a single 
character");
+ throw new RuntimeException("PigStorage delimiter must be a single 
character");
  }
  }
  
@@ -496, +496 @@

  
  case 'x':
 fieldDel =
- Integer.valueOf(delimiter.substring(2), 16).byteValue();
+ (byte)Integer.parseInt(delimiter.substring(2), 16);
 break;
  case 'u':
  this.fieldDel =
- Integer.valueOf(delimiter.substring(2)).byteValue();
+ (byte)Integer.parseInt(delimiter.substring(2));
  break;
  
  default:
  throw new RuntimeException("Unknown delimiter " + delimiter);
  }
  } else {
- throw new RuntimeException("PigStorage delimeter must be a single 
character");
+ throw new RuntimeException("PigStorage delimiter must be a single 
character");
  }
  }
  


[Pig Wiki] Update of "HowToRelease" by daijy

2010-05-14 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowToRelease" page has been changed by daijy.
http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=20&rev2=21

--

  cd build
  md5sum pig-X.Y.Z.tar.gz > pig-X.Y.Z.tar.gz.md5
  }}}
+  4. If you do not have a gpg key pair, do the following steps:
+a. Generate a key pair using the following command. You can simply accept 
all default settings and give your name, email and passphrase. {{{
+ gpg --gen-key
+ }}}
+a. Export your public key. {{{
+ gpg --armor --output pubkey.txt --export 'Your Name'
+ }}}
+a. Open pubkey.txt, copy the full text, and paste it at the end of the 
following files, then commit these changes: {{{
+ https://svn.apache.org/repos/asf/hadoop/pig/branches/branch-X.Y.Z/KEYS
+ https://svn.apache.org/repos/asf/hadoop/pig/trunk/KEYS
+ }}}
+a. Upload updated KEYS to Apache. {{{
+ scp KEYS people.apache.org:/www/www.apache.org/dist/hadoop/pig/KEYS
+ }}}
+a. Export your private key, keep it with you. {{{
+ gpg --export-secret-key -a "Your Name" > private.key
+ }}}
-  4. Sign the release (see 
[[http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step|Step-By-Step 
Guide to Mirroring Releases]] for more information). [TODO: add details on how 
to generate and store keys]{{{
+  5. Sign the release (see 
[[http://www.apache.org/dev/mirror-step-by-step.html?Step-By-Step|Step-By-Step 
Guide to Mirroring Releases]] for more information). {{{
  gpg --armor --output pig-X.Y.Z.tar.gz.asc --detach-sig pig-X.Y.Z.tar.gz
  }}}
+  6. Verify gpg signature. {{{
+ gpg --import KEYS  (if necessary)
+ gpg --verify pig-X.Y.Z.tar.gz.asc pig-X.Y.Z.tar.gz
+ }}}
-  5. Copy release files to a public place (usually into public_html in your 
home directory):{{{
+  7. Copy release files to a public place (usually into public_html in your 
home directory):{{{
  ssh people.apache.org mkdir public_html/pig-X.Y.Z-candidate-0
  scp -p pig-X.Y.Z.tar.gz* people.apache.org:public_html/pig-X.Y.Z-candidate-0
  cd ..
  scp RELEASE_NOTES.txt people.apache.org:public_html/pig-X.Y.Z-candidate-0
  }}}
-  6. Call a release vote. The initial email should be sent to 
`pig-...@hadoop.apache.org`. Make sure to attache rat report to it. Here is a 
sample of email: {{{
+  8. Call a release vote. The initial email should be sent to 
`pig-...@hadoop.apache.org`. Make sure to attach the rat report to it. Here is a 
sample email: {{{
  From: Olga Natkovich [mailto:ol...@yahoo-inc.com] 
  Sent: Tuesday, November 25, 2008 3:59 PM
  To: pig-...@hadoop.apache.org
@@ -170, +191 @@

  }}}
   6. Update the front page news in 
author/src/documentation/content/xdocs/index.xml.
   7. Update the release news in 
author/src/documentation/content/xdocs/releases.xml.
-  7. Update the documentation links in 
author/src/documentation/content/xdocs/site.xml
+  8. Update the documentation links in 
author/src/documentation/content/xdocs/site.xml
-  8. Copy in the release specific documentation {{{
+  9. Copy in the release specific documentation {{{
  cd publish
  mkdir docs/rX.Y.Z
- cp -pr /build/docs/* publish/docs/rX.Y.Z/
+ cp -pr /docs/* publish/docs/rX.Y.Z/
  svn add publish/docs/rX.Y.Z
  }}}
-  9. Regenerate the site, review it and commit in HowToCommit.
+  10. Regenerate the site, review it and commit in HowToCommit.
-  10. Deploy your site changes.{{{
+  11. Deploy your site changes.{{{
  ssh people.apache.org
  cd /www/hadoop.apache.org/pig
  svn up
  }}}
-  10. Wait until you see your changes reflected on the Apache web site.
+  12. Wait until you see your changes reflected on the Apache web site.
-  11. Send announcements to the user and developer lists as well as 
(`annou...@haoop.apache.org`) once the site changes are visible. {{{
+  13. Send announcements to the user and developer lists as well as 
(`annou...@hadoop.apache.org`) once the site changes are visible. {{{
  Pig  team is happy to announce Pig X.Y.Z release. 
  
  Pig is a Hadoop subproject which provides a high-level data-flow language and 
execution framework for parallel computation on Hadoop clusters.
@@ -192, +213 @@

  
  The highlights of this release are ... The details of the release can be 
found at http://hadoop.apache.org/pig/releases.html.
  }}}
-  12. In JIRA, mark the release as released.
+  14. In JIRA, mark the release as released.
 a. Goto JIRA and click on Administration tab.
 a. Select the Pig project.
 a. Select Manage versions.
@@ -200, +221 @@

 a. If a description has not yet been added for the version you are 
releasing, select Edit Details and give a brief description of the release.
 a. If the next version does not exist (that is, if you are releasing 
version 0.x and version 0.x+1 does not yet exist), create it using the Add 
Version box at the top of the page.
  
-  13. In JIRA, mark the issues resolved in this release as closed.
+  15. In J

[Pig Wiki] Update of "PoweredBy" by DanHarvey

2010-05-06 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PoweredBy" page has been changed by DanHarvey.
http://wiki.apache.org/pig/PoweredBy?action=diff&rev1=1&rev2=2

--

   * [[http://twitter.com|Twitter]]<>
* We use Pig extensively to process usage logs, mine tweet data, and more.
* We maintain [[http://github.com/kevinweil/elephant-bird|Elephant 
Bird]], a set of libraries for working with Pig, LZO compression, protocol 
buffers, and more.
-   * More details can be seen in this presentation: 
http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010<>
+   * More details can be seen in this presentation: 
http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010
+ 
   * [[http://www.yahoo.com/|Yahoo!]]
* More than 100,000 CPUs in >25,000 computers running Hadoop
* Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
@@ -31, +32 @@

* [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about 
how we use Hadoop.
* >40% of Hadoop Jobs within Yahoo are Pig jobs.
  
+  * [[http://www.mendeley.com|Mendeley]]<>
+   * We are creating a platform to aggregate research and allow researchers to 
get the most out of the web.
+   * We moved all our catalogue stats and analysis to HBase and Pig
+   * We are using Scribe in combination with Pig for all our server, 
application and user log processing.
+   * Pig helps us get business analytics, user experience evaluation, feature 
feedback, and more out of these logs
+ 


[Pig Wiki] Update of "PigLatin" by test_abc

2010-05-05 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigLatin" page has been changed by test_abc.
http://wiki.apache.org/pig/PigLatin?action=diff&rev1=35&rev2=36

--

  {{{
  <1>
  <3>
+ <4>
  <5>
  }}}
  


[Pig Wiki] Update of "PigLatin" by OlgaN

2010-05-05 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigLatin" page has been changed by OlgaN.
http://wiki.apache.org/pig/PigLatin?action=diff&rev1=34&rev2=35

--

  <>
  <>
  <>
+ 
+ '''THIS PAGE IS OBSOLETE. Please use documentation at 
http://hadoop.apache.org/pig/'''
  
  '''Note:''' For Pig 0.2.0 or later, some content on this page may no longer 
be applicable.
  


[Pig Wiki] Update of "Eclipse_Environment" by ThejasN air

2010-04-30 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Eclipse_Environment" page has been changed by ThejasNair.
http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=15&rev2=16

--

   * Window > Open Perspective > Java
   * Window > Show View > ''see the various options ...''
  
-  Download jars and generate code 
- To download the required jars and generate code in src-gen, run 'ant jar' in 
trunk dir.
- 
   Update the Build Configuration 
-  * run 'ant eclipse-files' in trunk/ dir.
+  * run 'ant eclipse-files' in trunk/ dir. 
   * Refresh the project in eclipse 
  You are all set now!
  


[Pig Wiki] Update of "HowToDocumentation" by OlgaN

2010-04-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowToDocumentation" page has been changed by OlgaN.
http://wiki.apache.org/pig/HowToDocumentation?action=diff&rev1=11&rev2=12

--

 * Run the "ant docs" command: ant docs -Djava5.home=''java5_path'' 
-Dforrest.home=''forrest_path''
 * To check the *.html and *.pdf output, change to this directory: 
/trunk/docs
  
+ For releases, be sure to do the following:
+* Update the doc tab
+   * Open the tabs.xml file 
(…/src/docs/src/documentation/content/xdocs/tabs.xml)
+   * Update the doc tab for the current release. For example, change the 
entry that points to the previous release's documentation so that it points to 
the new release.
+* Update the API link
+   * Open the site.xml file 
(…/src/docs/src/documentation/content/xdocs/site.xml)
+   * Update the external api reference for the current release. For 
example, change http://hadoop.apache.org/pig/docs/r0.6.0/api/ 
to http://hadoop.apache.org/pig/docs/r0.7.0/api/
+ 


[Pig Wiki] Update of "Eclipse_Environment" by ThejasN air

2010-04-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Eclipse_Environment" page has been changed by ThejasNair.
http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=14&rev2=15

--

   * Refresh the project in eclipse 
  You are all set now!
  
- The 'ant eclipse-files' target does not exist in revisions before r938733, 
and you have to follow the steps below  -
+ The 'ant eclipse-files' target that generates eclipse configuration does not 
exist in revisions before r938733. So if you checked out an earlier version, 
you have to follow the steps below  -
  
  After the Java project is created, update the build configuration.
  


[Pig Wiki] Update of "Eclipse_Environment" by ThejasN air

2010-04-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Eclipse_Environment" page has been changed by ThejasNair.
http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=13&rev2=14

--

   * Window > Open Perspective > Java
   * Window > Show View > ''see the various options ...''
  
+  Download jars and generate code 
+ To download the required jars and generate code in src-gen, run 'ant jar' in 
trunk dir.
+ 
   Update the Build Configuration 
+  * run 'ant eclipse-files' in trunk/ dir.
+  * Refresh the project in eclipse 
+ You are all set now!
+ 
+ The 'ant eclipse-files' target does not exist in revisions before r938733, 
and you have to follow the steps below  -
+ 
  After the Java project is created, update the build configuration.
  
  To update the build configuration:


[Pig Wiki] Trivial Update of "PigAbstractionLayer" by n ewacct

2010-04-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigAbstractionLayer" page has been changed by newacct.
http://wiki.apache.org/pig/PigAbstractionLayer?action=diff&rev1=4&rev2=5

--

 * Created an entity handle for a container.
 * 
 * @param name of the container
-* @return a container descripto
+* @return a container description
 * @throws DataStorageException if name does not conform to naming 
 * convention enforced by the Data Storage.
 */
@@ -192, +192 @@

  }}}
  
  === Data Storage Descriptors ===
- Descriptors are a represenation of entities in the Data Storage and are used 
to access and carry out operations on such entities.
+ Descriptors are a representation of entities in the Data Storage and are used 
to access and carry out operations on such entities.
  There are Element Descriptors and Container Descriptors. The latter are 
descriptors for entities that contain Data Storage Element Descriptors.
  
  {{{


[Pig Wiki] Update of "HowToRelease" by AlanGates

2010-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowToRelease" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=19&rev2=20

--

  BUG FIXES
  PIG-342: Fix DistinctDataBag to recalculate size after it has spilled. 
(bdimcheff via gates)
  }}}
-  2. Edit `src/docs/src/documentation/content/xdocs/site.xml`.  In the 
external reference for api where the link contains 
`change_to_correct_version_number_after_branching` change this string to the
+  2. Edit `src/docs/src/documentation/content/xdocs/site.xml`.  In the 
external reference for api where the link contains the previous version number 
change this string to the correct version number.
-  correct version number.
   3. Commit these changes to trunk:{{{
  svn commit -m "Preparing for release X.Y.Z"
  }}}


[Pig Wiki] Update of "FrontPage" by DmitriyRyaboy

2010-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "FrontPage" page has been changed by DmitriyRyaboy.
http://wiki.apache.org/pig/FrontPage?action=diff&rev1=147&rev2=148

--

  
   * [[http://hadoop.apache.org/pig/|Official Apache Pig Website]]
   * PigTalksPapers - Pig talks, papers, interviews 
+  * PoweredBy - a (partial) list of companies using Pig
   
  == User Documentation ==
  


[Pig Wiki] Update of "PoweredBy" by DmitriyRyaboy

2010-04-22 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PoweredBy" page has been changed by DmitriyRyaboy.
http://wiki.apache.org/pig/PoweredBy

--

New page:
Applications and organizations using Pig include (alphabetically):

 * [[http://www.cooliris.com/|Cooliris]] - Cooliris transforms your browser 
into a lightning fast, cinematic way to browse photos and videos, both online 
and on your hard drive.
  * We have a 15-node Hadoop cluster where each machine has 8 cores, 8 GB ram, 
and 3-4 TB of storage.
  * We use Hadoop for all of our analytics, and we use Pig to allow PMs and 
non-engineers the freedom to query the data in an ad-hoc manner.<>
 * [[http://www.dropfire.com/|DropFire]]
  * We generate Pig Latin scripts that describe structural and semantic 
conversions between data contexts
  * We use Hadoop to execute these scripts for production-level deployments
  * Eliminates the need for explicit data and schema mappings during database 
integration
 * [[http://www.linkedin.com/|LinkedIn]]
  * 3x30 Nehalem-based node grids, with 2x4 cores, 16GB RAM, 8x1TB storage 
using ZFS in a JBOD configuration.
  * We use Hadoop and Pig for discovering People You May Know and other fun 
facts.
 * [[http://www.ning.com/|Ning]]
  * We use Hadoop to store and process our log file
  * We rely on Apache Pig for reporting, analytics, Cascading for machine 
learning, and on a proprietary [[/hadoop/JavaScript|JavaScript]] API for ad-hoc 
queries
  * We use commodity hardware, with 8 cores and 16 GB of RAM per machine



 * [[http://twitter.com|Twitter]]<>
  * We use Pig extensively to process usage logs, mine tweet data, and more.
  * We maintain [[http://github.com/kevinweil/elephant-bird|Elephant 
Bird]], a set of libraries for working with Pig, LZO compression, protocol 
buffers, and more.
  * More details can be seen in this presentation: 
http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010<>
 * [[http://www.yahoo.com/|Yahoo!]]
  * More than 100,000 CPUs in >25,000 computers running Hadoop
  * Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
  * Used to support research for Ad Systems and Web Search
  * Also used to do scaling tests to support development of Hadoop on larger 
clusters
  * [[http://developer.yahoo.com/blogs/hadoop|Our Blog]] - Learn more about how 
we use Hadoop.
  * >40% of Hadoop Jobs within Yahoo are Pig jobs.


[Pig Wiki] Update of "Eclipse_Environment" by Ashutos hChauhan

2010-04-09 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "Eclipse_Environment" page has been changed by AshutoshChauhan.
http://wiki.apache.org/pig/Eclipse_Environment?action=diff&rev1=12&rev2=13

--

  For Pig, you need the 
[[http://www.easyeclipse.org/site/plugins/javacc.html|JavaCC plugin]] and 
the[[http://subclipse.tigris.org/|Subclipse Subversion plugin]].
  
  To download and install the plugins:
+ 
-  1. Open Eclipse 
+  1. Open Eclipse
   1. Select Help > Software Updates... > Available Software
   1. Add the two plugin sites by pressing Add Site... Button
-  * http://eclipse-javacc.sourceforge.net
+  1. http://eclipse-javacc.sourceforge.net
-  * http://subclipse.tigris.org/update_1.4.x
+  1. http://subclipse.tigris.org/update_1.4.x
-  1.#4 Select the plugins that appear under these sites
+  1. Select the plugins that appear under these sites
   1. Press Install - and follow the prompts to download and install the plugins
  
   Add the Pig Trunk Repository 
  To add the Pig trunk repository:
+ 
   1. Open Eclipse
   1. Select file > New > Other...
   1. Choose SVN, Repository Location > Next
   1. Under the General tab:
-  * URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
+  1. URL: http://svn.apache.org/repos/asf/hadoop/pig/trunk
-  * Use a custom label: Pig
+  1. Use a custom label: Pig
-  1.#5 Click Finish
+  1. Click Finish
  
  To view the results:
+ 
   * Window > Open Perspective > Other... > SVN Repository Exploring
   * Window > Show View > SVN Repositories
  
   Create a Java Project 
- 
  First, create a directory on your development machine (for example "mypig") 
and checkout the Pig source from SVN: 
http://svn.apache.org/repos/asf/hadoop/pig/trunk
  
  Note: Windows users need to download and install TortoiseSVN 
(http://tortoiseSVN.tigris.org/)
  
  To create a Java project:
+ 
   1. Open Eclipse
   1. Select file > New > Other ...
   1. Select Java Project
   1. On the New Java Project dialog:
-  * Project name: !PigProject
+  1. Project name: !PigProject
-  * Select: Create project from existing source
+  1. Select: Create project from existing source
-  * Directory: browse to the "mypig" directory on your development machine and 
select the Trunk directory
+  1. Directory: browse to the "mypig" directory on your development machine 
and select the Trunk directory
-  1.#5 Click Next
+  1. Click Next
   1. Click Finish
  
  To view the results:
+ 
   * Window > Open Perspective > Java
   * Window > Show View > ''see the various options ...''
  
@@ -54, +58 @@

  After the Java project is created, update the build configuration.
  
  To update the build configuration:
+ 
   1. Open Eclipse
   1. Select Window > Open Perspective > Java (to open the !MyPig project)
   1. Select Project > Properties
   1. For the Java Build Path, check the settings as shown below.
  
  Source
+ 
  {{{
  lib-src/bzip2
  lib-src/shock
@@ -68, +74 @@

  test -> Make sure nothing is excluded
  
  The default output folder should be bin.
+ }}}
+ Libraries
  
- }}}
- 
- 
- Libraries
  {{{
  lib/hadoopXXX.jar
  lib/hbaseXXX-test.jar
  lib/hbaseXXX.jar
+ lib/Pig/zookeeper-hbase-xxx.jar
  build/ivy/lib/Pig/javacc.jar
- build/ivy/lib/Pig/jline-XXX.jar 
+ build/ivy/lib/Pig/jline-XXX.jar
  build/ivy/lib/Pig/jsch-xxx.jar
  build/ivy/lib/Pig/junit-xxx.jar
+ }}}
+ NOTE:
  
- }}}
- NOTE: For pig sources checked out from Apache before revision r771273, 
replace "build/ivy/lib/Pig" with "lib". Revision r771273 and above in apache 
svn use ivy to resolve dependencies need to build pig. 
+  1. For pig sources checked out from Apache before revision r771273, replace 
"build/ivy/lib/Pig" with "lib". Revision r771273 and above in apache svn use 
ivy to resolve the dependencies needed to build pig.
+  1. If you are building piggybank you will need a few extra jars. You can find 
all of them in build/ivy/lib/Pig/ once you run the jar target of ant successfully.
  
  Order and Export
+ 
  {{{
  Should be in the following order:
  
@@ -96, +104 @@

  src
  JRE System Library
  all the jars from the "Libraries" tab
- 
  }}}
- 
- 
   Troubleshooting 
-* Build problems: Check if eclipse is using JDK version 1.6, pig needs it 
(Under Preferences/Java/Compiler).
+  * Build problems: Check if eclipse is using JDK version 1.6, pig needs it 
(Under Preferences/Java/Compiler).
  
   Tips 
-* To build using eclipse , open the ant window (Windows/Show View/Ant) , 
then drag and drop build.xml under your project to this window. Double click on 
jar in that will build pig.jar, on test will run unit tests.
+  * To build using Eclipse, open the Ant window (Window/Show View/Ant), 
then drag and drop build.xml under your project into this window. Double-clicking 
on jar will build pig.jar; double-clicking on test will run the unit tests.
  


[Pig Wiki] Update of "HowToRelease" by AlanGates

2010-04-08 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "HowToRelease" page has been changed by AlanGates.
http://wiki.apache.org/pig/HowToRelease?action=diff&rev1=18&rev2=19

--

  BUG FIXES
  PIG-342: Fix DistinctDataBag to recalculate size after it has spilled. 
(bdimcheff via gates)
  }}}
+  2. Edit `src/docs/src/documentation/content/xdocs/site.xml`.  In the 
external reference for api where the link contains 
`change_to_correct_version_number_after_branching` change this string to the
+  correct version number.
   3. Commit these changes to trunk:{{{
  svn commit -m "Preparing for release X.Y.Z"
  }}}
@@ -56, +58 @@

   7. Commit these changes to trunk:{{{
  svn commit -m "Preparing for X.Y+1.0 development"
  }}}
- 
- TODO:
-  1. Add documentation update the process once we integrate the documentation 
into forrect. (Will need docs target in build.xml)
  
  == Updating Release Branch ==
  


[Pig Wiki] Update of "owl" by jaytang

2010-04-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "owl" page has been changed by jaytang.
http://wiki.apache.org/pig/owl?action=diff&rev1=15&rev2=16

--

  
  
  || Feature || Status ||
- || Owl is a stand-alone table store, not tied to any particular data query or 
processing languages, currently supporting MR, Pig Latin, and Pig SQL || 
current ||
+ || Owl is a stand-alone table store, not tied to any particular data query or 
processing languages, supporting MR, Pig Latin, and Pig SQL || current ||
  || Owl has a flexible data partitioning model, with multiple levels of 
partitioning, physical and logical partitioning, and partition pruning for 
query optimization || current ||
  || Owl has a flexible interface for pushing projections and filters all the 
way down || current ||
  || Owl has a framework for storing data in many storage formats, and 
different storage formats can co-exist within the same table || current ||
@@ -39, +39 @@

  
  == Prerequisite ==
  
- Owl depends on Pig for its tuple classes as its basic unit of data container, 
and Hadoop 20 for !OwlInputFormat.  Its first release will require Pig 0.7 and 
Hadoop 20.2.  Owl also requires a storage driver; Owl integrates with Zebra 0.7 
out-of-the-box.
+ Owl depends on Pig for its tuple classes as its basic unit of data container, 
and Hadoop 20 for !OwlInputFormat.  Its first release will require Pig 0.7 or 
later and Hadoop 20.2 or later.  Owl integrates with Zebra 0.7 out-of-the-box.
  
  == Getting Owl ==
  
@@ -78, +78 @@

  
  After installing Tomcat and MySQL, you will need these files:
  
-* owl-<0.x.x>.war - owl web application
+* owl-<0.x.x>.war - owl web application at contrib/owl/build
-* owl-<0.x.x>.jar - owl client library ''!OwlInputFormat'' and 
''!OwlDriver'' with all their dependent 3rd party libraries
+* owl-<0.x.x>.jar - owl client library ''!OwlInputFormat'' and 
''!OwlDriver'' with all their dependent 3rd party libraries at contrib/owl/build
 * mysql
* mysql_schema.sql - owl database schema file at contrib/owl/setup/mysql
* owlServerConfig.xml - owl server configuration file at 
contrib/owl/setup/mysql
@@ -87, +87 @@

* oracle_schema.sql - owl database schema file at 
contrib/owl/setup/oracle
* owlServerConfig.xml - owl server configuration file at 
contrib/owl/setup/oracle
  
- Set up parameters in owlServerConfig:
- 
-* update jdbc driver connection information in owlServerConfig.xml
-* put this file on the same box where tomcat is installed
- 
  Create db schema in !MySql:
  
 * create a database "owl" in mysql
 * create db schema with "mysql_schema.sql"
 * make sure the user specified in jdbc connection string has full access 
to all objects in the newly created "owl" db
+ 
+ Set up parameters in owlServerConfig:
+ 
+* update jdbc driver connection information in owlServerConfig.xml
+* put this file on the same box where tomcat is installed
  
  Deploy Owl to Tomcat:
  


[Pig Wiki] Update of "owl" by jaytang

2010-04-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "owl" page has been changed by jaytang.
http://wiki.apache.org/pig/owl?action=diff&rev1=14&rev2=15

--

  Sample code is attached to write a client application against owl:
  * Sample code using !OwlDriver API: 
[[attachment:TestOwlDriverSample.java]]
  
+ == Next Step ==
+ 
+ We recognize that Hive already addressed some of the above problems, and that 
there is significant overlap between Owl and Hive. Yet we also believe that Owl 
adds important new features that are necessary for managing very large tables. 
We look forward to collaborating with the Hive team on finding the right model 
for integration between the two systems and creating a unified data management 
system for Hadoop. 
+ 


[Pig Wiki] Update of "owl" by jaytang

2010-04-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "owl" page has been changed by jaytang.
http://wiki.apache.org/pig/owl?action=diff&rev1=13&rev2=14

--

  
  
  == High Level Diagram ==
+ 
+ {{attachment:owl.jpg}}
  
  As one can see, Owl gives Hadoop users a uniform interface for organizing, 
discovering and managing data stored in many different formats, and to promote 
interoperability among different programming frameworks. Owl presents a single 
logical view of data organization and hides the complexity and evolutions in 
underlying physical data layout schemes. It gives Hadoop applications a stable 
foundation to build upon. 
  


New attachment added to page owl on Pig Wiki

2010-04-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page "owl" for change notification. An attachment 
has been added to that page by jaytang. Following detailed information is 
available:

Attachment name: owl.jpg
Attachment size: 20679
Attachment link: 
http://wiki.apache.org/pig/owl?action=AttachFile&do=get&target=owl.jpg
Page link: http://wiki.apache.org/pig/owl


[Pig Wiki] Update of "owl" by jaytang

2010-04-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "owl" page has been changed by jaytang.
http://wiki.apache.org/pig/owl?action=diff&rev1=12&rev2=13

--

  The core M/R programming interface as we know it (the mapper, reducer, output 
collector, record reader and input format ) all deal with collection of 
abstract data objects, not files. However, the current set of !InputFormat 
implementations provided by job API are relatively primitive and are heavily 
coupled to file formats and HDFS paths to describe input and output locations. 
From an application programmer’s perspective, one has to think about both the 
abstract data and the physical representation and storage location, which is a 
disconnect from the abstract data API. In the meantime, the number of file 
formats and (de)serialization libraries have flourished in the Hadoop 
community. Some of these require certain metadata to operate/optimize. While 
providing optimization and performance enhancements, these file formats and 
SerDe libs don’t make it any easier to develop applications on and manage very 
big data sets. 
  
  
- == High Level Diagram == 
+ == High Level Diagram ==
  
  As one can see, Owl gives Hadoop users a uniform interface for organizing, 
discovering and managing data stored in many different formats, and to promote 
interoperability among different programming frameworks. Owl presents a single 
logical view of data organization and hides the complexity and evolutions in 
underlying physical data layout schemes. It gives Hadoop applications a stable 
foundation to build upon. 
  
@@ -34, +34 @@

  || Owl has support for converting data between write-friendly and 
read-friendly formats || future ||
  || Owl has support for addressing HDFS NameNode limitations by decreasing the 
number of files needed to store very large data sets || future ||
  || Owl provides a security model for secure data access || future ||
- 
  
  == Prerequisite ==
  
@@ -102, +101 @@

  * deploy owl war file to Tomcat
  * set up -Dorg.apache.hadoop.owl.xmlconfig= for the Tomcat deployment
  
- == Developing on Owl == 
+ == Developing on Owl ==
  
  Owl has two major public APIs.  ''Owl Driver'' provides management APIs 
against three core Owl abstractions: "Owl Table", "Owl Database", and 
"Partition".  This API is backed up by an internal Owl metadata store that runs 
on Tomcat and a relational database.  ''!OwlInputFormat'' provides a data 
access API and is modeled after the traditional Hadoop !InputFormat.  In the 
future, we plan to support ''!OwlOutputFormat'' and thus the notion of "Owl 
Managed Table" where Owl controls the data flow into and out of "Owl Tables".  
Owl also supports Pig integration with OwlPigLoader/Storer module.
  


[Pig Wiki] Update of "owl" by jaytang

2010-04-01 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "owl" page has been changed by jaytang.
http://wiki.apache.org/pig/owl?action=diff&rev1=11&rev2=12

--

  
  = Apache Owl Wiki =
  
- The goal of Owl is to provide a high level data management abstraction.  
!MapReduce and Pig applications interacting directly with HDFS directories and 
files must deal with low level data management issues such as storage format, 
serialization/compression schemes, data layout, and efficient data accesses, 
etc, often with different solutions. Owl aims to provide a standard way to 
addresses this issue and abstracts away the complexities of reading/writing 
huge amount of data from/to HDFS.
+ == Vision ==
  
- Owl provides a tabular view of data on Hadoop and thus supports the notion of 
''Owl Tables''.  Conceptually, it is similar to a relation database table.  An 
Owl Table has these characteristics:
+ Owl provides a more natural abstraction for Map-Reduce and Map-Reduce-based 
technologies (e.g., Pig, SQL) by allowing developers to express large datasets 
as tables, which in turn consist of rows and columns. Owl tables are similar, 
but not identical to familiar database / data warehouse tables.
  
+ The core M/R programming interface as we know it (the mapper, reducer, output 
collector, record reader and input format ) all deal with collection of 
abstract data objects, not files. However, the current set of !InputFormat 
implementations provided by job API are relatively primitive and are heavily 
coupled to file formats and HDFS paths to describe input and output locations. 
From an application programmer’s perspective, one has to think about both the 
abstract data and the physical representation and storage location, which is a 
disconnect from the abstract data API. In the meantime, the number of file 
formats and (de)serialization libraries have flourished in the Hadoop 
community. Some of these require certain metadata to operate/optimize. While 
providing optimization and performance enhancements, these file formats and 
SerDe libs don’t make it any easier to develop applications on and manage very 
big data sets. 
-* lives in an Owl database name space and could contain multiple partitions
-* has columns and rows and supports a unified table level schema
-* interface to supports !MapReduce and Pig Latin and can easily work with 
other languages
-* designed for efficient batch read/write operations, partitions can be 
added or removed from a table
-* supports external tables (data already exists on file system)
-* pluggable architecture for different storage format such as Zebra
-* presents a logically partitioned view of data and supports very large 
data set via its multi-level flexible partitioning scheme
-* efficient data access mechanisms over very large data set via partition 
and projection pruning
  
  
- Owl has two major public APIs.  ''Owl Driver'' provides management APIs 
against three core Owl abstractions: "Owl Table", "Owl Database", and 
"Partition".  This API is backed up by an internal Owl metadata store that runs 
on Tomcat and a relational database.  ''!OwlInputFormat'' provides a data 
access API and is modeled after the traditional Hadoop !InputFormat.  In the 
future, we plan to support ''!OwlOutputFormat'' and thus the notion of "Owl 
Managed Table" where Owl controls the data flow into and out of "Owl Tables".  
Owl also supports Pig integration with OwlPigLoader/Storer module.
+ == High Level Diagram == 
  
- Initially, we like to open source Owl as a Pig contrib project.  In the long 
term, Owl could become a separate Hadoop subproject as it provides a platform 
service all Hadoop applications.
+ As one can see, Owl gives Hadoop users a uniform interface for organizing, 
discovering and managing data stored in many different formats, and to promote 
interoperability among different programming frameworks. Owl presents a single 
logical view of data organization and hides the complexity and evolutions in 
underlying physical data layout schemes. It gives Hadoop applications a stable 
foundation to build upon. 
+ 
+ == Main Properties and Features ==
+ 
+ 
+ || Feature || Status ||
+ || Owl is a stand-alone table store, not tied to any particular data query or 
processing languages, currently supporting MR, Pig Latin, and Pig SQL || 
current ||
+ || Owl has a flexible data partitioning model, with multiple levels of 
partitioning, physical and logical partitioning, and partition pruning for 
query optimization || current ||
+ || Owl has a flexible interface for pushing projections and filters all the 
way down || current ||
+ || Owl has a framework for storing data in many storage formats, and 
different storage formats can co-exist within the same table || current ||
+ || Owl provides capability discovery mechanism to allow applications to tak

[Pig Wiki] Update of "FrontPage" by AlanGates

2010-03-31 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "FrontPage" page has been changed by AlanGates.
http://wiki.apache.org/pig/FrontPage?action=diff&rev1=146&rev2=147

--

  
  = Apache Pig Wiki =
  
- [[http://incubator.apache.org/pig/|Apache Pig]] is a platform for analyzing 
large data sets. Pig's language, Pig Latin, lets you specify a sequence of data 
transformations such as merging data sets, filtering them, and applying 
functions to records or groups of records. Pig comes with many built-in 
functions but you can also create your own user-defined functions to do 
special-purpose processing. 
+ [[http://hadoop.apache.org/pig/|Apache Pig]] is a platform for analyzing 
large data sets. Pig's language, Pig Latin, lets you specify a sequence of data 
transformations such as merging data sets, filtering them, and applying 
functions to records or groups of records. Pig comes with many built-in 
functions but you can also create your own user-defined functions to do 
special-purpose processing. 
  
  Pig Latin programs run in a distributed fashion on a cluster (programs are 
compiled into Map/Reduce jobs and executed using Hadoop). For quick 
prototyping, Pig Latin programs can also run in "local mode" without a cluster 
(all processing takes place in a single local JVM).
  
@@ -20, +20 @@

  
  '''Why Pig Latin instead of SQL?'''  
[[http://www.cs.cmu.edu/~olston/publications/sigmod08.pdf|Pig Latin: A 
Not-So-Foreign Language ...]]
  
- '''Pig Has Grown Up!'''. On 10/22/08 Pig graduated from the 
[[http://incubator.apache.org/|Incubator]] and joined 
[[http://hadoop.apache.org/|Apache Hadoop]] as a subproject.
- 
- '''Pig is Getting Faster!'''  2-6 times faster, for many queries.  We've 
created a set of benchmarks and run them against the pig 0.1.0 release 
(modified to run on hadoop 0.18) and against the current trunk (previously 
`types` branch.) Joins and order bys in particular made large performance 
gains. For complete details see PigMix.
- 
- '''Interested in Pig Guts?''' We are completely redesigning the Pig execution 
and optimization framework. For design details see PigOptimizationWishList and 
PigExecutionModel. 
- 
- '''Want to contribute but don't know where to kick in?''' Here is a 
[[http://wiki.apache.org/pig/ProposedProjects|list of project]] that we 
desired. We need new blood! 
+ '''Want to contribute but don't know where to kick in?''' Here is our 
[[http://wiki.apache.org/pig/PigJournal|journal]] of projects we have worked 
on, are working on,
+ and hope to work on.  Find a project that interests you and jump on in.
  
  '''Pig available as part of Amazon's Elastic !MapReduce''', as of August 2009.
  
@@ -40, +35 @@

   * [[http://hadoop.apache.org/pig/|User Documentation]]
   * [[http://www.cloudera.com/hadoop-training-pig-introduction|Online Pig 
Training]] - Complete with video lectures, exercises, and a pre-configured 
virtual machine. Developed by Cloudera and Yahoo!
   * PiggyBank - User-defined functions (UDFs) contributed by Pig users!
+  * PigTools - Tools Pig users have built around and on top of Pig.
+  * PigInteroperability - How to make Pig work with other platforms you may be 
using, such as HBase and Cassandra.
  
  == Developer Documentation ==
   * How tos


[Pig Wiki] Update of "PigInteroperability" by AlanGates

2010-03-31 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigInteroperability" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigInteroperability

--

New page:
This page describes how Pig interoperates with other platforms, such as HBase
and Hive.

== Pig and Cassandra ==
http://issues.apache.org/jira/browse/CASSANDRA-910

A loader for loading Cassandra data into Pig.  Works with Pig 0.7.0 (branched
but not yet released as of 3/31/2010).

== Pig and HBase ==
In Pig 0.6 and before, the built-in HBaseStorage can be used to load data from
HBase.  Work is ongoing to enhance this loader and to make it a storage function
as well.  See http://issues.apache.org/jira/browse/PIG-1205
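
A rough sketch of what such a load looks like (the table and column names here
are made up, and the exact location syntax and column-argument format vary
between Pig versions, so treat this as an assumption to check against the
HBaseStorage documentation): {{{
-- Hypothetical table and columns; the hbase:// location prefix and the
-- 'family:column' argument format are assumptions in this sketch.
raw = load 'hbase://users'
      using org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age')
      as (name:chararray, age:int);
dump raw;
}}}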

== Pig and Hive RCFiles ==
The !HiveColumnarLoader, available as part of PiggyBank in Pig 0.7.0.
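
A hedged sketch of loading data with it from Pig Latin (the warehouse path and
column list are invented, and the constructor argument shown, a Hive-style
column schema string, is an assumption to verify against the PiggyBank
javadoc): {{{
-- Hypothetical path and schema; the constructor argument format is an assumption.
register piggybank.jar;
clicks = load '/user/hive/warehouse/clicks'
         using org.apache.pig.piggybank.storage.HiveColumnarLoader('user string,url string,ts int');
recent = filter clicks by ts > 20100101;
dump recent;
}}}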


[Pig Wiki] Update of "PigTools" by AlanGates

2010-03-31 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "PigTools" page has been changed by AlanGates.
http://wiki.apache.org/pig/PigTools?action=diff&rev1=14&rev2=15

--

  http://code.google.com/p/pig-eclipse
  
  Provides a Pig Latin editor in Eclipse with syntax 
highlighting. I built it out of interest; it currently has fewer features than 
pigpen. 
+ 
+ === Elephant-Bird ===
+ http://github.com/kevinweil/elephant-bird/
+ 
+ Twitter's library of LZO and/or Protocol Buffer-related Hadoop !InputFormats, 
!OutputFormats, Writables, Pig !LoadFuncs, HBase miscellanea, etc. The majority 
of these are in production at Twitter running over data every day.
  
  === Emacs Pig Latin Mode ===
  http://github.com/cloudera/piglatin-mode


[Pig Wiki] Update of "owl" by jaytang

2010-03-26 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "owl" page has been changed by jaytang.
http://wiki.apache.org/pig/owl?action=diff&rev1=10&rev2=11

--

  * !OwlInputFormat API - org.apache.hadoop.owl.mapreduce
  
  Sample code is attached to write a client application against owl:
+ * Sample code using !OwlDriver API: 
[[attachment:TestOwlDriverSample.java]]
  

