What is a relation?
All, A question on types in Pig. When you say:

A = load 'myfile';

what exactly is A? For the moment let us call A a relation, since it is a set of records and we can pass it to a relational operator, such as FILTER, ORDER, etc. To clarify the question: is a relation equivalent to a bag? In some ways it seems to be in our current semantics. Certainly you can turn a relation into a bag:

A = load 'myfile';
B = group A all;

The schema of the relation B at this point is (group, A), where A is a bag. This does not necessarily mean that a relation is a bag, because an operation had to occur to turn the relation into a bag (the group all). But bags can be turned into relations, and then treated again as if they were bags:

C = foreach B {
    C1 = filter A by $0 > 0;
    generate COUNT(C1);
}

Here the bag A created in the previous grouping step is being treated as if it were a relation and passed to a relational operator, and the resulting relation (C1) is treated as a bag to be passed to COUNT. So at a very minimum it seems that a bag is a type of relation, even if not all relations are bags. But if top-level (non-nested) relations are bags, why isn't it legal to do:

A = load 'myfile';
B = A.$0;

The second statement would be legal nested inside a foreach, but is not legal at the top level. We have been aware of this discrepancy for a while and lived with it, but I believe it is time to resolve it. We've noticed that some parts of Pig assume an equivalence between bag and relation (e.g. the typechecker) and other parts do not (e.g. the syntax example above). This inconsistency is confusing to users and developers alike. As Pig Latin matures we need to strive to make it a logically coherent and complete language. So, thoughts on how it ought to be? The advantage I see in saying a relation is equivalent to a bag is simplicity of the language. There is no need to introduce another data type, and it allows full relational operations to occur both at the top level and nested inside foreach. But this simplicity also seems to me the downside. Are we decoupling the user so far from the underlying implementation that he will not be able to see the side effects of his actions? A top-level relation is presumably spread across many chunks, and any operation on it will require one or more map-reduce jobs, whereas a relation nested in a foreach is contained on one node. This also makes Pig much more complex, because while it may hide this level of detail from the user, it clearly has to understand the difference between top-level and nested operations and handle both cases. Alan.
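[Editor's note: for comparison, the projection that is illegal at the top level has a supported top-level spelling via FOREACH. A minimal sketch, not from the original thread; the input path is hypothetical:

A = load 'myfile';
-- B = A.$0;                  -- not legal at the top level today
B = foreach A generate $0;    -- the supported top-level projection
G = group A all;
C = foreach G {
    D = A.$0;                 -- the same expression is legal here, nested
    generate COUNT(D);
}]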
Re: Pig Team now has two new committers!
Congrats to both of you, an honor well earned. Alan. On Dec 9, 2008, at 8:51 AM, Olga Natkovich wrote: Hi, I am happy to announce that the Hadoop PMC voted to make Pradeep Kamath and Santhosh Srinivasan Pig committers to acknowledge their significant contributions to the project! Congratulations to Santhosh and Pradeep! Olga
Re: Pig performance
I left a comment on the blog addressing some of the issues he brought up. Alan. On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote: Hey Pig team, Did anyone check out the recent claims about Pig's poor performance versus Cascading? Though I haven't worked extensively with either system, I found the statements made fairly bold and am curious to hear more about their validity from the Pig development team: http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html . Thanks, Jeff
Re: Adaptive Query Optimization
There is no concept of costing in pig at this point. Currently we let the script writer decide when to choose an FR join over a symmetric hash join. We certainly welcome any work on an optimizer in pig. Be sure to take a look at https://issues.apache.org/jira/browse/PIG-360 where some work on an optimizer has already started. Alan. On Jan 16, 2009, at 10:51 AM, nitesh bhatia wrote: Hi I am working on adding adaptive behavior to the Pig execution model. Is there any pre-defined method to estimate execution time for Pig's simple join? I think for FRJoin some method will be required to estimate it. My idea is to design an adaptive query optimizer similar to that of the Glue-Nail deductive database system (http://portal.acm.org/citation.cfm?id=615194 ). --nitesh -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Re: switching to different parser in Pig
Of course, no two software tools are likely to do _exactly_ the same job. Someone already pointed you to ANTLR, which is probably the best-known alternative to JavaCC. Another possibility is SableCC. http://sablecc.org The criteria include stability, documentation, language of the parser generated, and abstract-syntax-tree building. When I last looked (a couple of years ago) at ANTLR, SableCC and JavaCC, I chose JavaCC for the following reasons:

1. ANTLR could not handle Unicode input. Things change, of course, so ANTLR might now be more Unicode-friendly. Unicode was important to me, so this was a big factor in my decision. On the plus side for ANTLR, it has better abstract-syntax-tree building capabilities (in my opinion) than JJTree/JavaCC. You can learn to use JJTree commands, but it's not easy for most people. And ANTLR can generate either a Java or a C++ parser. JavaCC generates only Java parsers. Another concern about ANTLR was that it was reputed to change a lot as the guru, Terence Parr, experimented with new syntax and functionality. JavaCC, at least at the time, was reputed to be more stable, perhaps stable to a fault. I wanted stability and reliability.

2. SableCC is much like JavaCC; it generates a Java parser from a grammar description; but it had, in my opinion, less flexible abstract-syntax-tree building than JJTree/JavaCC. In SableCC (when I looked at it), the AST it built was always a direct reflection of your grammar, generating one tree node for each grammar expansion involved in a parse, much like using JavaCC with Java Tree Builder (JTB http://www.cs.purdue.edu/jtb/). When using JavaCC, JTB is the alternative to using JJTree. Using SableCC, or the combination JavaCC/JTB, should be _very_ similar indeed. In my opinion, SableCC and JavaCC/JTB have made a conscious choice to simplify AST building--you get trees that reflect the expansions in your grammar. Period. But often these default trees will be big, full of extraneous nodes that reflect precedence hierarchies in the recursive-descent parsing. If you want to have more control over AST building, to get more compact and tailored ASTs, you need to pay the price of learning JJTree. Assuming that you need to build ASTs, with JavaCC you have the choice between JJTree and JTB. With SableCC, when I last looked at it, you only get the JTB-like option.

*** (Again, corrections and expansions would be much appreciated.) Ken Beesley

-- On Mon, Feb 23, 2009 at 10:06 PM, Alan Gates ga...@yahoo-inc.com wrote: We looked into antlr. It appears to be very similar to javacc, with the added feature that the java code it generates is humanly readable. That isn't why we want to switch off of javacc. Olga listed the 3 things we want out of a parser that javacc isn't giving us (lack of docs, no easy customization of error handling, decoupling of scanning and parsing). So antlr doesn't look viable. In response to Pi's suggestion that we could use the logical plan, I hope we could use something close to it. Whatever we choose we want it to be flexible enough to represent richer language constructs (like branch and loop). I'm not sure our current logical plan can do that. At the same time, we don't need another layer of translation (we already have
Fwd: Call For Paper ---- Grid and Cloud Middleware Workshop, in conjunction with GCC2009
Begin forwarded message: From: Yongqiang He heyongqi...@software.ict.ac.cn Date: March 1, 2009 10:18:03 PM PST To: core-u...@hadoop.apache.org, core-...@hadoop.apache.org, hbase-u...@hadoop.apache.org, hive-u...@hadoop.apache.org, hive-...@hadoop.apache.org Subject: Call for Paper - Grid and Cloud Middleware Workshop, in conjunction with GCC2009 Reply-To: hive-...@hadoop.apache.org

Call for Paper

Grid and cloud computing technologies both aim to aggregate distributed resources in local- or wide-area environments and to provide a uniform computing environment. The foundation of grid and cloud systems is the underlying middleware, which sustains a variety of applications through system-level abstractions and common functionality. Grid and cloud middleware brings together multiple related research issues: software architecture, naming, distributed data organization and storage, high performance data processing, task scheduling, and so on. This workshop is convened to promote the exchange of related information and to advance the research and development of grid and cloud middleware.

Topics include but are not limited to:
- Middleware architecture and implementation
- Virtualization, isolation and multi-tenant environments
- Distributed information organization
- Structured, semi-structured and unstructured data management and processing
- Map-Reduce and other novel programming models
- Languages, language extensions and tools for large scale computing
- Performance analysis/benchmarks
- Web based user interfaces
- Scheduling, security, monitoring, and accounting
- Applications and case studies

Important dates:
- Deadline for submission: April 15, 2009
- Notification of acceptance: May 15, 2009
- Delivery of camera-ready: June 5, 2009

For more, please visit: http://grid.lzu.edu.cn/gcc2009/item/item.jsp?id=5 -- Best regards! He Yongqiang Email: heyongqi...@software.ict.ac.cn Tel: 86-10-62600969(O) Fax: 86-10-626000900 Key Laboratory of Network Science and Technology / Research Center for Grid and Service Computing, Institute of Computing Technology, Chinese Academy of Sciences, No.3 Kexueyuan South Road, Beijing 100190, China
Re: scope string in OperatorKey
The purpose of the scope string is to allow us to have multiple sessions of pig running and to distinguish the operators between them. It's one of those things that was put in before an actual requirement, so whether it will prove useful or not remains to be seen. As for removing it from explain, is it still reasonably easy to distinguish operators without it? IIRC the OperatorKey includes an operator number. When looking at the explain plans this is useful for cases where there is more than one of a given type of operator and you want to be able to distinguish between them. Alan. On Mar 6, 2009, at 3:14 PM, Thejas Nair wrote: What is the purpose of the scope string in org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a pig daemon process? Is it ok to stop printing the scope part in explain output? It does not seem to add value and makes the output more verbose. Thanks, Thejas
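[Editor's note: for readers who want to see the keys in question, EXPLAIN from the grunt shell prints them. A minimal sketch, not from the original thread; the input path is hypothetical and the exact rendering of the key in the printout (e.g. scope-12) is illustrative:

A = load 'myfile';
B = filter A by $0 > 0;
explain B;
-- each operator in the printed logical/physical/MR plans carries its
-- OperatorKey, so two operators of the same type can be told apart]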
Re: [VOTE] Release Pig 1.0.0 (candidate 0)
README.txt still has the incubator text in it. This needs to be removed. I'll roll a new package and call a new vote. Alan. On Mar 17, 2009, at 3:21 PM, Olga Natkovich wrote: Pig Committers, I have created a candidate build for Pig 1.0.0. This release represents a major rewrite of Pig from the parser down. It also introduced a type system into Pig and greatly improved system performance. The rat report is attached. Note that there are many java files listed as being without a license header. All these files are generated by javacc. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download, test, and try it out: http://people.apache.org/~olga/pig-1.0.0-candidate-0 Should we release this? Vote closes on Friday, March 20th. Olga
Re: Ajax library for Pig
Sorry if these are silly questions, but I'm not very familiar with some of these technologies. So what you propose is that Pig would be installed on some dedicated server machine and a web server would be placed in front of it. Then client libraries would be developed that made calls to the web server. Would these client-side libraries include presentation in the browser, both for users submitting queries and for receiving results? Also, pig currently does not have a server mode, thus any web server would have to spin off threads that ran a pig job. If the above is what you're proposing, I think it would be great. Opening up pig to more users by making it browser accessible would be nice. Alan. On Apr 3, 2009, at 5:36 AM, nitesh bhatia wrote: Hi Since pig is getting a lot of usage in industry and universities, how about adding front-end support for Pig? The plan is to write a jquery/dojo type of general JavaScript/AJAX library which can be used over any server technology (php, jsp, asp, etc.) to call pig functions over the web. Direct Web Remoting (DWR - http://directwebremoting.org ), an open source project at Java.net, provides functionality that allows JavaScript in a browser to interact with Java on a server. Can we write a JavaScript library exclusively for Pig using DWR? I am not sure about licensing issues. The major advantages I can point to are: - Use of Pig over HTTP rather than SSH. - User management will become easy, as this can be handled using any CMS. --nitesh -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Pig release 0.2.0
The Pig team is happy to announce that Pig 0.2.0 has been released. This release includes the addition of a type system, better error detection and handling, and a 5x performance improvement over 0.1.1. The details of the release can be found at http://hadoop.apache.org/pig/releases.html . Pig is a Hadoop subproject which provides a high-level data-flow language and execution framework for parallel computation on Hadoop clusters. More details about Pig can be found at http://hadoop.apache.org/pig/ .
Re: Ajax library for Pig
Would you want to contribute this to the Pig project or release it separately? Either way, keep us posted on your progress. It sounds interesting. Alan. On Apr 9, 2009, at 9:28 PM, nitesh bhatia wrote: Hi Thanks for the reply. This will be the architecture: 1. Pig would be installed on some dedicated server machine (say P) with hadoop support. 2. In front of it will be a web server (say S). 2.1 The web server will include a dedicated tomcat server (say St) for handling dwr servlets. 2.2 PigScript.js, the proposed javascript library. 2.3 If the user is using some server other than tomcat for the presentation layer (say httpd for php or IIS for asp.net), that server (say Su) will sit in front of St. - Connections between Su and St will be made through PigScript.js. - Connections between St and P will be made through dwr. - To get results back from the server, this system will use Reverse-Ajax calls (i.e., async calls from server to browser, a built-in feature of DWR). DWR is under Apache License V2. --nitesh -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Re: [Pig Wiki] Update of HowToContribute by AlanGates
At this point these are all proposed; none are yet realized, so there is no code for any of them. The place to track these proposals is in the referenced JIRAs. Alan. On Apr 15, 2009, at 6:44 PM, zhang jianfeng wrote: Hi Alan, Thank you for your guideline. So where's the code for these ProposedProjects? Are they in a different branch or in the trunk? How can I track the progress of these ProposedProjects? Thank you. On Thu, Apr 16, 2009 at 7:17 AM, Apache Wiki wikidi...@apache.org wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/HowToContribute

--
 * [http://www.apache.org/dev/contributors.html Apache contributor documentation]
 * [http://www.apache.org/foundation/voting.html Apache voting documentation]
+ == Picking Something to Work On ==
+ Looking for a place to start? A great first place is to peruse the
+ [https://issues.apache.org/jira/browse/PIG JIRA] and find an issue that needs
+ to be resolved. If you're looking for a bigger project, try ProposedProjects. This
+ gives a list of projects the Pig team would like to see worked on.
+
Re: [Pig Wiki] Update of ProposedProjects by AlanGates
Your understanding of the proposal is correct. The goal would be to produce Java code rather than a pipeline configuration. But the reasoning is not so that users can then take that code and modify it themselves. There's nothing preventing them from doing so, but it has a couple of major drawbacks. 1) Code generators generally generate horrific-looking code, because they are going for speed and compactness, not human maintainability. Trying to work in that code would be very difficult. 2) If you start adding code to generated code, you can no longer use the original Pig Latin. You are from that point forward stuck in Java, since you can't backport your Java into the Pig Latin. The proposal is designed to test the performance of Pig based on generated Java (or for that matter any other language, it need not be Java). For the idea you suggest, the NATIVE keyword (proposed here https://issues.apache.org/jira/browse/PIG-506) is a better solution. Alan. On Apr 16, 2009, at 12:54 AM, nitesh bhatia wrote: Hi Can you briefly explain what is required in the first project? After reading the description my impression is: currently, when we are executing commands on the Pig shell, Pig first converts them to map-reduce jobs and then feeds them to hadoop. In this project are we proposing that the execution plan made by Pig will first be converted to a java map-reduce file and then fed to the hadoop cluster? If this is the case then I am sure it will be a great help to users, as this functionality can be used to write complicated map-reduce jobs very easily. Initially the user can write the Pig scripts / commands required for his job and get the map-reduce java files. Then he can edit the map-reduce files to extend the functionality and add extra procedures that are not provided by Pig but can be executed over hadoop. --nitesh On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki wikidi...@apache.org wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/ProposedProjects New page: = Proposed Pig Projects = This page describes projects that we (the committers) would like to see added to Pig. The scale of these projects varies, but they are larger projects, usually on the weeks or months scale. We have not yet filed [https://issues.apache.org/jira/browse/PIG JIRAs] for some of these because they are still in the vague idea stage. As they become more concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be filed for them. We welcome contributors to take on one of these projects. If you would like to do so, please file a JIRA (if one does not already exist for the project) with a proposed solution. Pig's committers will work with you from there to help refine your solution. Once a solution is agreed upon, you can begin implementation. If you see a project here that you would like to see Pig implement but you are not in a position to implement the solution right now, feel free to vote for the project. Add your name to the list of supporters. This will help contributors looking for a project to select one that will benefit many users. If you would like to propose a project for Pig, feel free to add to this list. If it is a smaller project, or something you plan to begin work on immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA] is a better route.
|| Category || Project || JIRA || Proposed By || Votes For ||
|| Execution || Pig currently executes scripts by building a pipeline of pre-built operators and running data through those operators in map reduce jobs. We need to investigate instead having Pig generate java code specific to a job, then compiling that code and using it to run the map reduce jobs. || || Many conference attendees || gates ||
|| Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed inside FOREACH. All operators should be allowed in FOREACH. (LIMIT is being worked on in [https://issues.apache.org/jira/browse/PIG-741 741].) || || gates || ||
|| Optimization || Speed up comparison of tuples during shuffle for ORDER BY. || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || ||
|| Optimization || Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join. || || gates || ||
|| Optimization || Often in a Pig script that produces a chain of MR jobs, the map phases of the 2nd and subsequent jobs do very little. What little they do should be pushed into the preceding reduce and the map replaced by the identity mapper. Initial tests showed that the identity mapper was 50% faster than using a Pig mapper (because Pig uses the loader to parse out tuples
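[Editor's note: to illustrate the FOREACH-nesting entry above, a minimal sketch of the nesting Pig already allows; input path and field names are hypothetical:

A = load 'clicks' as (site, url);
B = group A by site;
C = foreach B {
    -- only DISTINCT, ORDER BY, and FILTER may appear in this block today
    uniq = distinct A.url;
    generate group, COUNT(uniq);
};]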
Re: A proposal for changing pig's memory management
The claims in the paper I was interested in were not issues like non-blocking I/O etc. The claim that is of interest to pig is that a memory allocation and garbage collection scheme that is beyond the control of the programmer is a bad fit for a large data processing system. This is a fundamental design choice in Java, and it fits the vast majority of Java's uses well. But for systems like Pig there seems to be no choice but to work around Java's memory management. I'll clarify this point in the document. I took a closer look at NIO. My concern is that it does not give the level of control I want. NIO allows you to force a buffer to disk and request that a buffer be loaded, but you cannot force a page out of memory. It doesn't even guarantee that after you load a page it will really be loaded. One of the biggest issues in pig right now is that we run out of memory or get the garbage collector into a situation where it can't make sufficient progress. Perhaps switching to large buffers instead of having many individual objects will address this. But I'm concerned that if we cannot explicitly force data out of memory onto disk then we'll be back in the same boat of trusting the Java memory manager. Alan. On May 14, 2009, at 7:43 PM, Ted Dunning wrote: That Telegraph dataflow paper is pretty long in the tooth. Certainly several of their claims have little force any more (lack of non-blocking I/O, poor thread performance, no unmap, very expensive synchronization for uncontested locks). It is worth noting that they did all of their tests on the 1.3 JVM, and things have come an enormous way since then. Certainly, it is worth having opaque containers based on byte arrays, but isn't that pretty much what the NIO byte buffers are there to provide? Wouldn't a virtual tuple type that was nothing more than a byte buffer, type and an offset do almost all of what is proposed here? On Thu, May 14, 2009 at 5:33 PM, Alan Gates ga...@yahoo-inc.com wrote: http://wiki.apache.org/pig/PigMemory Alan.
Re: A proposal for changing pig's memory management
On May 19, 2009, at 10:30 PM, Mridul Muralidharan wrote: I am still not very convinced about the value of this implementation - particularly considering the advances made since 1.3 in memory allocators and garbage collection. My fundamental concern is not with the slowness of garbage collection. I am asserting (along with the paper) that garbage collection is not an optimal choice for a large data processing system. I don't want to improve the garbage collector; I want to manage a subset of the memory without it. The side effects of this proposal are many, and sometimes non-obvious: implicitly moving young generation data into the older generation, causing much more memory pressure for gc; fragmentation of memory blocks, causing quite a bit of memory pressure; replicating quite a bit of garbage collection functionality; the possibility of bugs with ref counting; etc. I don't understand your concerns regarding the load on the gc and memory fragmentation. Let's say I have 10,000 tuples, each with 10 fields. Let's also assume that these tuples live long enough to make it into the old memory pool, since this is the interesting case where objects live long enough to cause a problem. In the current implementation there will be 110,000 objects that the gc has to manage moving into the old pool, and check every time it cleans the old pool. In the proposed implementation there would be 10,001 objects (assuming all the data fit into one buffer) to manage. And rather than allocating 100,000 small pieces of memory, we would have allocated one large segment. My belief is that this would lighten the load on the gc. This does replicate some of the functionality of the garbage collector. Complex systems frequently need to re-implement foundational functionality in order to optimize it for their needs. Hence many RDBMS engines have their own implementations of memory management, file I/O, thread scheduling, etc. As for bugs in ref counting, I agree that forgetting to deallocate is one of the most pernicious problems of allowing programmers to do memory management. But in this case all that will happen is that a buffer will get left around that isn't needed. If the system needs more memory then that buffer will eventually get selected for flushing to disk, and then it will stay there, as no one will call it back into memory. So the cost of forgetting to deallocate is minor. If the assumption is that the current working set of bags/tuples does not need to be spilled, and anything else can be, then this will pretty much deteriorate to the current implementation in the worst case. That is not the assumption. There are two issues: 1) trying to spill bags only when we determine we need to is highly error prone, because we can't accurately determine when we need to and because we sometimes can't dump fast enough to survive; 2) current memory usage is far too high and needs to be reduced. A much simpler method to gain benefits would be to handle primitives as ... primitives, and not through the java wrapper classes for them. It should be possible to write schema-aware tuples which make use of the primitives specified to take a fraction of the memory required (4 bytes + a null_check boolean for an int, plus offset mapping, instead of the 24/32 bytes it currently is, etc.). In my observation, at least 50% of the data in pig is untyped, which means it's a byte array. Of the 50% that people declare or is determined by the program, probably 50-80% are chararrays and maps. So that means that somewhere around 25% of the data is numeric.
Shrinking that 25% by 75% will be nice, but not adequate. And it does nothing to help with the issue of being able to spill in a controlled way instead of only in emergency situations. Alan.
Re: UDF with parameters?
Yes, it is possible. The UDF should take the percentage you want as a constructor argument. It will have to be passed as a string and converted. Then in your Pig Latin, you will use the DEFINE statement to pass the argument to the constructor.

REGISTER /src/myfunc.jar;
DEFINE percentile myfunc.percentile('90');
A = LOAD 'students' as (name, gpa);
B = FOREACH A GENERATE percentile(gpa);

See http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html#DEFINE for more details. Alan. On May 22, 2009, at 3:37 PM, Brian Long wrote: Hi, I'm interested in developing a PERCENTILE UDF, e.g. for calculating a median, 99th percentile, 90th percentile, etc. I'd like the UDF to be parametric with respect to the percentile being requested, but I don't see any way to do that, and it seems like I might need to create PERCENTILE_50, PERCENTILE_90, etc. type UDFs explicitly, versus being able to do something like GENERATE PERCENTILE(90, duration). I'm new to Pig, so I might be missing the way to do this... is it possible? Thanks, Brian
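[Editor's note: since a percentile is computed over a collection of values rather than one row at a time, the call would typically sit after a grouping step. A minimal sketch, not from the original thread, assuming the hypothetical myfunc.percentile UDF above is written to aggregate over a bag:

REGISTER /src/myfunc.jar;
DEFINE p50 myfunc.percentile('50');
DEFINE p99 myfunc.percentile('99');
A = LOAD 'students' as (name, gpa);
B = GROUP A ALL;                               -- gather every row into one bag
C = FOREACH B GENERATE p50(A.gpa), p99(A.gpa); -- one percentile per constructor argument
DUMP C;]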
Proposed design for new merge join in pig
http://wiki.apache.org/pig/PigMergeJoin Alan.
Updated PigMix numbers for latest top of trunk
http://wiki.apache.org/pig/PigMix Alan.
Re: PigPen Source
It has not yet been integrated into contrib because it requires the eclipse libraries to build, and those weren't integrated. The ivy stuff used by pig's build should be configured to pick up the appropriate eclipse jars so that this can be added to contrib. Alan. On Jun 15, 2009, at 12:09 PM, Russell Jurney wrote: I want to play with PigPen, but although I can find the patches here: https://issues.apache.org/jira/browse/PIG-366 on the Jira, I cannot find the source in trunk/contrib/pigpen, or in any path in any branch. Where does the PigPen source reside? Does it exist only as a patch? Russell Jurney rjur...@cloudstenography.com
Re: Rewire and multi-query load/store optimization
+1 on option one. The use of store-load was only to overcome a temporary problem in Pig. We've fixed the problem, so let's not propagate it. We will need to document this very clearly (maybe even to the point of issuing warnings in the parser when we see this combo) so users understand that this is now a hindrance rather than a help. Alan. On Jun 12, 2009, at 2:19 PM, Santhosh Srinivasan wrote: With the implementation of rewire as part of the optimizer infrastructure, a bug was exposed in the load/store optimization in the multi-query feature. Below, I will articulate the bug and the ramifications of a few possible solutions.

Load/store optimization in the multi-query feature
---
If a script has an explicit store and a corresponding load which loads the output of the store, the store-load combination can be optimized. An example will illustrate the concept. Pre-conditions: 1. The store location and the load location should match. 2. The store format and the load format should be compatible.

{code}
A = load 'input';
B = group A by $0;
store B into 'output';
C = load 'output';
D = group C by $0;
store D into 'some_other_output';
{code}

In the script above, the output of the first store serves as the input of the second load (C). In addition, the store and load use PigStorage() as the store/load mechanism. In the logical plan, this combination is optimized by splitting B into the store and D.

Bug
---
When the load in the store/load combination was removed, the inner plans of the load's successors (in this case D) were not updated correctly. As a result, the projections in the inner plans still held references to non-existent operators.

Consequence of the bug fix
---
During the map-reduce (M/R) compilation the split operator is compiled into a store and a load. Prior to multi-query, each M/R boundary resulted in a temporary store using BinStorage. The subsequent load could infer the type, as BinStorage returns typed records, i.e., non-bytearray records. With multi-query and the load/store optimization, the temporary BinStorage data is not generated. Instead, the subsequent load uses the output of the previous store as its input. Here, the load can produce typed or untyped records depending on the loader. As a result, the operators in the map phase that rely on the type information (inferred from the logical plan) will fail due to type mismatch.

Possible Solutions
---
Solution 1
==
Switch off the load/store optimization. Users were primarily storing intermediate data within the same script to overcome Pig's limitation, i.e., the absence of the multi-query feature. Going forward, with multi-query turned on, users who store intermediate data will not enjoy all the benefits of the optimization.

Solution 2
==
After the M/R compilation is completed, during the final pass of the plan, fix the types of the projections to reflect typed/untyped data. In other words, if the loader is returning typed data then retain the types, else change the types to bytearray. In order to make this decision, loaders should support an interface to indicate if the records are typed or untyped.

Thanks, Santhosh
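[Editor's note: to make the consequence of option one concrete for script writers, the example script above can simply keep using B instead of reloading its stored output; with multi-query, both stores then share B's computation in a single run. A minimal sketch, not from the original thread:

A = load 'input';
B = group A by $0;
store B into 'output';
D = group B by $0;   -- reuse B directly rather than C = load 'output';
store D into 'some_other_output';]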
Re: [VOTE] Release Pig 0.3.0 (candidate 0)
Downloaded, ran, ran the tutorial, built piggybank. All looks good. +1 Alan. On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote: Hi, I created a candidate build for the Pig 0.3.0 release. The main feature of this release is support for multiquery, which allows sharing computation across multiple queries within the same script. We see significant performance improvements (up to an order of magnitude) as the result of this optimization. I ran the rat report and made sure that all the source files contain proper headers. (Not attaching the report since it caused trouble with the last release.) Keys used to sign the release candidate are at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS. Please download and try the release candidate: http://people.apache.org/~olga/pig-0.3.0-candidate-0/. Please vote by Wednesday, June 24th. Olga
Re: asking for comments on benchmark queries
Zheng, I don't think you're subscribed to pig-dev (your emails have been bouncing to the moderator), so I've cc'd you explicitly on this. I don't think we need a Pig JIRA; it's probably easier if we all work on the hive one. I'll post my comments on the various scripts to that bug. I've also attached them here since pig-dev won't see the updates to that bug. Alan.

grep_select.pig: Adding types in the LOAD statement will force Pig to cast the key field, even though it doesn't need to (it only reads and writes the key field). So I'd change the query to be:

rmf output/PIG_bench/grep_select;
a = load '/data/grep/*' using PigStorage() as (key, field);
b = filter a by field matches '.*XYZ.*';
store b into 'output/PIG_bench/grep_select';

field will still be cast to a chararray for the matches, but we won't waste time casting key and then turning it back into bytes for the store.

rankings_select.pig: Same comment, remove the casts. pagerank will be properly cast to an integer.

rmf output/PIG_bench/rankings_select;
a = load '/data/rankings/*' using PigStorage('|') as (pagerank, pageurl, aveduration);
b = filter a by pagerank > 10;
store b into 'output/PIG_bench/rankings_select';

rankings_uservisits_join.pig: Here you want to keep the cast of pagerank so that it is handled as the right type. adRevenue will default to double in SUM when you don't specify a type. You also want to project out all unneeded columns as soon as possible. You should set PARALLEL on the join to use the number of reducers appropriate for your cluster. Given that you have 10 machines and 5 reduce slots per machine, and speculative execution is off, you probably want 50 reducers. I notice you set parallel to 60 on the group by. That will give you 10 trailing reducers. Unless you have a need for the result to be split 60 ways you should reduce that to 50 as well. (I'm assuming here when you say you have a 10 node cluster you mean 10 data nodes, not counting your name node and task tracker. The reduce formula should be 5 * number of data nodes.) A last question is how large are the uservisits and rankings data sets? If either is 80M or so you can use the fragment/replicate join, which is much faster than the general join. The following script assumes that isn't the case; but if it is, let me know and I can show you the syntax for it. So the end query looks like:

rmf output/PIG_bench/html_join;
a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP, destURL, visitDate, adRevenue, userAgent, countryCode, languageCode, searchWord, duration);
b = load '/data/rankings/*' using PigStorage('|') as (pagerank:int, pageurl, aveduration);
c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
c1 = foreach c generate sourceIP, destURL, adRevenue;
b1 = foreach b generate pagerank, pageurl;
d = JOIN c1 by destURL, b1 by pageurl parallel 50;
d1 = foreach d generate sourceIP, pagerank, adRevenue;
e = group d1 by sourceIP parallel 50;
f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
store f into 'output/PIG_bench/html_join';

uservisits_aggre.pig: Same comments as above on projecting out as early as possible and on setting parallel appropriately for your cluster.

rmf output/PIG_bench/uservisits_aggre;
a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP, destURL, visitDate, adRevenue, userAgent, countryCode, languageCode, searchWord, duration);
a1 = foreach a generate sourceIP, adRevenue;
b = group a1 by sourceIP parallel 50;
c = FOREACH b GENERATE group, SUM(a1.adRevenue);
store c into 'output/PIG_bench/uservisits_aggre';

On Jun 22, 2009, at 10:36 PM, Zheng Shao wrote: Hi Pig team, We'd like to get your feedback on a set of queries we implemented on Pig. We've attached the hadoop configuration and pig queries in the email. We start the queries by issuing "pig xxx.pig". The queries are from the SIGMOD 2009 paper. More details are at https://issues.apache.org/jira/browse/HIVE-396 (Shall we open a JIRA on PIG for this?) One improvement is that we are going to change hadoop to use LZO as the intermediate compression algorithm very soon. Previously we used gzip for all performance tests including hadoop, hive and pig. The reason that we specify the number of reducers in the query is to try to match the number of reducers Hive automatically suggested. Please let us know the best way to set the number of reducers in Pig. Are there any other improvements we can make to the Pig queries and the hadoop configuration? Thanks, Zheng hadoop-site.xml hive-default.xml hadoop-env.sh.txt
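[Editor's note: for reference, the fragment/replicate join Alan offers to show has dedicated syntax in Pig. A minimal sketch, not from the original thread, assuming the rankings input (b1) is the small (~80M) data set; the last-listed input is the one replicated into memory on every map task, and the exact quoting of the keyword varies slightly across early releases:

-- same aliases as in rankings_uservisits_join.pig above
d = JOIN c1 by destURL, b1 by pageurl USING 'replicated';
d1 = foreach d generate sourceIP, pagerank, adRevenue;]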
Re: requirements for Pig 1.0?
Integration with Owl is something we want for 1.0. I am hopeful that by Pig's 1.0 Owl will have flown the coop and become either a subproject or found a home in Hadoop's common, since it will hopefully be used by multiple other subprojects. Alan. On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote: For 1.0 - complete Owl? http://wiki.apache.org/pig/Metadata Russell Jurney rjur...@cloudstenography.com On Jun 23, 2009, at 4:40 PM, Alan Gates wrote: I don't believe there's a solid list of want to haves for 1.0. The big issue I see is that there are too many interfaces that are still shifting, such as: 1) Data input/output formats. The way we do slicing (that is, user provided InputFormats) and the equivalent outputs aren't yet solid. They are still too tied to load and store functions. We need to break those out and understand how they will be expressed in the language. Related to this is the semantics of how Pig interacts with non-file based inputs and outputs. We have a suggestion of moving to URLs, but we haven't finished test driving this to see if it will really be what we want. 2) The memory model. While technically the choices we make on how to represent things in memory are internal, the reality is that these changes may affect the way we read and write tuples and bags, which in turn may affect our load, store, eval, and filter functions. 3) SQL. We're working on introducing SQL soon, and it will take it a few releases to be fully baked. 4) Much better error messages. In 0.2 our error messages made a leap forward, but before we can claim to be 1.0 I think they need to make 2 more leaps: 1) they need to be written in a way end users can understand them instead of in a way engineers can understand them, including having sufficient error documentation with suggested courses of action, etc.; 2) they need to be much better at tying errors back to where they happened in the script, right now if one of the MR jobs associated with a Pig Latin script fails there is no way to know what part of the script it is associated with. There are probably others, but those are the ones I can think of off the top of my head. The summary from my viewpoint is we still have several 0.x releases before we're ready to consider 1.0. It would be nice to be 1.0 not too long after Hadoop is, which still gives us at least 6-9 months. Alan. On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote: I know there was some discussion of making the types release (0.2) a Pig 1 release, but that got nixed. There wasn't a similar discussion on 0.3. Has the list of want-to-haves for Pig 1.0 been discussed since?
Re: requirements for Pig 1.0?
To be clear, going to 1.0 is not about having a certain set of features. It is about stability and usability. When a project declares itself 1.0 it is making some guarantees regarding the stability of its interfaces (in Pig's case this is Pig Latin, UDFs, and command line usage). It is also declaring itself ready for the world at large, not just the brave and the free. New features can come in as experimental once we're 1.0, but the semantics of the language and UDFs can't be shifting (as we've done the last several releases and will continue to do for a bit I think). With that in mind, further comments inlined. On Jun 24, 2009, at 10:18 AM, Dmitriy Ryaboy wrote: Alan, any thoughts on performance baselines and benchmarks? Meaning do we need to reach a certain speed before 1.0? I don't think so. Pig is fast enough now that many people find it useful. We want to continue working to shrink the gap between Pig and MR, but I don't see this as a blocker for 1.0. I am a little surprised that you think SQL is a requirement for 1.0, since it's essentially an overlay, not core functionality. If we were debating today whether to go 1.0, I agree that we would not wait for SQL. But given that we aren't (at least I wouldn't vote for it now) and that SQL will be in soon, it will need to stabilize. What about the storage layer rewrite (or is that what you referred to with your first bullet-point)? To be clear, Zebra (the columnar store stuff) is not a rewrite of the storage layer. It is an additional storage option we want to support. We aren't changing current support for load and store. Also, the subject of making more (or all) operators nestable within a foreach comes up now and then.. would you consider this important for 1.0, or something that can wait? This would be an added feature, not a semantic change in Pig Latin. Integration with other languages (a la PyPig)? Again, this is a new feature, not a stability issue. The Roadmap on the Wiki is still as of Q3 2007, which makes it hard for an outside contributor to know where to jump in :-). Agreed. Olga has given me the task of updating this soon. I'm going to try to get to that over the next couple of weeks. This discussion will certainly provide input to that update. Alan.
Re: Is it a bug ?
It looks wrong to me, but I don't have a deep understanding of that code. Alan. On Jul 15, 2009, at 6:03 PM, zhang jianfeng wrote: Hi all, Today, when I read the source code, I found a piece of suspicious code (PigServer.java line 1047):

graph.ignoreNumStores = processedStores; // I think here should be: graph.ignoreNumStores = ignoreNumStores
graph.processedStores = processedStores;
graph.fileNameMap = fileNameMap;

I think this may be a typing mistake. Can anyone confirm it? Thank you. Jeff Zhang
Re: Pig 0.4.0 release
On Aug 18, 2009, at 10:05 AM, Dmitriy Ryaboy wrote: I am about to submit a cleaned up patch for 924. It works fine as a static patch (in fact I can attach it to 660 as well) -- compiling with -Dhadoop.version=XX works as proposed for the static shims. It does the necessary prep for the code to be able to switch based on what's in its classpath, but it does not require unbundling to work statically. Ok, we'll take a look. The hadoop20 jar attached to the zebra ticket is built in a different way than 18 and 19; it does not report its version (18 and 19 do). Right now I get around it by hard-coding a special case (Unknown = 20), but that's obviously suboptimal. Could someone rebuild hadoop20.jar the way Pig wants it, and with the proper version identification? If that happens, 924/660 can go in together with hadoop20.jar and users will at least be able to build against a static version of hadoop without requiring a patch. The hadoop 0.20 jar submitted with Zebra is not a standard jar. It has extra tfile functionality that was not in 0.20, but will be in 0.20.1. It isn't something we should publish. If we put a hadoop20.jar into pig's lib, it should be from 0.20 (or when available, 0.20.1). Alan. -Dmitriy On Tue, Aug 18, 2009 at 9:56 AM, Alan Gates ga...@yahoo-inc.com wrote: Non-committers certainly get a vote, it just isn't binding. I agree on PIG-925 as a blocker. I don't see PIG-859 as a blocker since there is a simple work around. If we want to release 0.4.0 within a week or so, dynamic shims won't be an option because we won't be able to solve the bundled hadoop lib problem in that amount of time. I agree that we are not making life easy enough for users who want to build with hadoop 0.20. Based on comments on the JIRA, I'm not sure the patch for the static shims is ready. What if instead we checked in a version of hadoop20.jar that will work for users who want to build with 0.20. This way users can still build this if they want and our release isn't blocked on the patch. Alan. On Aug 17, 2009, at 12:03 PM, Dmitriy Ryaboy wrote: Olga, Do non-commiters get a vote? Zebra is in trunk, but relies on 0.20, which is somewhat inconsistent even if it's in contrib/ Would love to see dynamic (or at least static) shims incorporated into the 0.4 release (see PIG-660, PIG-924) There are a couple of bugs still outstanding that I think would need to get fixed before a release: https://issues.apache.org/jira/browse/PIG-859 https://issues.apache.org/jira/browse/PIG-925 I think all of these can be solved within a week; assuming we are talking about a release after these go into trunk, +1. -D On Mon, Aug 17, 2009 at 11:46 AM, Olga Natkovich ol...@yahoo-inc.com wrote: Pig Developers, We have made several significant performance and other improvements over the last couple of months: (1) Added an optimizer with several rules (2) Introduced skew and merge joins (3) Cleaned COUNT and AVG semantics I think it is time for another release to make this functionality available to users. I propose that Pig 0.4.0 is released against Hadoop 18 since most users are still using this version. Once Hadoop 20.1 is released, we will roll Pig 0.5.0 based on Hadoop 20. Please, vote on the proposal by Thursday. Olga
Re: questions about integration of pig and HBase
See the JIRA PIG-6. See also the HbaseStorage unit test that tests the functionality. Alan. On Sep 9, 2009, at 5:31 AM, Vincent BARAT wrote: Thank you for the link. Anyway, what I was looking for is an example of Pig syntax for loading from an HBase table; is it something like: queries = LOAD 'HBaseTable' USING HBaseStorage(); ? Jeff Zhang wrote: Use HBaseStorage as your loadFunc; it uses a custom slicer, HBaseSlice. You can refer to this link for more information: http://hadoop.apache.org/pig/docs/r0.3.0/udf.html#Custom+Slicer 2009/9/9 Vincent BARAT vincent.ba...@ubikod.com Alan Gates wrote: Pig supports reading from Hbase (in Hadoop/Hbase 0.18 only). Hello, Do you have any link to the documentation about how to do that? I can't find any example... Thanks,
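[Editor's note: a minimal sketch of what such a load looks like, not from the original thread; the table name webcrawl and column names are hypothetical, and the exact argument format and any table-name prefix are version-dependent — the unit test Alan mentions is the authoritative example:

raw = LOAD 'webcrawl' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('content:url content:size') AS (url, size);
-- from here raw behaves like any other relation
queries = FILTER raw BY url matches '.*search.*';]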
Re: Request for feedback: cost-based optimizer
This is a good start at adding a cost-based optimizer to Pig. I have a number of comments: 1) Your argument for putting it in the physical layer rather than the logical is that the logical layer does not know physical statistics. This need not be true. You suggest adding a getStatistics call to the loader to give statistics. The logical layer can make this call and make decisions based on the results without understanding the underlying physical layer. It seems that the real reason you want to put the optimizer in the physical layer is that, rather than trying to do predictive statistics (such as guessing that a join will result in a 2x data explosion), you want to see the results of actual MR jobs and then make decisions. This seems like a reasonable choice for a couple of reasons: a) statistical guesses are hard to get right, and Pig has limited statistics to begin with; b) since Pig Latin scripts can be arbitrarily long, bad guesses at the beginning will have a worse ripple effect than bad guesses in a SQL optimizer. 2) The changes you propose in PigServer are quite complex. Would it be possible instead to put the changes in MapReduceLauncher? It could run the first MR job in a Pig Latin script, look at the results, and then rerun your CBO on the remaining physical plan, re-translate this to a new MR plan, and resubmit. This would require annotations to the MR plan to indicate where in a physical plan the MR boundaries fall, so that correct portions of the original physical plan could be used for reoptimization and recompilation. But it would contain the complexity of your changes in MapReduceLauncher instead of scattering them through the entire system. 3) On adding getStatistics: I am currently working on a proposal to make a number of changes to the load interface, including getStatistics. I hope to publish that proposal by next week. Similarly, I am working on a proposal for how Pig will interact with metadata systems (such as Owl), which I also hope to propose next week. We will be actively working in these areas because we need them for our SQL implementation. So, one, you'll get a lot of this for free; two, we should stay connected on these things so what we implement works for what you need. Alan. On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote: Whoops :-) Here's the Google doc: http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdAhl=en -Dmitriy On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan s...@yahoo-inc.com wrote: Dmitriy and Gang, The mailing list does not allow attachments. Can you post it on a website and just send the URL? Thanks, Santhosh -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Tuesday, September 01, 2009 9:48 AM To: pig-dev@hadoop.apache.org Subject: Request for feedback: cost-based optimizer Hi everyone, Attached is a (very) preliminary document outlining a rough design we are proposing for a cost-based optimizer for Pig. This is being done as a capstone project by three CMU Master's students (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not necessarily meant for immediate incorporation into the Pig codebase, although it would be nice if it, or parts of it, are found to be useful in the mainline. We would love to get some feedback from the developer community regarding the ideas expressed in the document, any concerns about the design, suggestions for improvement, etc. Thanks, Dmitriy, Ashutosh, Tejal
Re: [VOTE] Release Pig 0.4.0 (candidate 0)
When I run this against a Hadoop 0.18.3 instance I can do DFS operations, but MR operations fail with:

Error message from job controller - java.lang.AbstractMethodError: org.apache.xerces.dom.DocumentImpl.getXmlStandalone()Z
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.setDocumentInfo(DOM2TO.java:373)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:127)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:94)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:662)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:708)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:313)
        at org.apache.hadoop.conf.Configuration.write(Configuration.java:994)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:780)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)

Pig Stack Trace
---------------
ERROR 6015: During execution, encountered a Hadoop error.

org.apache.pig.backend.executionengine.ExecException: ERROR 6015: During execution, encountered a Hadoop error.
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.setDocumentInfo(DOM2TO.java:373)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:127)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:94)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:662)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:708)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:313)
        at org.apache.hadoop.conf.Configuration.write(Configuration.java:994)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:780)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
Caused by: java.lang.AbstractMethodError: org.apache.xerces.dom.DocumentImpl.getXmlStandalone()Z
        ... 11 more
================================================================

This doesn't look good. Alan.

On Sep 14, 2009, at 2:05 PM, Olga Natkovich wrote: Hi, I created a candidate build for the Pig 0.4.0 release. The highlights of this release are:
- Performance improvements, especially in the area of JOIN support, where we introduced two new join types: skew join to deal with data skew and sort merge join to take advantage of sorted data sets.
- Support for outer join.
- Works with Hadoop 18.
I ran the release audit and the rat report looked fine. The relevant part is attached below. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download the release and try it out: http://people.apache.org/~olga/pig-0.4.0-candidate-0. Should we release this? Vote closes on Thursday, 9/17. Olga

[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/zebra/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/broken-links.xml
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/cookbook.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/index.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/linkmap.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_reference.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_users.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/setup.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/tutorial.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/udf.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/api/package-list
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/missingSinces.txt
[java] !?
Re: [VOTE] Release Pig 0.4.0 (candidate 0)
When I run it as:

java -cp ./pig.jar:/home/y/conf/pig/piglet/released -Dhod.server= org.apache.pig.Main /d1/pig_harness/out/pigtest/gates/gates.1253134669/Checkin_2.pig

it works. When I run it as:

JAVA_HOME=/usr PIG_CONF_DIR=/home/y/conf/pig/piglet/released/ bin/pig ~/pig/scripts/Checkin_2.pig

it fails with the stack trace given earlier. Alan.

On Sep 16, 2009, at 12:46 PM, Olga Natkovich wrote: Alan, I tried the jar packaged in the release and I am able to successfully run tests. Could you give it another try? Thanks, Olga

-----Original Message----- From: Alan Gates [mailto:ga...@yahoo-inc.com] Sent: Wednesday, September 16, 2009 9:53 AM To: pig-dev@hadoop.apache.org Cc: priv...@hadoop.apache.org Subject: Re: [VOTE] Release Pig 0.4.0 (candidate 0)

When I run this against a Hadoop 0.18.3 instance I can do DFS operations, but MR operations fail with: [stack trace, candidate-0 announcement, and rat report quoted in full in the previous message]
Re: [VOTE] Release Pig 0.4.0 (candidate 1)
Now the code won't build because there's no Hadoop jar in the lib directory. Alan.

On Sep 17, 2009, at 12:09 PM, Olga Natkovich wrote: Hi, I have fixed the issue causing the failure that Alan reported. Please test the new release: http://people.apache.org/~olga/pig-0.4.0-candidate-1/. Vote closes on Tuesday, 9/22. Olga

-----Original Message----- From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Monday, September 14, 2009 2:06 PM To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org Subject: [VOTE] Release Pig 0.4.0 (candidate 0) [candidate-0 announcement and rat report quoted in full in the earlier messages]
Re: Revisit Pig Philosophy?
I agree with Milind that we should move to saying that Pig Latin is a data flow language independent of any particular platform, while the current implementation of Pig is tied to Hadoop. I'm not sure how thin that implementation will be, but I'm in favor of making it thin where possible (such as the recent proposal to shift LoadFunc to directly use InputFormat). I also strongly agree that we need to be more precise in our terminology between Pig (the platform) and Pig Latin (the language), especially as we're working on making Pig bilingual (with the addition of SQL). I am fine with saying that Pig SQL adheres as much as possible (given the underlying systems, etc.) to ANSI SQL semantics. And where there is shared functionality, such as UDFs, we again adhere to SQL semantics when it does not conflict with other Pig goals. So COUNT and SUM should handle nulls the way SQL does, for example. But we need to craft the statement carefully. To see why, consider Pig's data model. We would like our types to map nicely into SQL types, so that if Pig SQL users declare a column to be of type VARCHAR(32) or FLOAT(10) we can map those onto some Pig type. But we don't want to use SQL types directly inside Pig, as they aren't a good match for much of Pig processing. So any statement about using SQL semantics needs caveats.

I would also vote for modifying our Pigs Live Anywhere dictum to be: Pig Latin is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. The initial implementation of Pig is on Hadoop and seeks to leverage the power of Hadoop wherever possible. However, nothing Hadoop specific should be exposed in Pig Latin. We may also want to add a vocabulary section to the philosophy statement to clarify between Pig and Pig Latin. Alan.

On Sep 18, 2009, at 8:01 PM, Milind A Bhandarkar wrote: It's Friday evening, so I have some time to discuss philosophy ;-) Before we discuss any question about revisiting pig philosophy, the first question that needs to be answered is: what is pig? (This corresponds to the Hindu philosophy's basic argument that any deep personal philosophical investigation needs to start with the question koham? (in Sanskrit, it means 'who am I?')) So, coming back approximately 4000 years after the origin of that philosophy, we need to ask: what is pig? (Incidentally, pig, or varaaha in Sanskrit, was the second incarnation of lord Vishnu in hindu scriptures, but that's not relevant here.) What we need to decide is: is pig a dataflow language? I think not. Pig Latin is the language. Pig is referred to in countless slide decks (aka pig scriptures; btw, I own 50% of these scriptures) as a runtime system that interprets Pig Latin, kind of like java and the jvm. (Duality of nature, called dwaita philosophy in Sanskrit, is applicable here. But I won't go deeper than that.) So pig-Latin-the-language's stance could still be that it could be implemented on any runtime. But pig-the-runtime's philosophy could be that it is a thin layer on top of hadoop. And all the world could breathe a sigh of relief (mostly by not having to answer these philosophical questions). So, 'koham' is the 4000 year old question this project needs to answer. That's all. AUM... (it's Friday.) - (swami) Milind ;-)

On Sep 18, 2009, at 19:05, Jeff Hammerbacher ham...@cloudera.com wrote: Hey, 2. Local mode and other parallel frameworks

<snip> Pigs Live Anywhere: Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on hadoop, but we do not intend that to be only on hadoop. </snip>

Are we still holding onto this? What about local mode? Local mode is not being treated on equal footing with that of Hadoop for practical reasons. However, users expect things that work in local mode to work without any hitches on Hadoop. Are we still designing the system assuming that Pig will be stacked on top of other parallel frameworks? FWIW, I appreciate this philosophical stance from Pig. Allowing locally tested scripts to be migrated to the cluster without breakage is a noble goal, and keeping the option of (one day) developing an alternative execution environment for Pig that runs over HDFS but uses a richer physical set of operators than MapReduce would be great. Of course, those of you who are running Pig in production will have a much better sense of the feasibility, rather than desirability, of this philosophical stance. Later, Jeff
Re: [VOTE] Release Pig 0.4.0 (candidate 2)
private is the PMC list. Releases need PMC votes, hence we send to private. Alan.

On Sep 21, 2009, at 7:46 PM, Milind A Bhandarkar wrote: Unrelated to the message content: why is there a priv...@hadoop.apache.org on the cc here? Is this even a valid alias? An open source project needs to conduct its discussions in public, so even an email address named private makes me very nervous about the development process. - Milind

On Sep 21, 2009, at 18:56, Olga Natkovich ol...@yahoo-inc.com wrote: Hi, The new version is available at http://people.apache.org/~olga/pig-0.4.0-candidate-2/. I see one failure in a unit test in piggybank (contrib), but it is not related to the functions themselves; it seems to be an issue with MiniCluster and I don't feel we need to chase this down. I made sure that the same test runs ok with Hadoop 20. Please vote by end of day on Thursday, 9/24. Olga

-----Original Message----- From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Thursday, September 17, 2009 12:09 PM To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org Subject: [VOTE] Release Pig 0.4.0 (candidate 1) [earlier candidate announcements and rat report quoted in full in the messages above]
Re: High(er) res Pig logo?
I have a couple of higher resolution pigs in overalls and a pig on the Hadoop elephant. I've checked them into src/docs/src/documentation/resources/images/ so all can use them. Also, we're working on cleaning up the Pig with Y! logo issue. Alan. On Sep 27, 2009, at 9:59 AM, Dmitriy Ryaboy wrote: Where can one find the Pig logo in a size/resolution suitable for presentations? Also, I went on the website and noticed that the Y! reappeared on Pig's chest. -D
Re: LocalRearrange out of bounds exception - tips for debugging?
Have you checked that each record in your input data has at least the number of fields you specify? Have you checked that the field separator in your data matches the default for PigPerformanceLoader (^A, I think)? Alan.

On Oct 13, 2009, at 10:28 AM, Dmitriy Ryaboy wrote: We ran into what looks like an edge-case bug in Pig, which causes it to throw an IndexOutOfBoundsException (stack trace below). The script just joins two relations; it looks like our data was generated incorrectly and the join is empty, which may be what's causing the failure. It also appears to only happen when at least one of the inputs is on the large side (at least a few hundred megs). Any ideas on what could be happening and how to zoom in on the underlying cause? We are running off unmodified trunk.

Script:

register datagen.jar;
E = load 'Employee' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (id, name, cc, dc);
D = load 'Department' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (dept_id, dept_nm);
P = load 'Project' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (id, emp_id, role);
R1 = JOIN E by dc, D by dept_id;
R2 = JOIN R1 by E::id, P by emp_id;
store R2 into 'TestCase2Output';

The R2 join fails with the stack trace below. It also fails if we pre-calculate R1, store it, and load it directly (so: load R1, load P, join R1 by $0, P by emp_id). We've verified that the records in R1 and R2 have the expected fields, etc.

Stack Trace:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:148)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:226)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
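Alan's first check is easy to automate outside of Pig. Here is a minimal standalone Java sketch (not part of Pig; the class name is made up, and it assumes PigPerformanceLoader's default Ctrl-A field separator) that flags records with too few fields:

import java.io.BufferedReader;
import java.io.FileReader;

// Scans a local copy of the input and flags records that have fewer
// ^A-separated fields than expected.
// Usage: java FieldCountCheck <file> <expectedFields>
public class FieldCountCheck {
    public static void main(String[] args) throws Exception {
        int expected = Integer.parseInt(args[1]);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        long lineNo = 0;
        while ((line = in.readLine()) != null) {
            lineNo++;
            // Split on Ctrl-A (\u0001); limit -1 keeps trailing empty fields.
            int fields = line.split("\u0001", -1).length;
            if (fields < expected) {
                System.out.println("line " + lineNo + ": only " + fields + " fields");
            }
        }
        in.close();
    }
}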
Hudson testing of patches
We've had many questions on this, so I'm sending this to everyone on the dev list in hopes of clarifying the situation. Our Hudson setup for testing patches is falsely returning failures on all or most unit tests for all patches. So if you submit a patch and all the unit tests fail, don't worry. We are working on getting Hudson fixed. We committers are working through the patch queue manually, running the unit tests ourselves. As we don't work all night like Hudson and each run of the unit tests takes about 3 hours, this is going slowly. But please know we will get to your patches, even if it takes us a day or two. Alan.
Re: [VOTE] Release Pig 0.5.0 (candidate 0)
+1. On my laptop (Mac) I ran the tutorial in both local and hadoop modes, ran a join/group/sort/limit script in both local and hadoop modes, and did a build of Pig and contrib. On a Linux box I did a build of both Pig and contrib, and ran a join/group/sort/limit script in both local and hadoop modes. Alan.

On Oct 25, 2009, at 1:17 PM, Olga Natkovich wrote: Hi, I created a candidate build for Pig 0.5.0 release. It contains the same functionality as Pig 0.4.0 except it works with Hadoop 20.x releases. I ran the release audit and the rat report looked fine. The relevant part is attached below. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download the release and try it out: http://people.apache.org/~olga/pig-0.5.0-candidate-0. Should we release this? Vote closes on Thursday, 10/29. Olga

[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/src/org/apache/pig/StoreConfig.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/TestDataBagAccess.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/TestNullConstant.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/TestSchemaUtil.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParser.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParserConstants.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParserTokenManager.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParserTreeConstants.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/JJTDOTParserState.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/ParseException.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/SimpleCharStream.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/Token.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/TokenMgrError.java
Re: LoadFunc.skipNext() function for faster sampling ?
We definitely want to avoid parsing every tuple when sampling. But do we need to implement a special function for it? Pig will have access to the InputFormat instance, correct? Can it not call next on the underlying record reader the desired number of times (which will not parse the tuple) and then call LoadFunc.getNext to get the next parsed tuple? Alan.

On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote: In the new implementation of the SampleLoader subclasses (used by order-by, skew-join, ...) as part of the loader redesign, we are not only reading all the input records but also parsing them as Pig tuples. This is because the SampleLoaders are wrappers around the actual input loaders specified in the query. We can make things much faster by having a skipNext() function (or skipNext(int numSkip)) which will avoid parsing the record into a Pig tuple. LoadFunc could optionally implement this (easy to implement) function (which would be part of an interface) to improve the speed of queries such as order-by. -Thejas
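To make the proposal concrete, here is a minimal sketch of what such an optional interface might look like. The names (SkippableLoadFunc, skipNext) are illustrative only, not an agreed-on API:

import java.io.IOException;

// Hypothetical optional interface a LoadFunc implementation could add
// so that samplers can advance past records without materializing them
// as Pig tuples.
public interface SkippableLoadFunc {
    /**
     * Skip the next numSkip records without constructing tuples.
     * Returns the number of records actually skipped, which may be less
     * than numSkip if the end of the input is reached.
     */
    int skipNext(int numSkip) throws IOException;
}

A SampleLoader wrapper could then test its wrapped loader with instanceof, call skipNext(k - 1) between samples when the interface is present, and fall back to plain getNext() calls (discarding the parsed tuples) when it is not.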
Re: [VOTE] Branch for Pig 0.6.0 release
+1. In addition to the new features we've added, our change to use Hadoop's LineRecordReader brought Pig to parity with Hadoop in the PigMix tests, about a 30% average performance improvement. This should be huge for our users. Alan. On Nov 9, 2009, at 12:26 PM, Olga Natkovich wrote: Hi, I would like to propose to branch for Pig 0.6.0 release with the intent to have a release before the end of the year. We have done a lot of work since branching for Pig 0.5.0 that we would like to share with users. This includes changing how bags are spilled onto disk (PIG-975, PIG-1037), skewed and fragment-replicated outer join plus many other performance improvements and bug fixes. Please vote by Thursday. Thanks, Olga
Re: package org.apache.hadoop.zebra.parse missing
The parser package is generated as part of the build. Invoking ant in the contrib/zebra directory should result in the parser package being created at ./src-gen/org/apache/hadoop/zebra/parser. Alan.

On Nov 11, 2009, at 12:54 AM, Min Zhou wrote: Hi guys, I checked out pig from trunk, and found the package org.apache.hadoop.zebra.parse missing. Can you confirm this package has been committed? See this link: http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/ Min -- My research interests are distributed systems, parallel computing and bytecode based virtual machines. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com
Re: FYI - forking TFile off Hadoop into Zebra
On Nov 11, 2009, at 4:13 PM, Ashutosh Chauhan wrote: On Wed, Nov 11, 2009 at 18:26, Chao Wang ch...@yahoo-inc.com wrote: Last, we would like to point out that this is a short term solution for Zebra and we plan to: 1) port all changes to Zebra TFile back into Hadoop TFile, and 2) in the long run have a single unified solution for this. Just for clarity: in the long run, as Zebra stabilizes and Pig adopts hadoop-0.22, Zebra will get rid of this fork? I think the promise is they'll get rid of the fork at some point, not necessarily at 0.22 though. Alan. Ashutosh
Re: optimizer hints in Pig
In general I think optimizer hints fit well with Pig's approach to data processing, as expressed in our philosophic statement that Pigs are domestic animals (see http://hadoop.apache.org/pig/philosophy.html). At least in the examples you give, I don't see 'with' as binding. The user is giving Pig information; it can choose how to use it, or not to use it at all. I would like 'using' to continue to be binding, as in that case the user is explicitly telling Pig to do something in a particular way. Alan.

On Nov 14, 2009, at 2:07 PM, Ashutosh Chauhan wrote: Hi All, We would like to know what Pig devs feel about optimizer hints. Traditionally, optimizer hints have been received with mixed reactions in the RDBMS world. Oracle provides lots of knobs[1][2] to turn and tune, while postgres[3][4] has tried to stay away from them. Mysql has a few of them (e.g., straight_join). Surajit Chaudhuri [5] (Microsoft) makes a case in favor of them. More specifically, I am talking of hints like the following:

a = filter 'mydata' by myudf($1) with selectivity 0.5;
(This lets the user tell Pig that myudf filters out nearly half of the tuples of 'mydata'.)

c = join a by $0, b by $0 with selectivity a.$0 = b.$0, 0.1;
(This lets the user tell Pig that only 10% of the keys in a will match those in b.)

The exact syntax isn't important; it could be adapted. But the question is: does this seem a useful enough idea to be added to Pig Latin? Pig's case is slightly different from other SQL engines in that while other systems treat them as hints and thus are free to ignore them, Pig treats hints as commands, in the sense that it will go ahead and fail even if it can figure out that the hint will result in failure of the query. Perhaps Pig can interpret using as a command and with as a hint. Thoughts? Ashutosh

[1] http://www.dba-oracle.com/art_otn_cbo_p7.htm
[2] http://www.dba-oracle.com/oracle11g/oracle_11g_extended_optimizer_statistics.htm
[3] http://archives.postgresql.org/pgsql-hackers/2006-10/msg00663.php
[4] http://archives.postgresql.org/pgsql-hackers/2006-08/msg00506.php
[5] portal.acm.org/ft_gateway.cfm?id=1559955&type=pdf
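As a rough illustration of how such a hint could feed an optimizer's decision, here is a hypothetical Java sketch. The names, the threshold, and the decision rule are all made up (as noted in an earlier thread, Pig has no costing today); the point is only that a user-supplied selectivity turns directly into a cardinality estimate:

// Hypothetical: use a user-supplied selectivity hint to estimate the
// output size of a filter and decide whether the result is small enough
// for a fragment-replicate (FR) join.
public class SelectivityDemo {
    // Assumed size limit for the replicated side of an FR join.
    static final long FR_JOIN_THRESHOLD_BYTES = 100L * 1024 * 1024;

    static long estimateFilterOutput(long inputBytes, double selectivity) {
        return (long) (inputBytes * selectivity);
    }

    public static void main(String[] args) {
        long input = 2L * 1024 * 1024 * 1024;  // a 2 GB relation
        double hint = 0.01;                    // "with selectivity 0.01"
        long estimated = estimateFilterOutput(input, hint);
        System.out.println(estimated <= FR_JOIN_THRESHOLD_BYTES
            ? "estimated " + estimated + " bytes: FR join is a candidate"
            : "estimated " + estimated + " bytes: use a regular hash join");
    }
}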
Welcome Jeff Zhang
All, I would like to welcome Jeff Zhang as our newest Pig committer. Jeff has been contributing to Pig for about nine months now. He's been active on the mailing lists, in contributing patches, and in helping other users with their patches. Congratulations Jeff, and thanks for your contributions to Pig. Alan.
Yahoo is hiring for Hadoop development
All, Yahoo has a number of Hadoop development positions open. There are engineering, architect, management, and QA positions all open. See http://developer.yahoo.net/blogs/hadoop/2009/11/updated_do_you_have_what_it_ta.html for details. Alan.
Re: TPC-H benchmark
I don't know of any. Officially Pig cannot publish a TPC-H number because it is not a transaction based store. But I still think it would be very interesting to see the results if someone took the time to translate the queries. Alan. On Nov 22, 2009, at 6:20 PM, RichardGUO Fei wrote: Hi, Apart from Pig Performance and Pig Mix, do you know any TPC-H benchmark rewritten for Pig? Thanks, Richard
Re: Why we name it zebra ?
On Nov 26, 2009, at 7:39 AM, Jeff Zhang wrote: Hi all, I'd like to know where the name zebra comes from. Does it convey the idea that this metadata system's columnar storage format is like the stripes on a zebra's skin? Pretty much, yes. We've fallen into the habit of giving animal names to projects. We discussed several animals but zebra won. Alan. Thank you, Jeff Zhang
Re: Pig reading hive columnar rc tables
On Nov 30, 2009, at 12:18 PM, Dmitriy Ryaboy wrote: That's awesome, I've been itching to do that but never got around to it. Gerrit, do you have any benchmarks on read speeds? I don't know about putting this in piggybank, as it carries with it pretty significant dependencies, increasing the size of the jar and making it difficult for users who don't need it to build piggybank in the first place. We might want to consider some other contrib for it -- maybe a misc contrib that would have individual ant targets for these kinds of compatibility submissions?

Does it have to increase the size of the piggybank jar? Instead of including Hive in our piggybank jar, which I agree would be bad, can we just say that if you want to use this function you need to provide the appropriate Hive jar yourself? This way we could use ivy to pull the jars and build piggybank. I'm not really wild about creating a new section of contrib just for functions that have heavier weight requirements. Alan.

-D

On Mon, Nov 30, 2009 at 3:09 PM, Olga Natkovich ol...@yahoo-inc.com wrote: Hi Gerrit, It would be great if you could contribute the code. The process is pretty simple: - Open a JIRA that describes what the loader does and says that you would like to contribute it to the Piggybank. - Submit a patch that contains the loader. Make sure it has unit tests and javadoc. Once this is done, one of the committers will review and commit the patch. More details on how to contribute are at http://wiki.apache.org/pig/PiggyBank. Olga

-----Original Message----- From: Gerrit van Vuuren [mailto:gvanvuu...@specificmedia.com] Sent: Friday, November 27, 2009 2:42 AM To: pig-dev@hadoop.apache.org Subject: Pig reading hive columnar rc tables

Hi, I've coded a LoadFunc implementation that can read from Hive Columnar RC tables; this is needed for a project that I'm working on because all our data is stored using the Hive thrift serialized Columnar RC format. I have looked at the piggybank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs. There are still some improvements I would like to make, such as setting the number of mappers based on date partitioning. It's been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read. I would like to contribute the class to the piggybank; can you guide me in what I need to do? I've used Hive specific classes to implement this; is it possible to add this to the piggybank build ivy for automatic download of the dependencies? Thanks, Gerrit Jansen van Vuuren
Re: SQL in Pig?
We are still actively working on adding SQL to Pig. We hope to have an updated patch posted to that JIRA in February or March. Alan. On Jan 18, 2010, at 4:15 PM, Michael Dalton wrote: Hi, What's the current status of SQL support in Pig? I looked at the JIRA (http://issues.apache.org/jira/browse/PIG-824) and it seems like there hasn't been any activity on adding SQL to Pig since August. I was just curious whether that's something that's still being actively developed, is of interest to the Pig development team, and will be integrated at some point. Thanks. Best regards, Mike
Backward compatibility
Over the last year the number of Pig users has grown, both in terms of absolute numbers and the number of different companies using it. However, it is going to be a little while yet before Pig reaches a maturity level where it can declare a 1.0 release and promise it won't break backward compatibility until 2.0. So I think we need to discuss how we intend to handle backward compatibility across releases. The scope of what I'm covering under backward compatibility is all the interfaces and classes in org.apache.pig, the Pig Latin language, and the data formats that Pig's bundled loaders read and write. I propose the following criteria for deciding when to break backward compatibility: 1) We shouldn't break it without a strong reason. A strong reason is a show stopping bug, a compelling new feature, large gains in performance, or a change in architecture that significantly eases Pig use or development. Examples would be things like the load/store redesign, which should make it much easier to write load and store functions. 2) Where possible we should bundle disruptions of an interface together rather than spread them across releases. This avoids the death by a thousand cuts of having interfaces change a little bit each release. Thoughts? Alan.
Re: reading/writing HBase in Pig
On Jan 18, 2010, at 10:14 PM, Michael Dalton wrote: I took a look at the load-store branch and that definitely seems like the right place to do this. So the right thing to do would be to just open up a JIRA and then post a patch against the load-store rewrite tree, correct? Yes. You should take a look at PIG-1200, which seems to be going part way towards doing what you want to do. Alan.
Begin a discussion about Pig as a top level project
You have probably heard by now that there is a discussion going on in the Hadoop PMC as to whether a number of the subprojects (Hbase, Avro, Zookeeper, Hive, and Pig) should move out from under the Hadoop umbrella and become top level Apache projects (TLP). This discussion has picked up recently since the Apache board has clearly communicated to the Hadoop PMC that it is concerned that Hadoop is acting as an umbrella project with many disjoint subprojects underneath it. They are concerned that this gives Apache little insight into the health and happenings of the subproject communities, which in turn means Apache cannot properly mentor those communities. The purpose of this email is to start a discussion within the Pig community about this topic. Let me first cover what becoming a TLP would mean for Pig, and then I'll go into what options I think we as a community have.

Becoming a TLP would mean that Pig would itself have a PMC that would report directly to the Apache board. Who would be on the PMC is something we as a community would need to decide. Common options would be to say all active committers are on the PMC, or all active committers who have been a committer for at least a year. We would also need to elect a chair of the PMC. This lucky person would have no additional power, but would have the additional responsibility of writing quarterly reports on Pig's status for Apache board meetings, as well as coordinating with Apache to get accounts for new committers, etc. For more information see http://www.apache.org/foundation/how-it-works.html#roles

Becoming a TLP would not mean that we are ostracized from the Hadoop community. We would continue to be invited to Hadoop Summits, HUGs, etc. Since all Pig developers and users are by definition Hadoop users, we would continue to be a strong presence in the Hadoop community.

I see three ways that we as a community can respond to this: 1) Say yes, we want to be a TLP now. 2) Say yes, we want to be a TLP, but not yet. We feel we need more time to mature. If we choose this option we need to be able to clearly articulate how much time we need and what we hope to see change in that time. 3) Say no, we feel the benefits of staying with Hadoop outweigh the drawbacks of being a disjoint subproject. If we choose this, we need to be able to say exactly what those benefits are and why we feel they will be compromised by leaving the Hadoop project. There may be other options that I haven't thought of. Please feel free to suggest any you think of. Questions? Thoughts? Let the discussion begin. Alan.
JIRA Fix Version
A reminder to Pig committers: When closing a JIRA issue as Resolved/Fixed please make sure to set the Fix Version field. This helps our users know what versions they need to use to get fixes for their issues. And it helps release managers when they build releases to know what is and isn't in the release they're building. There were ~170 issues in Pig's JIRA marked fixed but with no version. I've assigned most of them to the appropriate version. Alan.
Re: Begin a discussion about Pig as a top level project
So far I haven't seen any feedback on this. Apache has asked the Hadoop PMC to submit input in April on whether some subprojects should be promoted to TLPs. We, the Pig community, need to give feedback to the Hadoop PMC on how we feel about this. Please make your voice heard. So now I'll heed my own call and give my thoughts on it.

The biggest advantage I see to being a TLP is a direct connection to Apache. Right now all of the Pig team's interaction with Apache is through the Hadoop PMC. Being directly connected to Apache would benefit Pig team members, who would have a better view into Apache. It would also raise our profile in Apache and thus make other projects more aware of us.

However, I am concerned about losing Pig's explicit connection to Hadoop. This concern has a couple of dimensions. One, Hadoop and MapReduce are the current flavor of the month in computing. Given that Pig shares a name with the common farm animal, it's hard to be sure based on search statistics. But Google trends shows that hadoop is searched on much more frequently than hadoop pig or apache pig (see http://www.google.com/trends?q=hadoop%2Chadoop+pig). I am guessing that most Pig users come from Hadoop users who discover Pig via Hadoop's website. Losing that subproject tab on Hadoop's front page may radically lower the number of users coming to Pig to check out our project. I would argue that this benefits Hadoop as well, since high level languages like Pig Latin have the potential to greatly extend the user base and usability of Hadoop.

Two, being explicitly connected to Hadoop keeps our two communities aware of each other's needs. There are features proposed for MR that would greatly help Pig. By staying in the Hadoop community Pig is better positioned to advocate for and help implement and test those features. The response to this will be that Pig developers can still subscribe to Hadoop mailing lists, submit patches, etc. That is, they can still be part of the Hadoop community. Which reinforces my point that it makes more sense to leave Pig in the Hadoop community, since Pig developers will need to be part of that community anyway.

Finally, philosophically it makes sense to me that projects that are tightly connected belong together. It strikes me as strange to have Pig as a TLP completely dependent on another TLP. Hadoop was originally a subproject of Lucene. It moved out to be a TLP when it became obvious that Hadoop had become independent of and useful apart from Lucene. Pig is not in that position relative to Hadoop.

So, I'm -1 on Pig moving out. But this is a soft -1. I'm open to being persuaded that I'm wrong or that my concerns can be addressed while still having Pig as a TLP. Alan.

On Mar 19, 2010, at 10:59 AM, Alan Gates wrote: [original message quoted in full above]
Re: Begin a discussion about Pig as a top level project
Do we still intend to position Pig as a data flow language that is backend agnostic? If the answer is yes, then there is a strong case for making Pig a TLP. Are we influenced by Hadoop? A big YES! The reason Pig chose to become a Hadoop sub-project was to ride the Hadoop popularity wave. As a consequence, we chose to be heavily influenced by the Hadoop roadmap. Like a good lawyer, I also have rebuttals to Alan's questions :) 1. Search engine popularity - We can discuss this with the Hadoop team and still retain links to TLPs that are coupled (loosely or tightly). 2. Explicit connection to Hadoop - I see this as logical connection v/s physical connection. Today, we are physically connected as a sub-project. Becoming a TLP will not increase/decrease our influence on the Hadoop community (think Logical, Physical and MR Layers :) 3. Philosophy - I have already talked about this. The tight coupling is by choice. If Pig continues to be a data flow language with clear syntax and semantics then someone can implement Pig on top of a different backend. Do we intend to take this approach? I just wanted to offer a different opinion to this thread. I strongly believe that we should think about the original philosophy. Will we have a Pig standards committee that will decide on changes to the language (think C/C++) if there are multiple backend implementations? I will reserve my vote based on the outcome of the philosophy and backward compatibility discussions. If we decide that Pig will be treated and maintained like a true language with clear syntax and semantics then we have a strong case to make it into a TLP. If not, we should retain our existing ties to Hadoop and make Pig into a data flow language for Hadoop. Santhosh

-----Original Message----- From: Thejas Nair [mailto:te...@yahoo-inc.com] Sent: Friday, April 02, 2010 4:08 PM To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy Subject: Re: Begin a discussion about Pig as a top level project

I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and heavily influenced by its roadmap. I think it makes sense to continue as a sub-project of hadoop. -Thejas

On 3/31/10 4:04 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Over time, Pig is increasing its coupling to Hadoop (for good reasons), rather than decreasing it. If and when Pig becomes a viable entity without hadoop around, it might make sense as a TLP. As is, I think becoming a TLP will only introduce unnecessary administrative and bureaucratic headaches. So my vote is also -1. -Dmitriy

On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates ga...@yahoo-inc.com wrote: [Alan's message quoted in full in the earlier post]
Re: Begin a discussion about Pig as a top level project
Prognostication is a difficult business. Of course I'd love it if someday there were an ISO Pig Latin committee (with meetings in cool exotic places) deciding the official standard for Pig Latin. But that seems like saying in your startup's business plan, "When we reach Google's size, then we'll do x." If there ever is an ISO Pig Latin standard, it will be years off. As others have noted, staying tight to Hadoop now has many advantages, both in technical and adoption terms. Hence my advocacy of keeping Pig Latin Hadoop agnostic while tightly integrating the backend. Which is to say that in my view, Pig is Hadoop specific now, but there may come a day when that is no longer true. Whether Pig will ever move past just running on Hadoop to running on other parallel systems won't be known for years to come. Given that, do you think it makes sense to say that Pig stays a subproject for now, but if it someday grows beyond Hadoop only, it becomes a TLP? I could agree to that stance. Alan.

On Apr 3, 2010, at 12:43 PM, Santhosh Srinivasan wrote: I see this as a multi-part question. Looking back at some of the significant roadmap/existential questions asked in the last 12 months, I see the following: 1. With the introduction of SQL, what is the philosophy of Pig? (I sent an email about this approximately 9 months ago.) 2. What is the approach to supporting backward compatibility in Pig? (Alan sent an email about this 3 months ago.) 3. Should Pig be a TLP? (the current email thread). Here is my take on answering the aforementioned questions. The initial philosophy of Pig was to be backend agnostic. It was designed as a data flow language. Whenever a new language is designed, the syntax and semantics of the language have to be laid out. The syntax is usually captured in the form of a BNF grammar. The semantics are defined by the language creators. Backward compatibility is then a question of holding true to the syntax and semantics. With Pig, in addition to the language, the Java APIs were exposed to customers to implement UDFs (load/store/filter/grouping/row transformation etc.), provision looping since the language does not support looping constructs, and also support a programmatic mode of access. Backward compatibility in this context is to support API versioning. Do we still intend to position Pig as a data flow language that is backend agnostic? If the answer is yes, then there is a strong case for making Pig a TLP. Are we influenced by Hadoop? A big YES! The reason Pig chose to become a Hadoop sub-project was to ride the Hadoop popularity wave. As a consequence, we chose to be heavily influenced by the Hadoop roadmap. Like a good lawyer, I also have rebuttals to Alan's questions :) 1. Search engine popularity - We can discuss this with the Hadoop team and still retain links to TLPs that are coupled (loosely or tightly). 2. Explicit connection to Hadoop - I see this as logical connection v/s physical connection. Today, we are physically connected as a sub-project. Becoming a TLP will not increase/decrease our influence on the Hadoop community (think Logical, Physical and MR Layers :) 3. Philosophy - I have already talked about this. The tight coupling is by choice. If Pig continues to be a data flow language with clear syntax and semantics then someone can implement Pig on top of a different backend. Do we intend to take this approach? I just wanted to offer a different opinion to this thread. I strongly believe that we should think about the original philosophy. Will we have a Pig standards committee that will decide on changes to the language (think C/C++) if there are multiple backend implementations? I will reserve my vote based on the outcome of the philosophy and backward compatibility discussions. If we decide that Pig will be treated and maintained like a true language with clear syntax and semantics then we have a strong case to make it into a TLP. If not, we should retain our existing ties to Hadoop and make Pig into a data flow language for Hadoop. Santhosh

[remainder of the thread, quoted in full in the earlier posts, snipped]
Re: TypeCheckingVisitor and casting to less precise numeric types
You are correct that all of these casts can be done. We omitted them deliberately because, as you said, we did not want to lose precision. We should be able to downcast when users ask for it explicitly, but we don't want to do it implicitly. Alan.

On Mar 24, 2010, at 2:47 PM, Anil Chawla wrote: Hi, I know that Pig has logic for casting inputs to the expected data types when invoking a UDF, and I understand that this logic resides in the TypeCheckingVisitor class. I am curious to know why certain casts have been omitted from the castLookup map. Specifically, I do not see any entries for casting a more precise numeric type (e.g. Double) to a less precise numeric type (e.g. Integer). Is there any reason why all down conversions of numeric types have been omitted? Is it because we do not want to perform any automatic casts that lead to a loss of precision (loss of data)? In my situation, we are trying to abstract all numeric data types into a single number type. If a UDF takes a numeric parameter, we want Pig to invoke that UDF with any numeric argument, regardless of whether the argument must be upconverted or downconverted. We are OK with the loss of precision in that circumstance. As a result, we added the following to the castLookup map:

castLookup.put(DataType.LONG, DataType.INTEGER);
castLookup.put(DataType.FLOAT, DataType.LONG);
castLookup.put(DataType.FLOAT, DataType.INTEGER);
castLookup.put(DataType.DOUBLE, DataType.FLOAT);
castLookup.put(DataType.DOUBLE, DataType.LONG);
castLookup.put(DataType.DOUBLE, DataType.INTEGER);

All of these casts seem to work fine in our tests. Other than loss of precision, is there any reason why adding these casts might be a bad idea? Thanks, -Anil
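For what it's worth, the precision loss in question is just Java's narrowing-conversion semantics. A minimal standalone illustration (not Pig code) of what an implicit downcast would silently allow:

// Demonstrates the data loss that implicit numeric downcasts would allow.
public class NarrowingDemo {
    public static void main(String[] args) {
        double d = 1234567890.987;
        long l = (long) d;                // fraction silently dropped: 1234567890
        int i = (int) 9999999999L;        // only low 32 bits kept: 1410065407
        float f = (float) 1.0000000001d;  // rounds to 1.0f
        System.out.println(l + " " + i + " " + f);
    }
}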
Re: Shouldn't hadoop18.jar be removed from lib of trunk?
It should be removed. I filed https://issues.apache.org/jira/browse/PIG-1388 so we'll remember to remove it in 0.8. Alan. On Apr 21, 2010, at 10:24 PM, chaitanya krishna wrote: Hi, Since pig-trunk now supports hadoop-0.20 and already has hadoop20.jar, shouldn't hadoop18.jar be removed from it? I think it is redundant now. Or am I missing something? Regards, V.V.Chaitanya.
Re: Consider cleaning up backend code
A couple of years ago we had this concept that Pig as is should be able to run on other backends (like, say, Dryad if it were open source). So we built this whole backend interface and (mostly) kept Hadoop specific objects out of the front end. Recently we have modified that stance and said that this implementation of Pig is Hadoop specific. Pig Latin itself will still stay Hadoop independent. So the ability to have multiple backends is fine. But the ability to have non-Hadoop backends is not really interesting now. So I at least see the proposal here as getting rid of generic code that tries to hide the fact that we are working on top of Hadoop (things like DataStorage and ExecutionEngine). Alan. On Apr 22, 2010, at 4:14 PM, Arun C Murthy wrote: I read it as getting rid of concepts parallel to hadoop in src/org/apache/pig/backend/hadoop/datastorage. Is that true? thanks, Arun On Apr 22, 2010, at 1:34 PM, Dmitriy Ryaboy wrote: I kind of dig the concept of being able to plug in a different backend, though I definitely think we should get rid of the dead local-mode code. Can you give an example of how this will simplify the codebase? Is it more than just GenericClass foo = new SpecificClass(), and the associated extra files? -D On Thu, Apr 22, 2010 at 1:25 PM, Arun C Murthy a...@yahoo-inc.com wrote: +1 Arun On Apr 22, 2010, at 11:35 AM, Richard Ding wrote: Pig has an abstraction layer (interfaces and abstract classes) to support multiple execution engines. After PIG-1053, Hadoop is the only execution engine supported by Pig. I wonder if we should remove this layer of code and make Hadoop THE execution engine for Pig. This would greatly simplify the backend code. Thanks, -Richard
Re: When is the pig-0.7.0 and pig-0.8.0 scheduled to be released?
We've already branched for 0.7, which means we're not putting any new features in there, just critical bug fixes. We're extensively testing it now and hope to release it soon. We don't have a date for 0.8 yet. Alan. On Apr 23, 2010, at 2:08 AM, chaitanya krishna wrote: Hi, Can someone please tell me when pig-0.7.0 is planned to be released, i.e., when is the code-freeze date? Also, can someone tell me the relevant dates for pig-0.8.0? Thanks, V.V.Chaitanya.
Re: [VOTE] Release Pig 0.7.0 (candidate 0)
+1. Ran the tutorial and some simple smoke tests on my mac and on linux. Checked that the signature keys are good. Alan. On May 5, 2010, at 11:44 AM, Daniel Dai wrote: Hi, I have created a candidate build for Pig 0.7.0. A description of what is new and different is included in the release notes: http://people.apache.org/~daijy/pig-0.7.0-candidate-0/RELEASE_NOTES.txt Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup Please download, test, try it out and vote. The download link is: http://people.apache.org/~daijy/pig-0.7.0-candidate-0 Thanks Daniel
[Travel Assistance] - Applications Open for ApacheCon NA 2010
The Travel Assistance Committee is now accepting applications from those wanting to attend ApacheCon North America (NA) 2010, which is taking place between the 1st and 5th of November in Atlanta. The Travel Assistance Committee is looking for people who would like to attend ApacheCon but who need some financial support in order to get there. There are limited places available, and all applications will be scored on their individual merit. Financial assistance is available to cover travel to the event, either in part or in full, depending on circumstances. However, the support available for those attending only the barcamp is smaller than that for people attending the whole event. The Travel Assistance Committee aims to support all ApacheCons and cross-project events, so it may be prudent for those in Asia and the EU to wait for an event closer to them. More information can be found on the main Apache website at http://www.apache.org/travel/index.html - where you will also find a link to the online application and details for submitting. Applications for travel assistance are now being accepted and will close on the 7th of July 2010. Good luck to all who apply. You are welcome to tweet or blog as appropriate. Regards, The Travel Assistance Committee.
Re: Code Repository
http://wiki.apache.org/pig/HowToContribute Alan. On May 20, 2010, at 9:15 PM, Renato Marroquín Mogrovejo wrote: Hi, is there a Pig coding standard, or any type of documentation I could follow? Thanks. Renato M.
Re: About PigPen
The one on the JIRA is more up to date. However, be aware that PigPen has not been updated since Pig 0.2 and does not work with new versions of Pig. Alan. On May 23, 2010, at 11:25 PM, Renato Marroquín Mogrovejo wrote: Hi, does anybody know which of these is the current PigPen release? I found two links. The first one is from the wiki and the second one is from the jira. http://issues.apache.org/jira/secure/attachment/12393772/org.apache.pig.pigpen_0.0.1.jar https://issues.apache.org/jira/secure/attachment/12400858/PigPen.tgz Thanks in advance. Renato M.
Re: does EvalFunc generate the entire bag always ?
The default case is that UDFs that take bags (such as COUNT, etc.) are handed the entire bag at once. In the case where all UDFs in a foreach implement the Algebraic interface and the expression itself is algebraic, then the combiner will be used, thus significantly limiting the size of the bag handed to the UDF. The Accumulator interface does hand records to the UDF a few thousand at a time. Currently it has no way to turn off the flow of records. What you want might be accomplished by the LIMIT operator, which can be used inside a nested foreach. Something like:

C = foreach B {
    C1 = order A by $0;
    C2 = limit C1 5;
    generate myUDF(C2);
}

Alan. On May 26, 2010, at 11:59 AM, hc busy wrote: Hey, guys, how are bags passed to EvalFunc stored? I was looking at the Accumulator interface and it says that the reason why this is needed for COUNT and SUM is because EvalFunc always gives you the entire bag when the EvalFunc is run on a bag. I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), the code inside that does for(Tuple entry : inputDataBag){ stuff } was an actual iterator that iterated on the bag sequentially without necessarily having the entire bag in memory all at once. ?? Because it's an iterator, there's no way to do anything other than stream through it. I'm looking at this because Accumulator has no way of telling Pig "I've seen enough". It streams through the entire bag no matter what happens. (Like, hypothetically speaking, if I was writing a "5th item of a sorted bag" UDF: after I see the 5th item of a 5 million entry bag, I want to stop executing if possible.) Is there an easy way to make this happen?
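To make the combiner case above concrete, a minimal Pig Latin sketch (alias and file names made up): COUNT implements the Algebraic interface, so in a script like this the combiner can pre-aggregate each group's bag on the map side instead of handing the whole bag to the UDF in the reduce:

A = load 'myfile';
B = group A by $0;
C = foreach B generate group, COUNT(A);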
Hudson returning -1 on javadoc
Since its return from the hospital, Hudson has been returning -1 on all patches submitted, complaining about a broken javadoc tag. It turns out the bad tag snuck into the code whilst Hudson was away. I've checked in a fix, so Hudson should be happy again. Any patches that were flunked just for that one javadoc warning should be considered ok. Alan.
Re: does EvalFunc generate the entire bag always ?
I don't think it pushes limit yet in this case. Alan.

On Jun 1, 2010, at 1:44 PM, hc busy wrote:

Well, see, that's the thing: the 'order A by $0' is already n lg(n)... ahh, I see, my own example suffers from this problem. I guess I'm wondering how 'limit' works in conjunction with UDFs... A practical application escapes me right now, but if I do

C = foreach B {
    C1 = MyUdf(B.bag_on_b);
    C2 = limit C1 5;
}

does it know to push the limit in this case?
Re: algebraic optimization not invoked for filter following group?
For at least simple cases what's in the pseudocode should work. I hope someday soon we can start using the new logical optimizer work (in the experimental package) to build rules for the MR optimizer (like this combiner stuff) as well, which should be much easier to code. But it will be a while before we get there. I don't think this will automatically make it work for split, because I think it will see the split in the plan and that will make it choose not to optimize. Alan. On Jun 2, 2010, at 4:18 PM, Dmitriy Ryaboy wrote: It looks like right now, the combiner optimization does not kick in for a script like this:

data = load 'foo' using PigStorage() as (a, b, c);
grouped = group data by a;
filtered = filter grouped by COUNT(data) > 1000;

Looking at the code in CombinerOptimizer, it seems like the Filter bit is just pseudo-coded in comments. Are there complications there other than what is already noted, or is it just a matter of coding up the pseudocode? On that note -- assuming the optimization was implemented for Filter following group, would it automagically start working for Splits as well? -D
Re: SIZE() of relation
There have been several requests for this. I'm not a fan of it, because it makes it too easy to forget that you're forcing a single-reducer MR job to accomplish this. But I'm open to persuasion if everyone else disagrees. Alan. On Jun 11, 2010, at 7:27 PM, Russell Jurney wrote: This would be great. Save us from GROUP ALL/FOREACH, which is awkward. On Fri, Jun 11, 2010 at 7:14 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: It would be cool to just treat relations as bags in the general case. They kind of are, and kind of are not, which causes lots of user confusion. There are obvious users-doing-dumb-stuff scenarios that arise, though. I guess the Pig philosophy is that the user is the optimizer, though.. so maybe it's ok. -D On Fri, Jun 11, 2010 at 6:42 PM, Russell Jurney russell.jur...@gmail.com wrote: Would it be possible, and not a ton of work, to make the builtin SIZE() work on a relation? Reason being, I frequently do this:

B = GROUP A ALL;
C = FOREACH B GENERATE SIZE(A) AS total;
DUMP C;

And I would rather do this: DUMP SIZE(A); Russ
Re: the last job in the mapreduce plan
I've never seen a case where this happens. Is this a theoretical question or are you seeing this issue? Alan. On Jun 15, 2010, at 8:49 AM, Gang Luo wrote: Hi, is it possible that the last MapReduce job in the MR plan only loads something and stores it without any other processing in between? For example, when visiting some physical operator, we need to end the current MR operator after embedding the physical operator into the MR operator, and create a new MR operator for later physical operators. Unfortunately, the following physical operator is a store, the end of the entire query. In this case, the last MR operator only contains a load and a store without any meaningful work in between. This idle MapReduce job will degrade performance. Could this happen in Pig? Thanks, -Gang
Re: skew join in pig
On Jun 16, 2010, at 8:36 AM, Gang Luo wrote: Hi, there is something confusing me in the skew join (http://wiki.apache.org/pig/PigSkewedJoinSpec ). 1. Does the sampling job sample and build a histogram on both tables, or just one table (in this case, which one)? Just the left one. 2. The join job still takes the two tables as inputs, and shuffles tuples from the partitioned table to a particular reducer (one tuple to one reducer), and shuffles tuples from the streamed table to all reducers associated with one partition (one tuple to multiple reducers). Is that correct? Keys with small enough values to fit in memory are shuffled to reducers as normal. Keys that are too large are split between reducers on the left side, and replicated to all of those reducers that have the splits (not all reducers) on the right side. Does that answer your question? 3. Hot keys need more than one reducer. Are these reducers dedicated to this key only? Could they also take other keys at the same time? They take other keys at the same time. 4. For non-hot keys, my understanding is that they are shuffled to reducers based on the default hash partitioner. However, it could happen that all the keys shuffled to one reducer incur skew even if none of them is skewed individually. This is always the case in map reduce, though a good hash function should minimize the occurrences of this. Can someone give me some ideas on these? Thanks. -Gang Alan.
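For reference, the feature under discussion is requested in Pig Latin with the 'skewed' join keyword; a minimal sketch (alias and file names made up):

big = load 'big_table';
other = load 'other_table';
C = join big by $0, other by $0 using 'skewed';

The sampling job, histogram, and hot-key splitting described above all happen behind this one clause.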
Re: skew join in pig
Are you asking how many reducers are used to split a hot key? If so, the answer is as many as we estimate it will take to make the records for the key fit into memory. For example, if we have a key which we estimate has 10 million records, each record being about 100 bytes, and for each reduce task we have 400M available, then we will allocate 3 reducers for that hot key. We do not need to take into account any other keys sent to this reducer because reducers process rows one key at a time. Alan. On Jun 16, 2010, at 11:51 AM, Gang Luo wrote: Thanks for replying. It is much clearer now. One more thing to ask about the third question: how are reducers allocated to several hot keys? Hashing? Further, Pig doesn't divide the reducers into hot-key reducers and non-hot-key reducers, is that right? Thanks, -Gang
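Written out, the arithmetic behind Alan's example is: 10,000,000 records x 100 bytes = 1 GB of data for the hot key; 1 GB / 400 MB of memory per reduce task = 2.5, which rounds up to 3 reducers.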
Re: Avoiding serialization/de-serialization in pig
On Jun 28, 2010, at 5:51 PM, Dmitriy Ryaboy wrote: For what it's worth, I saw very significant speed improvements (an order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol buffer based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? I've been trying to play with this in my spare time but haven't gotten far yet. We're certainly open to looking at it and seeing how it performs. Alan. -D On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote: I have created a wiki page which puts together some ideas that can help in improving performance by avoiding/delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
Notes from Pig contributor workshop
On June 30th Yahoo hosted a Pig contributor workshop. Pig contributors from Yahoo, Twitter, LinkedIn, and Cloudera were present. The slides used for the presentations that day have been uploaded to http://wiki.apache.org/pig/PigTalksPapers . Here's a digest of what was discussed there. For those who were there, if I forgot anything please feel free to add it in.

Thejas Nair discussed his work on performance. In particular he has been looking into how to more efficiently de/serialize complex data types and when Pig can make use of lazy deserialization. Dmitriy Ryaboy brought up the question of whether Pig would be open to using Avro for de/serialization between Map and Reduce and between MR jobs. We concluded that we are open to using whatever is fast.

Richard Ding discussed the work he has been doing to make Pig run statistics available to users via the logs, to applications running Pig (such as workflow systems) via a new PigRunner API, and to developers via Hadoop job history files. Russell Jurney brought up that it would be nice if this API also included record input and output on a per-MR-job level so that users diagnosing issues with their Pig Latin scripts would have a better idea in which MR job things went wrong.

Ashutosh Chauhan gave an overview of the work that has been going on to add UDFs in scripting languages to Pig (PIG-928).

Daniel Dai talked about the rewrite of the logical optimizer that he has been doing, including an overview of the major rules being implemented in the new optimizer framework. Dmitriy indicated that he would really like to see pushing of limits into the RecordReader (so that we can terminate reading early) added to the list of rules. This would involve making use of the new optimizer framework in the MR optimizer. Alan Gates indicated that while he does not believe we should translate the entire set of MR optimizer visitors into the new framework until we've further tested the framework, this might be a good first test for the new optimizer in the MR optimizer.

Aniket Mokashi showed the work he's been doing to add a custom partitioner to Pig. He also covered his work to add the ability to reuse a relation that contains a single record with a single field as a scalar. Dmitriy pointed out that we need to make sure this uses the distributed cache to minimize strain on the namenode.

Pradeep Kamath gave a short presentation on Howl, the work he is leading to create a shared metadata system between Pig, Hive, and Map Reduce. Dmitriy noted that we need to get this work more in the open so others can participate and contribute.

Russell Jurney talked about his work on adding datetime types to Pig. He indicated he was interested in using Jodatime as the basis for this. There were some questions on how these types would be serialized in text files where the type information might be lost.

Olga Natkovich talked about areas the Yahoo Pig team would like to work on in the future, mostly focused on usability. These included changing our parser to one that will allow us to give better error messages; Dmitriy indicated he strongly preferred Antlr. They also included resurrecting support for the illustrate command, which we have let lapse. Richard and Ashutosh noted that how illustrate works internally needs some redesign, because currently it requires special code inside each physical operator. This makes it hard to maintain illustrate in the face of new operators, and pollutes the main code path during execution.
Instead it should be done via callbacks or some other solution. After these presentations the group took on a couple of topics for discussion. The first was how Pig should grow to become Turing complete. For this Dmitriy and Ning Liang presented Piglet, a Ruby library they use at Twitter to wrap Pig and provide branching, looping, functions, and modules. Several people in the group expressed concerns that growing Pig Latin itself to be Turing complete will result in a poorly thought out language with insufficient tools and too much maintenance in the future. One suggestion that was made was to create a Java interface that allowed users to directly construct Pig data flows. That is, this interface would (roughly) have a method for each Pig operator. Users could then construct Pig data flows directly in Java. Users who wished to use scripting languages could still access this with no additional work via Jython, JRuby, Groovy, etc. The second discussion centered on Pig's support for workflow systems such as Oozie and Azkaban. There have been proposals in the past that Pig switch to generate Oozie workflows instead of MR jobs. Alan indicated that he does not see the value of this. There have been proposals that Pig Latin be extended to include workflow controls. Dmitriy and Russell both
Announcing Howl development list
On Jul 14, 2010, at 2:11 AM, Jeff Hammerbacher wrote: Hey, Thanks for writing up these notes, they're very useful. Pradeep Kamath gave a short presentation on Howl, the work he is leading to create a shared metadata system between Pig, Hive, and Map Reduce. Dmitriy noted that we need to get this work more in the open so others can participate and contribute. Is there a public JIRA where one could follow this work? Any chance we can break it up into incremental milestones rather than have a single code drop as with previous large features in Pig? I understand it may be difficult to coordinate internal development with external user groups, but I hope the feedback from third parties might make such a process worthwhile. A wiki page outlining Howl is at http://wiki.apache.org/pig/Howl A howldev mailing list has been set up on Yahoo! groups for discussions on Howl. You can subscribe by sending mail to howldev-subscr...@yahoogroups.com . We plan on putting the code on github in a read only repository. It will be a few more days before we get there. It will be announced on the list when it is. Alan.
Restarting discussion on Pig as a TLP
Five months ago I started a discussion on whether Pig should become a top level project (TLP) at Apache instead of remaining a subproject of Hadoop (http://mail-archives.apache.org/mod_mbox/hadoop-pig-dev/201003.mbox/%3c006aea7c-8829-4788-ad7b-822396fa2...@yahoo-inc.com%3e ). At the time I voted against it (http://mail-archives.apache.org/mod_mbox/hadoop-pig-dev/201003.mbox/%3cf1484964-e774-48b7-9d45-6e57c7b09...@yahoo-inc.com%3e ), as did many others. However, I would like to restart that discussion now.

I gave several reasons for voting against it. First, I was worried that by losing our connection to Hadoop, Pig would lose its source of new users. I have since been assured by Hadoop members that Pig would be free to keep our tab on their page (as HBase has). Also, obviously we would still be welcomed at Hadoop get-togethers such as the various HUGs, Hadoop Summits, etc. So our connection does not seem in danger.

Second, I was concerned that by not being members of the Hadoop community we would lose influence with Hadoop. It is true that Pig developers will have to stay active in the Hadoop community, which will put a slight extra burden on them. But they are already bearing this burden, and whether or not the communities are governed by the same or separate PMCs will not affect this.

Finally, I said that philosophically it makes sense to me that all Hadoop related projects should stay under one umbrella. This still makes sense to me, and I do see this as a downside of Pig moving out of Hadoop.

In addition to the above, a few other things have happened over the intervening months to cause me to reconsider. Most importantly, it has become clear to me that Pig is operating as if it were a TLP inside Hadoop. We have four members on the Hadoop PMC, which means we have sufficient votes to elect our committers and release our products. Also, several Hadoop PMC members who have long experience in Apache projects have made clear to me that they believe Pig is ready to be a TLP.

I was also concerned about diversity in our PMC, since our project is Yahoo heavy. Given that 10 out of 12 committers are Yahoo employees, we need to work on this. But we do have experienced committers in three different organizations, and I think this gives us a sufficient base to work on it as a TLP.

So, in summary, I have switched my view on this from "not yet" to "now is a good time". I think Pig is ready to be a TLP. We have a community of contributors and users that is growing both in numbers and in diversity. We have a strong group of committers who I believe are ready to take on leadership of the project and who will benefit from being mentored by the larger Apache community. Thoughts? Alan.
August Pig contributor workshop
All, We will be holding the next Pig contributor workshop at Twitter on Wednesday, August 25 from 4-6. The tentative agenda is to discuss: Making Piggybank better Pig and Azkaban integration Plans for features in 0.9 An update on the Howl project Anyone contributing to or interested in contributing to Pig development is welcome to attend. Please RSVP by Friday, August 20th. Twitter is located at 795 Folsom St., Suite 600 in San Francisco. Alan.
[VOTE] Pig to become a top level Apache project
Earlier this week I began a discussion on Pig becoming a TLP (http://bit.ly/byD7L8 ). All of the received feedback was positive. So, let's have a formal vote. I propose we move Pig to a top level Apache project. I propose that the initial PMC of this project be the list of all currently active Pig committers (http://hadoop.apache.org/pig/whoweare.html ) as of 18 August 2010. I nominate Olga Natkovich as the chair of the PMC. (PMC chairs have no more power than other PMC members, but they are responsible for writing regular reports for the Apache board, assigning rights to new committers, etc.) I propose that as part of the resolution that will be forwarded to the Apache board we include that one of the first tasks of the new Pig PMC will be to adopt bylaws for the governance of the project. Alan. P.S. If this vote passes, the next step is that the proposal will be forwarded to the Hadoop PMC for discussion and vote. If the Hadoop PMC vote passes, a formal resolution is then drafted (see http://bit.ly/bvOTRq for an example resolution) and sent to the Apache board. The Apache board will then vote on whether to make Pig a TLP.
Re: August Pig contributor workshop
Confirming Olga and I will be there. Alan. On Aug 18, 2010, at 4:45 PM, Dmitriy Ryaboy wrote: Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy
Re: release notes in JIRA
+1 Backloading documentation is error prone and leads to not getting documentation done. Alan. On Aug 20, 2010, at 4:11 PM, Olga Natkovich wrote: Guys, After spending the last couple of days collecting information for Pig 0.8.0 documentation, I would like to propose a change for our patch process that would make my life easier :). I would like to ask developers working on patches with new customer facing features or user visible modifications to the existing features to fill in the Release Notes part of JIRA as part of their patch submission process. The Release Notes section should contain all the information that would be needed to create user documentation including - Feature definition - Cases in which feature is applicable - Notes indicating if this feature/changes to the feature breaks backward compatibility - Usage examples. (Please, make sure you actually run all the examples.) - Anything else that would assist users in using the feature. I would like to ask the reviewers to review the Release Notes as part of their patch review process. Please, let me know if you have any questions or concerns. Thanks, Olga
Re: [VOTE] Pig to become a top level Apache project
With 9 +1 votes and no -1s the vote passes. I will begin a vote on Hadoop general. Alan.
Re: Caster interface and byte conversion
This seems fine. Is the Pig engine at any point testing to see if the interface is implemented and if so calling toBytes, or is this totally for use inside the store functions themselves to serialize Pig data types? Alan. On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote: The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :) -D On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I just noticed that even though Utf8StorageConverter implements the various byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface -- and therefore can't be relied on when using modular Casters, like I am trying to do for the HBaseLoader. Since we don't want to introduce backwards-incompatible changes, I propose adding a ByteCaster interface that defines these methods, and extending Utf8StorageConverter to implement them (without actually changing the implementation at all). That way StoreFuncs that need to convert to bytes can use pluggable converters. Objections? -D
Re: is Hudson awol?
Yes, our friend Hudson is ill again. Giri, Hudson's doctor, should get a chance to look at it in a few days. Alan. On Aug 23, 2010, at 3:31 PM, Dmitriy Ryaboy wrote: Haven't heard anything from Hudson in a while... -D
Re: Caster interface and byte conversion
One other comment. By making this part of an interface that extends LoadCaster you are assuming the implementing class is both a load and store function. It makes more sense to have a separate StoreCaster interface rather than extending LoadCaster. Alan.
Re: Caster interface and byte conversion
On Aug 24, 2010, at 1:22 PM, Dmitriy Ryaboy wrote: As far as the toBytes methods -- I am not sure what they were originally for. They aren't actually called anywhere that I can find, except my new HBase stuff. You are right, I could make it two interfaces, but I consolidated them for simplicity of use/implementation. Now that I think about it, I can put all the methods into StoreCaster and just have a unioning interface for simplicity:

@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface LoadStoreCaster extends LoadCaster, StoreCaster {
}

Does that seem ok? Yeah, makes sense. Alan. -D
Fwd: hudson patch test jobs : hadoop pig and zookeeper
Begin forwarded message: From: Giridharan Kesavan gkesa...@yahoo-inc.com Date: August 24, 2010 4:38:46 PM PDT To: gene...@hadoop.apache.org Subject: hudson patch test jobs : hadoop pig and zookeeper Reply-To: gene...@hadoop.apache.org Hi, We have a new Hudson master, hudson.apache.org; hudson.zones.apache.org is retired. This means that we need to port all our patch test admin jobs for hadoop (common, hdfs, mapred), pig and zookeeper to the new Hudson master. I'm working on configuring the patch admin jobs with the new Hudson master, hudson.apache.org. (This is exactly the reason why the patch test builds are not running at the moment.) Thanks Giri
Re: Pig Contributor meeting notes
On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote: Wonderful, Dmitriy. It's a pity that I missed the contributor meeting. Were any slides shared? Jeff, we don't want to exclude our contributors who don't happen to live in the San Francisco Bay Area. If we could include you via Skype or some other technology we'd be happy to set it up on our end. Do you think something like that would work for you? Alan.
Re: Does Pig Re-Use FileInputLoadFuncs Objects?
I'm not 100% sure I understand the question. Are you asking if it reuses instances of a given load or store function? It should not. Alan. On Aug 31, 2010, at 7:28 PM, Russell Jurney wrote: Pardon the cross-post: does Pig ever re-use FileInputLoadFunc objects? We suspect state is being retained between different stores, but we don't actually know this. Figured I'd ask to verify the hunch. Our load func for our in-house format works fine with Pig scripts normally... but I have a pig script that looks like this:

LOAD thing1
SPLIT thing1 INTO thing2, thing3
STORE thing2 INTO thing2
STORE thing3 INTO thing3
LOAD thing4
SPLIT thing4 INTO thing5, thing6
STORE thing5 INTO thing5
STORE thing6 INTO thing6

And it works via PigStorage, but not via our FileInputLoadFunc. Russ
Re: help : error run pig
Pig is failing to connect to your namenode. Is the address Pig is trying to use (hdfs://master:54310/) correct? Can you connect using that string from the same machine using bin/hadoop (for example, bin/hadoop fs -ls hdfs://master:54310/)? Alan. On Sep 27, 2010, at 8:45 AM, Ngô Văn Vĩ wrote: I run Pig in Hadoop mode (Pig-0.7.0 and hadoop-0.20.2) and get this error:

ng...@master:~/pig-0.7.0$ bin/pig
10/09/27 08:39:40 INFO pig.Main: Logging error messages to: /home/ngovi/pig-0.7.0/pig_1285601980268.log
2010-09-27 08:39:40,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:54310/
2010-09-27 08:39:41,760 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 0 time(s).
2010-09-27 08:39:42,762 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 1 time(s).
2010-09-27 08:39:43,763 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 2 time(s).
2010-09-27 08:39:44,765 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 3 time(s).
2010-09-27 08:39:45,766 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 4 time(s).
2010-09-27 08:39:46,767 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 5 time(s).
2010-09-27 08:39:47,768 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 6 time(s).
2010-09-27 08:39:48,769 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 7 time(s).
2010-09-27 08:39:49,770 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 8 time(s).
2010-09-27 08:39:50,771 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 9 time(s).
2010-09-27 08:39:50,780 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage

Help me?? Thanks -- Ngô Văn Vĩ Công Nghệ Phần Mềm Phone: 01695893851
[jira] Updated: (PIG-519) allow for '#' to signify a comment in a PIG script
[ https://issues.apache.org/jira/browse/PIG-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-519: --- Resolution: Fixed Fix Version/s: types_branch Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Checked in modified version of the patch that just supported #!. Thanks Ian for the contribution. allow for '#' to signify a comment in a PIG script -- Key: PIG-519 URL: https://issues.apache.org/jira/browse/PIG-519 Project: Pig Issue Type: Wish Components: grunt Environment: linux/unix Reporter: Ian Holsman Priority: Trivial Fix For: types_branch Attachments: comment.patch, pig.pig in unix type operating systems, it is common to just run scripts directly from the shell. In order to do this scripts need to have the command to run them on the first line similar to #!/usr/bin/env pig - this patch allows you to just run scripts without specifying pig -f XXX -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
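With this change, a script can name the interpreter on its first line and be run directly from the shell; a minimal sketch (file contents and name made up):

#!/usr/bin/env pig
A = load 'myfile';
dump A;

Making the file executable then lets you invoke it as ./myscript.pig rather than pig -f myscript.pig, which is the convenience the issue asked for.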
[jira] Commented: (PIG-512) Expressions in foreach lead to errors
[ https://issues.apache.org/jira/browse/PIG-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647700#action_12647700 ] Alan Gates commented on PIG-512: In LogicalPlanCloneHelper, why do you need this:
{code}
protected void visit(LOCross cs) throws VisitorException {
    super.visit(cs);
}
{code}
Won't Java do that for you? What is the significance of the changes in TypeCheckingVisitor? Neither of these issues is big enough to require a new patch. The current one looks good (and big :) ).

Expressions in foreach lead to errors - Key: PIG-512 URL: https://issues.apache.org/jira/browse/PIG-512 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: types_branch Attachments: PIG-512.patch, PIG-512_1.patch

Use of expressions that use the same sub-expressions in foreach leads to translation errors. This issue is caused by sharing operators across nested plans. To remedy this issue, logical operators should be cloned and not shared across plans.
{code}
grunt> a = load 'a' as (x, y, z);
grunt> b = foreach a { exp1 = x + y; exp2 = exp1 + x; generate exp1, exp2; }
grunt> explain b;
2008-10-30 15:38:40,257 [main] WARN org.apache.pig.PigServer - bytearray is implicitly casted to double under LOAdd Operator
2008-10-30 15:38:40,258 [main] WARN org.apache.pig.PigServer - bytearray is implicitly casted to double under LOAdd Operator
2008-10-30 15:38:40,258 [main] WARN org.apache.pig.PigServer - bytearray is implicitly casted to double under LOAdd Operator
Logical Plan:
Store sms-Thu Oct 30 11:27:27 PDT 2008-2609 Schema: {double,double} Type: Unknown
|
|---ForEach sms-Thu Oct 30 11:27:27 PDT 2008-2605 Schema: {double,double} Type: bag
    |
    | Add sms-Thu Oct 30 11:27:27 PDT 2008-2600 FieldSchema: double Type: double
    | |
    | |---Cast sms-Thu Oct 30 11:27:27 PDT 2008-2606 FieldSchema: double Type: double
    | |   |
    | |   |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2598 Projections: [0] Overloaded: false FieldSchema: x: bytearray Type: bytearray
    | |       Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    | |
    | |---Cast sms-Thu Oct 30 11:27:27 PDT 2008-2607 FieldSchema: double Type: double
    |     |
    |     |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2599 Projections: [1] Overloaded: false FieldSchema: y: bytearray Type: bytearray
    |         Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    |
    | Add sms-Thu Oct 30 11:27:27 PDT 2008-2603 FieldSchema: double Type: double
    | |
    | |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2601 Projections: [*] Overloaded: false FieldSchema: double Type: double
    | |   Input: Add sms-Thu Oct 30 11:27:27 PDT 2008-2600
    | |
    | |---Add sms-Thu Oct 30 11:27:27 PDT 2008-2600 FieldSchema: double Type: double
    | |   |
    | |   |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2598 Projections: [0] Overloaded: false FieldSchema: x: bytearray Type: bytearray
    | |   |   Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    | |   |
    | |   |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2599 Projections: [1] Overloaded: false FieldSchema: y: bytearray Type: bytearray
    | |       Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    | |
    | |---Cast sms-Thu Oct 30 11:27:27 PDT 2008-2608 FieldSchema: double Type: double
    |     |
    |     |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2602 Projections: [0] Overloaded: false FieldSchema: x: bytearray Type: bytearray
    |         Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    |
    |---Load sms-Thu Oct 30 11:27:27 PDT 2008-2597 Schema: {x: bytearray,y: bytearray,z: bytearray} Type: bag
2008-10-30 15:38:40,272 [main] ERROR org.apache.pig.impl.plan.OperatorPlan - Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject multiple outputs. This operator does not support multiple outputs.
2008-10-30 15:38:40,272 [main] ERROR org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor - Invalid physical operators in the physical plan. Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject multiple outputs. This operator does not support multiple outputs.
2008-10-30 15:38:40,273 [main] ERROR org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Unable to explain alias b [org.apache.pig.impl.plan.VisitorException]
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile
{code}
[jira] Commented: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652642#action_12652642 ] Alan Gates commented on PIG-460: Here's a quick write-up of what will need to be done to change order by from being a 3 MR job process to 2. Currently sampling is done via org.apache.pig.impl.builtin.RandomSampleLoader. Since this loader extends BinStorage, the first MR job reads the data in whatever format and then stores it again using BinStorage. It is then read in the second job using RandomSampleLoader. The tuples that are selected by RandomSampleLoader are grouped into a single reducer and then fed to org.apache.pig.impl.builtin.FindQuantiles, which builds a side file containing partitioning information. The third MR job again reads the data and uses the side file in the SortPartitioner. (It may be helpful to do an explain on a simple order by query to see all this.) What needs to change is that RandomSampleLoader should instead become an EvalFunc, RandomSampler. The logic inside can remain the same. The MRCompiler will need to change to create two MR jobs for the sort instead of 3. The first job should contain a ForEach operator with the new RandomSampler function in the map. Its reduce should look just like the reduce of the second MR job in the current system (that is, singular and having a ForEach operator that calls FindQuantiles). The second job should remain exactly the same as the third job in the current system. Take a look at MRCompiler.visitSort() for an idea of how sort jobs are constructed now. It's this function and the functions it calls that you'll be changing in MRCompiler.

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2 Key: PIG-460 URL: https://issues.apache.org/jira/browse/PIG-460 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Alan Gates Assignee: Alan Gates Fix For: types_branch

Currently order by is done in three MR jobs:
job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort
It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it. If the logic in RandomSampleLoader was made into an operator instead of being in a loader then jobs 1 and 2 could be merged. On average job 1 takes about 15% of the time of an order by script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
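As the comment suggests, the three-job structure is easy to see by running explain on a trivial sort; a minimal sketch (file and alias names made up):
{code}
A = load 'myfile';
B = order A by $0;
explain B;
{code}
The plan printed at the end should show the three MapReduce jobs described above; after this change it would show two.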
[jira] Resolved: (PIG-6) Addition of Hbase Storage Option In Load/Store Statement
[ https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-6. -- Resolution: Fixed Fix Version/s: types_branch Hadoop Flags: [Reviewed] V01 patch checked in. Thanks Sam for stepping up and taking on this issue that many people had requested.

Addition of Hbase Storage Option In Load/Store Statement Key: PIG-6 URL: https://issues.apache.org/jira/browse/PIG-6 Project: Pig Issue Type: New Feature Environment: all environments Reporter: Edward J. Yoon Fix For: types_branch Attachments: hbase-0.18.1-test.jar, hbase-0.18.1.jar, PIG-6.patch, PIG-6_V01.patch

It needs to be able to load a full table in hbase. (maybe ... difficult? i'm not sure yet.) Also, as described below, it needs to compose an abstract 2d-table only with certain data filtered from the hbase array structure using an arbitrary query delimiter.
{code}
A = LOAD table('hbase_table');
or
B = LOAD table('hbase_table') Using HbaseQuery('Query-delimited by attributes timestamp') as (f1, f2[, f3]);
{code}
Once tests are done on my local machines, I will clarify the grammar and give you more examples to help explain more storage options. Any advice welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658573#action_12658573 ] Alan Gates commented on PIG-554: A couple of questions: 1) I'm still not clear on why the additional maps are needed to load the replicated inputs into files. Those inputs are already in files. Are you somehow transforming them? Isn't this exactly where we should be using the DistributedCache? Rather than having map jobs that transform them, I think the best thing would be to have the MRCompiler set a flag for the JobControlCompiler to load those files into the DC for this job. 2) You are using POLocalRearrange both in setting up the hash table and in reading the fragmented table before the join. What benefit is being derived from this? LR adds a lot of extra weight to the tuple that I don't think is needed. I suspect we could fit more tuples into memory if we loaded them directly rather than using LR.

Fragment Replicate Join --- Key: PIG-554 URL: https://issues.apache.org/jira/browse/PIG-554 Project: Pig Issue Type: New Feature Affects Versions: types_branch Reporter: Shravan Matthur Narayanamurthy Assignee: Shravan Matthur Narayanamurthy Fix For: types_branch Attachments: frjofflat.patch, frjofflat1.patch

Fragment Replicate Join (FRJ) is useful when we want a join between a huge table and a very small table (fitting-in-memory small) and the join doesn't expand the data by much. The idea is to distribute the processing of the huge file by fragmenting it and replicating the small file to all machines receiving a fragment of the huge file. Because the entire small file is available, the join becomes a trivial task without needing any break in the pipeline. Exhaustive tests have been done to determine the improvement we get out of FRJ. Here are the details: http://wiki.apache.org/pig/PigFRJoin The patch makes changes to parts of the code where new operators are introduced. Currently, when a new operator is introduced, its alias is not set. For schema computation I have modified this behaviour to set the alias of the new operator to that of its predecessor. The logical side of the patch mimics the cogroup behavior, as join syntax closely resembles that of cogroup. Currently, this patch doesn't have support for joins other than inner joins. The rest of the code has been documented. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
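For readers placing this feature: once FRJ landed, a script asks for it with the 'replicated' keyword on a join; a minimal sketch (alias and file names made up):
{code}
big = load 'big_table';
tiny = load 'small_table';
C = join big by $0, tiny by $0 using 'replicated';
{code}
The right-hand input is the one that must fit in memory on each map task, matching the fragment/replicate split described in the issue.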
[jira] Commented: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
[ https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660148#action_12660148 ] Alan Gates commented on PIG-572: The code in the patch looks fine. I have a couple of questions: # What's the use case driving this? If a user has their pig script in a file, why do we expect them to be using PigServer directly instead of grunt? # Why does the logical plan need to be serializable?

A PigServer.registerScript() method, which lets a client programmatically register a Pig Script. Key: PIG-572 URL: https://issues.apache.org/jira/browse/PIG-572 Project: Pig Issue Type: New Feature Affects Versions: types_branch Reporter: Shubham Chopra Priority: Minor Fix For: types_branch Attachments: registerScript.patch

A PigServer.registerScript() method, which lets a client programmatically register a Pig script. For example, say there's a script my_script.pig with the following content:

a = load '/data/my_data.txt';
b = filter a by $0 > '0';

The function lets you use something like the following:

pigServer.registerScript("my_script.pig");
pigServer.registerQuery("c = foreach b generate $2, $3;");
pigServer.store("c");

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions
[ https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660351#action_12660351 ] Alan Gates commented on PIG-596: Flattening a bag gets rid of two layers of containment, both the bag and the tuple. So the result of FLATTEN(bag(tuple(x, y, z))) is x, y, z, not tuple(x, y, z). At this point I believe tuples must be named in the LOAD statement as well as in foreach. I'm not necessarily voting against anonymous tuples. But I do believe Pig Latin is consistent in requiring names for tuples at the moment.

Anonymous tuples in bags create ParseExceptions --- Key: PIG-596 URL: https://issues.apache.org/jira/browse/PIG-596 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: David Ciemiewicz

{code}
One = load 'one.txt' using PigStorage() as ( one: int );
LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) };
AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug
Tuples = union LabelledTupleInBag, AnonymousTupleInBag;
dump Tuples;
{code}

java.io.IOException: Encountered { tuple at line 6, column 66. Was expecting one of: parallel ... ; ... , ... : ... ( ... { IDENTIFIER ... { } ... [ ...
    at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
    at org.apache.pig.Main.main(Main.java:306)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered { tuple at line 6, column 66.

Why can't there be an anonymous tuple at the top level of a bag? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
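A minimal sketch of the two-layer unwrapping the comment describes (file name and schema made up):
{code}
A = load 'myfile' as (mybag: bag { t: tuple (x: int, y: int, z: int) });
B = foreach A generate flatten(mybag); -- B's schema is (x, y, z), not a bag or a tuple
{code}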
[jira] Commented: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
[ https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660388#action_12660388 ] Alan Gates commented on PIG-572: Passes all the tests. I'd like to wait until the Christmas vacation is over to give other committers a chance to comment before checking it in. If I don't see any comments after a few days I'll check it in.

A PigServer.registerScript() method, which lets a client programmatically register a Pig Script. Key: PIG-572 URL: https://issues.apache.org/jira/browse/PIG-572 Project: Pig Issue Type: New Feature Affects Versions: types_branch Reporter: Shubham Chopra Priority: Minor Fix For: types_branch Attachments: registerScript.patch

A PigServer.registerScript() method, which lets a client programmatically register a Pig script. For example, say there's a script my_script.pig with the following content:

a = load '/data/my_data.txt';
b = filter a by $0 > '0';

The function lets you use something like the following:

pigServer.registerScript("my_script.pig");
pigServer.registerQuery("c = foreach b generate $2, $3;");
pigServer.store("c");

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
[ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660944#action_12660944 ] alangates edited comment on PIG-580 at 1/5/09 1:51 PM:

In CombinerOptimizer.visitDistinct you have:

{code}
+                if (sawDistinctAgg) {
+                    // We want to combine only in the case where there is
+                    // only one PODistinct, which is the only input to an agg.
+                    // We apparently have seen a PODistinct before, so let's
+                    // not combine.
+                    sawNonAlgebraic = true;
+                }
{code}

but I can envision a case where you want to count multiple distinct things:

{code}
A = load ...
B = group A by $0;
C = foreach B {
    Aa = A.$1;
    Ab = distinct Aa;
    Ba = A.$2;
    Bb = distinct Ba;
    generate group, COUNT(Ab), COUNT(Bb);
}
{code}

Is there a reason we need to not use the combiner with multiple distincts?

PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach

Key: PIG-580
URL: https://issues.apache.org/jira/browse/PIG-580
Project: Pig
Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: types_branch
Attachments: PIG-580-v2.patch, PIG-580.patch

Currently Pig uses the combiner only when there is a foreach following a group and the elements in the foreach's generate have the following characteristics:
1) simple project of the group column
2) Algebraic UDF

The above conditions exclude use of the combiner for distinct aggregates - the distinct operation itself is combinable (irrespective of whether it feeds an algebraic or non-algebraic udf). So the following foreach should also be combinable:

{code}
..
b = group a by $0;
c = foreach b {
    x = distinct a;
    generate group, COUNT(x), SUM(x.$1);
}
{code}

The combiner optimizer should cause the distinct to be combined and the final combine output should feed the COUNT() and SUM() in the reduce.
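For context, "algebraic" here refers to Pig's org.apache.pig.Algebraic interface: a UDF the combiner can evaluate in stages. A rough, simplified skeleton of a combinable count (modeled on, but not identical to, the built-in COUNT) might look like this:

{code}
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyCount extends EvalFunc<Long> implements Algebraic {

    public Long exec(Tuple input) throws IOException {
        // Non-combined path: count the whole bag directly.
        return ((DataBag) input.get(0)).size();
    }

    // Algebraic hands back the *class names* of the three stage functions.
    public String getInitial() { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal() { return Final.class.getName(); }

    static public class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            // Map side: emit a partial count for this chunk of the bag.
            long count = ((DataBag) input.get(0)).size();
            return TupleFactory.getInstance().newTuple(Long.valueOf(count));
        }
    }

    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            // Combiner: sum the partial counts.
            return TupleFactory.getInstance().newTuple(Long.valueOf(sum(input)));
        }
    }

    static public class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException {
            // Reduce: sum the combined partial counts into the answer.
            return Long.valueOf(sum(input));
        }
    }

    static protected long sum(Tuple input) throws IOException {
        DataBag values = (DataBag) input.get(0);
        long total = 0;
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            total += (Long) it.next().get(0);
        }
        return total;
    }
}
{code}

The optimizer places Initial in the map, Intermed in the combiner, and Final in the reduce. A distinct is combinable for the same reason: duplicate elimination can be applied to each partial set before the final pass.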
[jira] Commented: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
[ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660944#action_12660944 ] Alan Gates commented on PIG-580:

In CombinerOptimizer.visitDistinct you have:

{code}
+                if (sawDistinctAgg) {
+                    // We want to combine only in the case where there is
+                    // only one PODistinct, which is the only input to an agg.
+                    // We apparently have seen a PODistinct before, so let's
+                    // not combine.
+                    sawNonAlgebraic = true;
+                }
{code}

but I can envision a case where you want to count multiple distinct things:

{code}
A = load ...
B = group A by $0;
C = foreach B {
    Aa = A.$1;
    Ab = distinct Aa;
    Ba = A.$2;
    Bb = distinct Ba;
    generate group, COUNT(Ab), COUNT(Bb);
}
{code}

Is there a reason we need to not use the combiner with multiple distincts?

PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach

Key: PIG-580
URL: https://issues.apache.org/jira/browse/PIG-580
Project: Pig
Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: types_branch
Attachments: PIG-580-v2.patch, PIG-580.patch

Currently Pig uses the combiner only when there is a foreach following a group and the elements in the foreach's generate have the following characteristics:
1) simple project of the group column
2) Algebraic UDF

The above conditions exclude use of the combiner for distinct aggregates - the distinct operation itself is combinable (irrespective of whether it feeds an algebraic or non-algebraic udf). So the following foreach should also be combinable:

{code}
..
b = group a by $0;
c = foreach b {
    x = distinct a;
    generate group, COUNT(x), SUM(x.$1);
}
{code}

The combiner optimizer should cause the distinct to be combined and the final combine output should feed the COUNT() and SUM() in the reduce.
[jira] Created: (PIG-599) BufferedPositionedInputStream isn't buffered
BufferedPositionedInputStream isn't buffered

Key: PIG-599
URL: https://issues.apache.org/jira/browse/PIG-599
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch

org.apache.pig.impl.io.BufferedPositionedInputStream is not actually buffered. This is because it sits atop an FSDataInputStream (somewhere down the stack), which is buffered. So, to avoid double buffering, which can be bad, BufferedPositionedInputStream was written without buffering. But the FSDataInputStream is far enough down the stack that it is still quite costly to call read() individually for each byte. A run through a profiler shows that a fair amount of time is being spent in BufferedPositionedInputStream.read().
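A minimal sketch of the kind of fix this implies (illustrative only, not the attached patch; class name and buffer size are made up): keep an in-memory buffer inside the stream, serve single-byte read() calls from it, and refill from the underlying stream in bulk while tracking the logical position.

{code}
import java.io.IOException;
import java.io.InputStream;

// Wraps an underlying stream (e.g. an FSDataInputStream), serving
// read() calls from a local buffer while tracking logical position.
public class SimpleBufferedPositionedStream extends InputStream {
    private final InputStream in;
    private final byte[] buf = new byte[64 * 1024];
    private int pos = 0;          // next byte to hand out
    private int limit = 0;        // bytes currently in buf
    private long streamPos = 0;   // logical position in the stream

    public SimpleBufferedPositionedStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        if (pos >= limit) {
            // One bulk read instead of one underlying call per byte.
            limit = in.read(buf, 0, buf.length);
            pos = 0;
            if (limit <= 0) return -1;  // EOF
        }
        streamPos++;
        return buf[pos++] & 0xff;
    }

    public long getPosition() {
        return streamPos;
    }
}
{code}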
[jira] Updated: (PIG-599) BufferedPositionedInputStream isn't buffered
[ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-599:

Status: Patch Available (was: Open)

BufferedPositionedInputStream isn't buffered

Key: PIG-599
URL: https://issues.apache.org/jira/browse/PIG-599
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch
Attachments: loadperf.patch

org.apache.pig.impl.io.BufferedPositionedInputStream is not actually buffered. This is because it sits atop an FSDataInputStream (somewhere down the stack), which is buffered. So, to avoid double buffering, which can be bad, BufferedPositionedInputStream was written without buffering. But the FSDataInputStream is far enough down the stack that it is still quite costly to call read() individually for each byte. A run through a profiler shows that a fair amount of time is being spent in BufferedPositionedInputStream.read().