What is a relation?
All, A question on types in Pig. When you say:

A = load 'myfile';

what exactly is A? For the moment let us call A a relation, since it is a set of records and we can pass it to a relational operator, such as FILTER, ORDER, etc. To clarify the question: is a relation equivalent to a bag? In some ways it seems to be in our current semantics. Certainly you can turn a relation into a bag:

A = load 'myfile';
B = group A all;

The schema of the relation B at this point is (group, A), where A is a bag. This does not necessarily mean that a relation is a bag, because an operation had to occur to turn the relation into a bag (the group all). But bags can be turned into relations, and then treated again as if they were bags:

C = foreach B {
    C1 = filter A by $0 > 0;
    generate COUNT(C1);
}

Here the bag A created in the previous grouping step is being treated as if it were a relation and passed to a relational operator, and the resulting relation (C1) is treated as a bag to be passed to COUNT. So at a very minimum it seems that a bag is a type of relation, even if not all relations are bags. But if top-level (non-nested) relations are bags, why isn't it legal to do:

A = load 'myfile';
B = A.$0;

The second statement would be legal nested inside a foreach, but is not legal at the top level. We have been aware of this discrepancy for a while and lived with it, but I believe it is time to resolve it. We've noticed that some parts of Pig assume an equivalence between bag and relation (e.g. the typechecker) and other parts do not (e.g. the syntax example above). This inconsistency is confusing to users and developers alike. As Pig Latin matures we need to strive to make it a logically coherent and complete language. So, thoughts on how it ought to be? The advantage I see in saying a relation is equivalent to a bag is simplicity of the language. There is no need to introduce another data type, and it allows full relational operations to occur both at the top level and nested inside foreach. But this simplicity also seems to me the downside. Are we decoupling the user so far from the underlying implementation that he will not be able to see the side effects of his actions? A top-level relation is presumably spread across many chunks, and any operation on it will require one or more map-reduce jobs, whereas a relation nested in a foreach is contained on one node. This also makes Pig much more complex, because while it may hide this level of detail from the user, it clearly has to understand the difference between top-level and nested operations and handle both cases. Alan.
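[Editor's note: for comparison, the projection that is illegal at the top level has a supported top-level spelling via FOREACH. A minimal sketch, not from the original thread; the input path is hypothetical:

A = load 'myfile';
-- B = A.$0;                  -- not legal at the top level today
B = foreach A generate $0;    -- the supported top-level projection
G = group A all;
C = foreach G {
    D = A.$0;                 -- the same expression is legal here, nested
    generate COUNT(D);
}]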
Re: Pig Team now has two new committers!
Congrats to both of you, an honor well earned. Alan. On Dec 9, 2008, at 8:51 AM, Olga Natkovich wrote: Hi, I am happy to announce that the Hadoop PMC voted to make Pradeep Kamath and Santhosh Srinivasan Pig committers to acknowledge their significant contributions to the project! Congratulations to Santhosh and Pradeep! Olga
Re: Pig performance
I left a comment on the blog addressing some of the issues he brought up. Alan. On Dec 20, 2008, at 1:00 AM, Jeff Hammerbacher wrote: Hey Pig team, Did anyone check out the recent claims about Pig's poor performance versus Cascading? Though I haven't worked extensively with either system, I found the statements made fairly bold and am curious to hear more about their validity from the Pig development team: http://www.manamplified.org/archives/2008/12/cascading-and-pig-planners.html . Thanks, Jeff
Re: Adaptive Query Optimization
There is no concept of costing in pig at this point. Currently we let the script writer decide when to choose an FR join over a symmetric hash join. We certainly welcome any work on an optimizer in pig. Be sure to take a look at https://issues.apache.org/jira/browse/PIG-360 where some work on an optimizer has already started. Alan. On Jan 16, 2009, at 10:51 AM, nitesh bhatia wrote: Hi I am working on adding adaptive behavior to the Pig execution model. Is there any pre-defined method to estimate execution time for Pig's simple join? I think for FRJoin some method will be required to estimate it. My idea is to design an adaptive query optimizer similar to that of the Glue-Nail deductive database system (http://portal.acm.org/citation.cfm?id=615194 ). --nitesh -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Re: switching to different parser in Pig
Of course, no two software tools are likely to do _exactly_ the same job. Someone already pointed you to ANTLR, which is probably the best-known alternative to JavaCC. Another possibility is SableCC. http://sablecc.org The criteria include stability, documentation, language of the parser generated, and abstract-syntax-tree building. When I last looked (a couple of years ago) at ANTLR, SableCC and JavaCC, I chose JavaCC for the following reasons:

1. ANTLR could not handle Unicode input. Things change, of course, so ANTLR might now be more Unicode-friendly. Unicode was important to me, so this was a big factor in my decision. On the plus side for ANTLR, it has better abstract-syntax-tree building capabilities (in my opinion) than JJTree/JavaCC. You can learn to use JJTree commands, but it's not easy for most people. And ANTLR can generate either a Java or a C++ parser. JavaCC generates only Java parsers. Another concern about ANTLR was that it was reputed to change a lot as the guru, Terence Parr, experimented with new syntax and functionality. JavaCC, at least at the time, was reputed to be more stable, perhaps stable to a fault. I wanted stability and reliability.

2. SableCC is much like JavaCC; it generates a Java parser from a grammar description; but it had, in my opinion, less flexible abstract-syntax-tree building than JJTree/JavaCC. In SableCC (when I looked at it), the AST it built was always a direct reflection of your grammar, generating one tree node for each grammar expansion involved in a parse, much like using JavaCC with Java Tree Builder (JTB http://www.cs.purdue.edu/jtb/). When using JavaCC, JTB is the alternative to using JJTree. Using SableCC, or the combination JavaCC/JTB, should be _very_ similar indeed. In my opinion, SableCC and JavaCC/JTB have made a conscious choice to simplify AST building--you get trees that reflect the expansions in your grammar. Period. But often these default trees will be big, full of extraneous nodes that reflect precedence hierarchies in the recursive-descent parsing. If you want to have more control over AST building, to get more compact and tailored ASTs, you need to pay the price of learning JJTree. Assuming that you need to build ASTs, with JavaCC you have the choice between JJTree and JTB. With SableCC, when I last looked at it, you only get the JTB-like option.

*** (Again, corrections and expansions would be much appreciated.) Ken Beesley

-- On Mon, Feb 23, 2009 at 10:06 PM, Alan Gates ga...@yahoo-inc.com wrote: We looked into antlr. It appears to be very similar to javacc, with the added feature that the java code it generates is humanly readable. That isn't why we want to switch off of javacc. Olga listed the 3 things we want out of a parser that javacc isn't giving us (lack of docs, no easy customization of error handling, decoupling of scanning and parsing). So antlr doesn't look viable. In response to Pi's suggestion that we could use the logical plan, I hope we could use something close to it. Whatever we choose we want it to be flexible enough to represent richer language constructs (like branch and loop). I'm not sure our current logical plan can do that. At the same time, we don't need another layer of translation (we already have
Fwd: Call For Paper ---- Grid and Cloud Middleware Workshop, in conjunction with GCC2009
Begin forwarded message: From: Yongqiang He heyongqi...@software.ict.ac.cn Date: March 1, 2009 10:18:03 PM PST To: core-u...@hadoop.apache.org, core-...@hadoop.apache.org, hbase-u...@hadoop.apache.org, hive-u...@hadoop.apache.org, hive-...@hadoop.apache.org Subject: Call for Paper - Grid and Cloud Middleware Workshop, in conjunction with GCC2009 Reply-To: hive-...@hadoop.apache.org

Call for Paper

Grid and cloud computing technologies both aim to aggregate distributed resources in local- or wide-area environments and to provide a uniform computing environment. The foundation of grid and cloud systems is the underlying middleware, which sustains a variety of applications through system-level abstractions and common functionality. Grid and cloud middleware brings together multiple related research issues: software architecture, naming, distributed data organization and storage, high performance data processing, task scheduling, and so on. This workshop is convened to promote the exchange of related information and to advance the research and development of grid and cloud middleware.

Topics include but are not limited to:
- Middleware architecture and implementation
- Virtualization, isolation and multi-tenant environments
- Distributed information organization
- Structured, semi-structured and unstructured data management and processing
- Map-Reduce and other novel programming models
- Languages, language extensions and tools for large scale computing
- Performance analysis/benchmarks
- Web based user interfaces
- Scheduling, security, monitoring, and accounting
- Applications and case studies

Important dates:
- Deadline for submission: April 15, 2009
- Notification of acceptance: May 15, 2009
- Delivery of camera-ready: June 5, 2009

For more, please visit: http://grid.lzu.edu.cn/gcc2009/item/item.jsp?id=5 -- Best regards! He Yongqiang Email: heyongqi...@software.ict.ac.cn Tel: 86-10-62600969(O) Fax: 86-10-626000900 Key Laboratory of Network Science and Technology / Research Center for Grid and Service Computing, Institute of Computing Technology, Chinese Academy of Sciences, No.3 Kexueyuan South Road, Beijing 100190, China
Re: scope string in OperatorKey
The purpose of the scope string is to allow us to have multiple sessions of pig running and to distinguish the operators between them. It's one of those things that was put in before an actual requirement, so whether it will prove useful or not remains to be seen. As for removing it from explain, is it still reasonably easy to distinguish operators without it? IIRC the OperatorKey includes an operator number. When looking at the explain plans this is useful for cases where there is more than one of a given type of operator and you want to be able to distinguish between them. Alan. On Mar 6, 2009, at 3:14 PM, Thejas Nair wrote: What is the purpose of the scope string in org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a pig daemon process? Is it ok to stop printing the scope part in explain output? It does not seem to add value and makes the output more verbose. Thanks, Thejas
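[Editor's note: for readers who want to see the keys in question, EXPLAIN from the grunt shell prints them. A minimal sketch, not from the original thread; the input path is hypothetical and the exact rendering of the key in the printout (e.g. scope-12) is illustrative:

A = load 'myfile';
B = filter A by $0 > 0;
explain B;
-- each operator in the printed logical/physical/MR plans carries its
-- OperatorKey, so two operators of the same type can be told apart]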
Re: [VOTE] Release Pig 1.0.0 (candidate 0)
README.txt still has the incubator text in it. This needs to be removed. I'll roll a new package and call a new vote. Alan. On Mar 17, 2009, at 3:21 PM, Olga Natkovich wrote: Pig Committers, I have created a candidate build for Pig 1.0.0. This release represents a major rewrite of Pig from the parser down. It also introduced a type system into Pig and greatly improved system performance. The rat report is attached. Note that there are many java files listed as being without a license header. All these files are generated by javacc. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download, test, and try it out: http://people.apache.org/~olga/pig-1.0.0-candidate-0 Should we release this? Vote closes on Friday, March 20th. Olga
Re: Ajax library for Pig
Sorry if these are silly questions, but I'm not very familiar with some of these technologies. So what you propose is that Pig would be installed on some dedicated server machine and a web server would be placed in front of it. Then client libraries would be developed that made calls to the web server. Would these client-side libraries include presentation in the browser, both for users submitting queries and for receiving results? Also, pig currently does not have a server mode, thus any web server would have to spin off threads that ran a pig job. If the above is what you're proposing, I think it would be great. Opening up pig to more users by making it browser accessible would be nice. Alan. On Apr 3, 2009, at 5:36 AM, nitesh bhatia wrote: Hi Since pig is getting a lot of usage in industry and universities, how about adding front-end support for Pig? The plan is to write a jquery/dojo type of general JavaScript/AJAX library which can be used over any server technology (php, jsp, asp, etc.) to call pig functions over the web. Direct Web Remoting (DWR - http://directwebremoting.org ), an open source project at Java.net, provides functionality that allows JavaScript in a browser to interact with Java on a server. Can we write a JavaScript library exclusively for Pig using DWR? I am not sure about licensing issues. The major advantages I can point to are: - Use of Pig over HTTP rather than SSH. - User management will become easy, as this can be handled using any CMS. --nitesh -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Pig release 0.2.0
The Pig team is happy to announce that Pig 0.2.0 has been released. This release includes the addition of a type system, better error detection and handling, and a 5x performance improvement over 0.1.1. The details of the release can be found at http://hadoop.apache.org/pig/releases.html . Pig is a Hadoop subproject which provides a high-level data-flow language and execution framework for parallel computation on Hadoop clusters. More details about Pig can be found at http://hadoop.apache.org/pig/ .
Re: Ajax library for Pig
Would you want to contribute this to the Pig project or release it separately? Either way, keep us posted on your progress. It sounds interesting. Alan. On Apr 9, 2009, at 9:28 PM, nitesh bhatia wrote: Hi Thanks for the reply. This will be the architecture: 1. Pig would be installed on some dedicated server machine (say P) with hadoop support. 2. In front of it will be a web server (say S). 2.1 The web server will include a dedicated tomcat server (say St) for handling dwr servlets. 2.2 PigScript.js, the proposed javascript library. 2.3 If the user is using some server other than tomcat for the presentation layer (say httpd for php or IIS for asp.net), that server (say Su) will sit in front of St. - Connections between Su and St will be made through PigScript.js. - Connections between St and P will be made through dwr. - To get results back from the server, this system will use Reverse-Ajax calls (i.e., async calls from server to browser, a built-in feature of DWR). DWR is under Apache License V2. --nitesh -- Nitesh Bhatia Dhirubhai Ambani Institute of Information Communication Technology Gandhinagar Gujarat Life is never perfect. It just depends where you draw the line. visit: http://www.awaaaz.com - connecting through music http://www.volstreet.com - lets volunteer for better tomorrow http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Re: [Pig Wiki] Update of HowToContribute by AlanGates
At this point these are all proposed; none are yet realized, so there is no code for any of them. The place to track these proposals is in the referenced JIRAs. Alan. On Apr 15, 2009, at 6:44 PM, zhang jianfeng wrote: Hi Alan, Thank you for your guideline. So where's the code for these ProposedProjects? Are they in a different branch or in the trunk? How can I track the progress of these ProposedProjects? Thank you. On Thu, Apr 16, 2009 at 7:17 AM, Apache Wiki wikidi...@apache.org wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/HowToContribute

--
 * [http://www.apache.org/dev/contributors.html Apache contributor documentation]
 * [http://www.apache.org/foundation/voting.html Apache voting documentation]
+ == Picking Something to Work On ==
+ Looking for a place to start? A great first place is to peruse the
+ [https://issues.apache.org/jira/browse/PIG JIRA] and find an issue that needs
+ to be resolved. If you're looking for a bigger project, try ProposedProjects. This
+ gives a list of projects the Pig team would like to see worked on.
+
Re: [Pig Wiki] Update of ProposedProjects by AlanGates
Your understanding of the proposal is correct. The goal would be to produce Java code rather than a pipeline configuration. But the reasoning is not so that users can then take that code and modify it themselves. There's nothing preventing them from doing so, but it has a couple of major drawbacks. 1) Code generators generally generate horrific-looking code, because they are going for speed and compactness, not human maintainability. Trying to work in that code would be very difficult. 2) If you start adding code to generated code, you can no longer use the original Pig Latin. You are from that point forward stuck in Java, since you can't backport your Java into the Pig Latin. The proposal is designed to test the performance of Pig based on generated Java (or for that matter any other language, it need not be Java). For the idea you suggest, the NATIVE keyword (proposed here https://issues.apache.org/jira/browse/PIG-506) is a better solution. Alan. On Apr 16, 2009, at 12:54 AM, nitesh bhatia wrote: Hi Can you briefly explain what is required in the first project? After reading the description my impression is: currently, when we are executing commands on the Pig shell, Pig first converts them to map-reduce jobs and then feeds them to hadoop. In this project are we proposing that the execution plan made by Pig will first be converted to a java map-reduce file and then fed to the hadoop cluster? If this is the case then I am sure it will be a great help to users, as this functionality can be used to write complicated map-reduce jobs very easily. Initially the user can write the Pig scripts / commands required for his job and get the map-reduce java files. Then he can edit the map-reduce files to extend the functionality and add extra procedures that are not provided by Pig but can be executed over hadoop. --nitesh On Wed, Apr 15, 2009 at 9:57 PM, Apache Wiki wikidi...@apache.org wrote: Dear Wiki user, You have subscribed to a wiki page or wiki category on Pig Wiki for change notification. The following page has been changed by AlanGates: http://wiki.apache.org/pig/ProposedProjects New page: = Proposed Pig Projects = This page describes projects that we (the committers) would like to see added to Pig. The scale of these projects varies, but they are larger projects, usually on the weeks or months scale. We have not yet filed [https://issues.apache.org/jira/browse/PIG JIRAs] for some of these because they are still in the vague idea stage. As they become more concrete, [https://issues.apache.org/jira/browse/PIG JIRAs] will be filed for them. We welcome contributors to take on one of these projects. If you would like to do so, please file a JIRA (if one does not already exist for the project) with a proposed solution. Pig's committers will work with you from there to help refine your solution. Once a solution is agreed upon, you can begin implementation. If you see a project here that you would like to see Pig implement but you are not in a position to implement the solution right now, feel free to vote for the project. Add your name to the list of supporters. This will help contributors looking for a project to select one that will benefit many users. If you would like to propose a project for Pig, feel free to add to this list. If it is a smaller project, or something you plan to begin work on immediately, filing a [https://issues.apache.org/jira/browse/PIG JIRA] is a better route.
|| Category || Project || JIRA || Proposed By || Votes For ||
|| Execution || Pig currently executes scripts by building a pipeline of pre-built operators and running data through those operators in map reduce jobs. We need to investigate instead having Pig generate java code specific to a job, then compiling that code and using it to run the map reduce jobs. || || Many conference attendees || gates ||
|| Language || Currently only DISTINCT, ORDER BY, and FILTER are allowed inside FOREACH. All operators should be allowed in FOREACH. (LIMIT is being worked on in [https://issues.apache.org/jira/browse/PIG-741 741].) || || gates || ||
|| Optimization || Speed up comparison of tuples during shuffle for ORDER BY. || [https://issues.apache.org/jira/browse/PIG-659 659] || olgan || ||
|| Optimization || Order by should be changed to not use POPackage to put all of the tuples in a bag on the reduce side, as the bag is just immediately flattened. It can instead work like join does for the last input in the join. || || gates || ||
|| Optimization || Often in a Pig script that produces a chain of MR jobs, the map phases of the 2nd and subsequent jobs do very little. What little they do should be pushed into the preceding reduce and the map replaced by the identity mapper. Initial tests showed that the identity mapper was 50% faster than using a Pig mapper (because Pig uses the loader to parse out tuples
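[Editor's note: to illustrate the FOREACH-nesting entry above, a minimal sketch of the nesting Pig already allows; input path and field names are hypothetical:

A = load 'clicks' as (site, url);
B = group A by site;
C = foreach B {
    -- only DISTINCT, ORDER BY, and FILTER may appear in this block today
    uniq = distinct A.url;
    generate group, COUNT(uniq);
};]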
Re: A proposal for changing pig's memory management
The claims in the paper I was interested in were not issues like non-blocking I/O etc. The claim that is of interest to pig is that a memory allocation and garbage collection scheme that is beyond the control of the programmer is a bad fit for a large data processing system. This is a fundamental design choice in Java, and it fits the vast majority of Java's uses well. But for systems like Pig there seems to be no choice but to work around Java's memory management. I'll clarify this point in the document. I took a closer look at NIO. My concern is that it does not give the level of control I want. NIO allows you to force a buffer to disk and request that a buffer be loaded, but you cannot force a page out of memory. It doesn't even guarantee that after you load a page it will really be loaded. One of the biggest issues in pig right now is that we run out of memory or get the garbage collector into a situation where it can't make sufficient progress. Perhaps switching to large buffers instead of having many individual objects will address this. But I'm concerned that if we cannot explicitly force data out of memory onto disk then we'll be back in the same boat of trusting the Java memory manager. Alan. On May 14, 2009, at 7:43 PM, Ted Dunning wrote: That Telegraph dataflow paper is pretty long in the tooth. Certainly several of their claims have little force any more (lack of non-blocking I/O, poor thread performance, no unmap, very expensive synchronization for uncontested locks). It is worth noting that they did all of their tests on the 1.3 JVM, and things have come an enormous way since then. Certainly, it is worth having opaque containers based on byte arrays, but isn't that pretty much what the NIO byte buffers are there to provide? Wouldn't a virtual tuple type that was nothing more than a byte buffer, type and an offset do almost all of what is proposed here? On Thu, May 14, 2009 at 5:33 PM, Alan Gates ga...@yahoo-inc.com wrote: http://wiki.apache.org/pig/PigMemory Alan.
Re: A proposal for changing pig's memory management
On May 19, 2009, at 10:30 PM, Mridul Muralidharan wrote: I am still not very convinced about the value of this implementation - particularly considering the advances made since 1.3 in memory allocators and garbage collection. My fundamental concern is not with the slowness of garbage collection. I am asserting (along with the paper) that garbage collection is not an optimal choice for a large data processing system. I don't want to improve the garbage collector; I want to manage a subset of the memory without it. The side effects of this proposal are many, and sometimes non-obvious: implicitly moving young generation data into the older generation, causing much more memory pressure for gc; fragmentation of memory blocks, causing quite a bit of memory pressure; replicating quite a bit of garbage collection functionality; the possibility of bugs with ref counting; etc. I don't understand your concerns regarding the load on the gc and memory fragmentation. Let's say I have 10,000 tuples, each with 10 fields. Let's also assume that these tuples live long enough to make it into the old memory pool, since this is the interesting case where objects live long enough to cause a problem. In the current implementation there will be 110,000 objects that the gc has to manage moving into the old pool, and check every time it cleans the old pool. In the proposed implementation there would be 10,001 objects (assuming all the data fit into one buffer) to manage. And rather than allocating 100,000 small pieces of memory, we would have allocated one large segment. My belief is that this would lighten the load on the gc. This does replicate some of the functionality of the garbage collector. Complex systems frequently need to re-implement foundational functionality in order to optimize it for their needs. Hence many RDBMS engines have their own implementations of memory management, file I/O, thread scheduling, etc. As for bugs in ref counting, I agree that forgetting to deallocate is one of the most pernicious problems of allowing programmers to do memory management. But in this case all that will happen is that a buffer will get left around that isn't needed. If the system needs more memory then that buffer will eventually get selected for flushing to disk, and then it will stay there, as no one will call it back into memory. So the cost of forgetting to deallocate is minor. If the assumption is that the current working set of bags/tuples does not need to be spilled, and anything else can be, then this will pretty much deteriorate to the current implementation in the worst case. That is not the assumption. There are two issues: 1) trying to spill bags only when we determine we need to is highly error prone, because we can't accurately determine when we need to and because we sometimes can't dump fast enough to survive; 2) current memory usage is far too high and needs to be reduced. A much simpler method to gain benefits would be to handle primitives as ... primitives, and not through the java wrapper classes for them. It should be possible to write schema-aware tuples which make use of the primitives specified to take a fraction of the memory required (4 bytes + a null_check boolean for an int, plus offset mapping, instead of the 24/32 bytes it currently is, etc.). In my observation, at least 50% of the data in pig is untyped, which means it's a byte array. Of the 50% that people declare or is determined by the program, probably 50-80% are chararrays and maps. So that means that somewhere around 25% of the data is numeric.
Shrinking that 25% by 75% will be nice, but not adequate. And it does nothing to help with the issue of being able to spill in a controlled way instead of only in emergency situations. Alan.
Re: UDF with parameters?
Yes, it is possible. The UDF should take the percentage you want as a constructor argument. It will have to be passed as a string and converted. Then in your Pig Latin, you will use the DEFINE statement to pass the argument to the constructor.

REGISTER /src/myfunc.jar;
DEFINE percentile myfunc.percentile('90');
A = LOAD 'students' as (name, gpa);
B = FOREACH A GENERATE percentile(gpa);

See http://hadoop.apache.org/pig/docs/r0.2.0/piglatin.html#DEFINE for more details. Alan. On May 22, 2009, at 3:37 PM, Brian Long wrote: Hi, I'm interested in developing a PERCENTILE UDF, e.g. for calculating a median, 99th percentile, 90th percentile, etc. I'd like the UDF to be parametric with respect to the percentile being requested, but I don't see any way to do that, and it seems like I might need to create PERCENTILE_50, PERCENTILE_90, etc. type UDFs explicitly, versus being able to do something like GENERATE PERCENTILE(90, duration). I'm new to Pig, so I might be missing the way to do this... is it possible? Thanks, Brian
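[Editor's note: since a percentile is computed over a collection of values rather than one row at a time, the call would typically sit after a grouping step. A minimal sketch, not from the original thread, assuming the hypothetical myfunc.percentile UDF above is written to aggregate over a bag:

REGISTER /src/myfunc.jar;
DEFINE p50 myfunc.percentile('50');
DEFINE p99 myfunc.percentile('99');
A = LOAD 'students' as (name, gpa);
B = GROUP A ALL;                               -- gather every row into one bag
C = FOREACH B GENERATE p50(A.gpa), p99(A.gpa); -- one percentile per constructor argument
DUMP C;]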
Proposed design for new merge join in pig
http://wiki.apache.org/pig/PigMergeJoin Alan.
Updated PigMix numbers for latest top of trunk
http://wiki.apache.org/pig/PigMix Alan.
Re: PigPen Source
It has not yet been integrated into contrib because it requires the eclipse libraries to build, and those weren't integrated. The ivy stuff used by pig's build should be configured to pick up the appropriate eclipse jars so that this can be added to contrib. Alan. On Jun 15, 2009, at 12:09 PM, Russell Jurney wrote: I want to play with PigPen, but although I can find the patches here: https://issues.apache.org/jira/browse/PIG-366 on the Jira, I cannot find the source in trunk/contrib/pigpen, or in any path in any branch. Where does the PigPen source reside? Does it exist only as a patch? Russell Jurney rjur...@cloudstenography.com
Re: Rewire and multi-query load/store optimization
+1 on option one. The use of store-load was only to overcome a temporary problem in Pig. We've fixed the problem, so let's not propagate it. We will need to document this very clearly (maybe even to the point of issuing warnings in the parser when we see this combo) so users understand that this is now a hindrance rather than a help. Alan. On Jun 12, 2009, at 2:19 PM, Santhosh Srinivasan wrote: With the implementation of rewire as part of the optimizer infrastructure, a bug was exposed in the load/store optimization in the multi-query feature. Below, I will articulate the bug and the ramifications of a few possible solutions.

Load/store optimization in the multi-query feature
---
If a script has an explicit store and a corresponding load which loads the output of the store, the store-load combination can be optimized. An example will illustrate the concept. Pre-conditions: 1. The store location and the load location should match. 2. The store format and the load format should be compatible.

{code}
A = load 'input';
B = group A by $0;
store B into 'output';
C = load 'output';
D = group C by $0;
store D into 'some_other_output';
{code}

In the script above, the output of the first store serves as the input of the second load (C). In addition, the store and load use PigStorage() as the store/load mechanism. In the logical plan, this combination is optimized by splitting B into the store and D.

Bug
---
When the load in the store/load combination was removed, the inner plans of the load's successors (in this case D) were not updated correctly. As a result, the projections in the inner plans still held references to non-existent operators.

Consequence of the bug fix
---
During the map-reduce (M/R) compilation the split operator is compiled into a store and a load. Prior to multi-query, each M/R boundary resulted in a temporary store using BinStorage. The subsequent load could infer the type, as BinStorage returns typed records, i.e., non-bytearray records. With multi-query and the load/store optimization, the temporary BinStorage data is not generated. Instead, the subsequent load uses the output of the previous store as its input. Here, the load can produce typed or untyped records depending on the loader. As a result, the operators in the map phase that rely on the type information (inferred from the logical plan) will fail due to type mismatch.

Possible Solutions
---
Solution 1
==
Switch off the load/store optimization. Users were primarily storing intermediate data within the same script to overcome Pig's limitation, i.e., the absence of the multi-query feature. Going forward, with multi-query turned on, users who store intermediate data will not enjoy all the benefits of the optimization.

Solution 2
==
After the M/R compilation is completed, during the final pass of the plan, fix the types of the projections to reflect typed/untyped data. In other words, if the loader is returning typed data then retain the types, else change the types to bytearray. In order to make this decision, loaders should support an interface to indicate if the records are typed or untyped.

Thanks, Santhosh
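[Editor's note: to make the consequence of option one concrete for script writers, the example script above can simply keep using B instead of reloading its stored output; with multi-query, both stores then share B's computation in a single run. A minimal sketch, not from the original thread:

A = load 'input';
B = group A by $0;
store B into 'output';
D = group B by $0;   -- reuse B directly rather than C = load 'output';
store D into 'some_other_output';]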
Re: [VOTE] Release Pig 0.3.0 (candidate 0)
Downloaded, ran, ran the tutorial, built piggybank. All looks good. +1 Alan. On Jun 18, 2009, at 12:30 PM, Olga Natkovich wrote: Hi, I created a candidate build for the Pig 0.3.0 release. The main feature of this release is support for multiquery, which allows sharing computation across multiple queries within the same script. We see significant performance improvements (up to an order of magnitude) as the result of this optimization. I ran the rat report and made sure that all the source files contain proper headers. (Not attaching the report since it caused trouble with the last release.) Keys used to sign the release candidate are at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS. Please download and try the release candidate: http://people.apache.org/~olga/pig-0.3.0-candidate-0/. Please vote by Wednesday, June 24th. Olga
Re: asking for comments on benchmark queries
Zheng, I don't think you're subscribed to pig-dev (your emails have been bouncing to the moderator), so I've cc'd you explicitly on this. I don't think we need a Pig JIRA; it's probably easier if we all work on the hive one. I'll post my comments on the various scripts to that bug. I've also attached them here since pig-dev won't see the updates to that bug. Alan.

grep_select.pig: Adding types in the LOAD statement will force Pig to cast the key field, even though it doesn't need to (it only reads and writes the key field). So I'd change the query to be:

rmf output/PIG_bench/grep_select;
a = load '/data/grep/*' using PigStorage() as (key, field);
b = filter a by field matches '.*XYZ.*';
store b into 'output/PIG_bench/grep_select';

field will still be cast to a chararray for the matches, but we won't waste time casting key and then turning it back into bytes for the store.

rankings_select.pig: Same comment, remove the casts. pagerank will be properly cast to an integer.

rmf output/PIG_bench/rankings_select;
a = load '/data/rankings/*' using PigStorage('|') as (pagerank, pageurl, aveduration);
b = filter a by pagerank > 10;
store b into 'output/PIG_bench/rankings_select';

rankings_uservisits_join.pig: Here you want to keep the cast of pagerank so that it is handled as the right type. adRevenue will default to double in SUM when you don't specify a type. You also want to project out all unneeded columns as soon as possible. You should set PARALLEL on the join to use the number of reducers appropriate for your cluster. Given that you have 10 machines and 5 reduce slots per machine, and speculative execution is off, you probably want 50 reducers. I notice you set parallel to 60 on the group by. That will give you 10 trailing reducers. Unless you have a need for the result to be split 60 ways you should reduce that to 50 as well. (I'm assuming here when you say you have a 10 node cluster you mean 10 data nodes, not counting your name node and task tracker. The reduce formula should be 5 * number of data nodes.) A last question is how large are the uservisits and rankings data sets? If either is 80M or so you can use the fragment/replicate join, which is much faster than the general join. The following script assumes that isn't the case; but if it is, let me know and I can show you the syntax for it. So the end query looks like:

rmf output/PIG_bench/html_join;
a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP, destURL, visitDate, adRevenue, userAgent, countryCode, languageCode, searchWord, duration);
b = load '/data/rankings/*' using PigStorage('|') as (pagerank:int, pageurl, aveduration);
c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
c1 = foreach c generate sourceIP, destURL, adRevenue;
b1 = foreach b generate pagerank, pageurl;
d = JOIN c1 by destURL, b1 by pageurl parallel 50;
d1 = foreach d generate sourceIP, pagerank, adRevenue;
e = group d1 by sourceIP parallel 50;
f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
store f into 'output/PIG_bench/html_join';

uservisits_aggre.pig: Same comments as above on projecting out as early as possible and on setting parallel appropriately for your cluster.

rmf output/PIG_bench/uservisits_aggre;
a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP, destURL, visitDate, adRevenue, userAgent, countryCode, languageCode, searchWord, duration);
a1 = foreach a generate sourceIP, adRevenue;
b = group a1 by sourceIP parallel 50;
c = FOREACH b GENERATE group, SUM(a1.adRevenue);
store c into 'output/PIG_bench/uservisits_aggre';

On Jun 22, 2009, at 10:36 PM, Zheng Shao wrote: Hi Pig team, We'd like to get your feedback on a set of queries we implemented on Pig. We've attached the hadoop configuration and pig queries in the email. We start the queries by issuing "pig xxx.pig". The queries are from the SIGMOD 2009 paper. More details are at https://issues.apache.org/jira/browse/HIVE-396 (Shall we open a JIRA on PIG for this?) One improvement is that we are going to change hadoop to use LZO as the intermediate compression algorithm very soon. Previously we used gzip for all performance tests including hadoop, hive and pig. The reason that we specify the number of reducers in the query is to try to match the number of reducers Hive automatically suggested. Please let us know the best way to set the number of reducers in Pig. Are there any other improvements we can make to the Pig queries and the hadoop configuration? Thanks, Zheng hadoop-site.xml hive-default.xml hadoop-env.sh.txt
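[Editor's note: for reference, the fragment/replicate join Alan offers to show has dedicated syntax in Pig. A minimal sketch, not from the original thread, assuming the rankings input (b1) is the small (~80M) data set; the last-listed input is the one replicated into memory on every map task, and the exact quoting of the keyword varies slightly across early releases:

-- same aliases as in rankings_uservisits_join.pig above
d = JOIN c1 by destURL, b1 by pageurl USING 'replicated';
d1 = foreach d generate sourceIP, pagerank, adRevenue;]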
Re: requirements for Pig 1.0?
Integration with Owl is something we want for 1.0. I am hopeful that by Pig's 1.0 Owl will have flown the coop and become either a subproject or found a home in Hadoop's common, since it will hopefully be used by multiple other subprojects. Alan. On Jun 23, 2009, at 11:42 PM, Russell Jurney wrote: For 1.0 - complete Owl? http://wiki.apache.org/pig/Metadata Russell Jurney rjur...@cloudstenography.com On Jun 23, 2009, at 4:40 PM, Alan Gates wrote: I don't believe there's a solid list of want to haves for 1.0. The big issue I see is that there are too many interfaces that are still shifting, such as: 1) Data input/output formats. The way we do slicing (that is, user provided InputFormats) and the equivalent outputs aren't yet solid. They are still too tied to load and store functions. We need to break those out and understand how they will be expressed in the language. Related to this is the semantics of how Pig interacts with non-file based inputs and outputs. We have a suggestion of moving to URLs, but we haven't finished test driving this to see if it will really be what we want. 2) The memory model. While technically the choices we make on how to represent things in memory are internal, the reality is that these changes may affect the way we read and write tuples and bags, which in turn may affect our load, store, eval, and filter functions. 3) SQL. We're working on introducing SQL soon, and it will take it a few releases to be fully baked. 4) Much better error messages. In 0.2 our error messages made a leap forward, but before we can claim to be 1.0 I think they need to make 2 more leaps: 1) they need to be written in a way end users can understand them instead of in a way engineers can understand them, including having sufficient error documentation with suggested courses of action, etc.; 2) they need to be much better at tying errors back to where they happened in the script, right now if one of the MR jobs associated with a Pig Latin script fails there is no way to know what part of the script it is associated with. There are probably others, but those are the ones I can think of off the top of my head. The summary from my viewpoint is we still have several 0.x releases before we're ready to consider 1.0. It would be nice to be 1.0 not too long after Hadoop is, which still gives us at least 6-9 months. Alan. On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote: I know there was some discussion of making the types release (0.2) a Pig 1 release, but that got nixed. There wasn't a similar discussion on 0.3. Has the list of want-to-haves for Pig 1.0 been discussed since?
Re: requirements for Pig 1.0?
To be clear, going to 1.0 is not about having a certain set of features. It is about stability and usability. When a project declares itself 1.0 it is making some guarantees regarding the stability of its interfaces (in Pig's case this is Pig Latin, UDFs, and command line usage). It is also declaring itself ready for the world at large, not just the brave and the free. New features can come in as experimental once we're 1.0, but the semantics of the language and UDFs can't be shifting (as we've done the last several releases and will continue to do for a bit I think). With that in mind, further comments inlined. On Jun 24, 2009, at 10:18 AM, Dmitriy Ryaboy wrote: Alan, any thoughts on performance baselines and benchmarks? Meaning do we need to reach a certain speed before 1.0? I don't think so. Pig is fast enough now that many people find it useful. We want to continue working to shrink the gap between Pig and MR, but I don't see this as a blocker for 1.0. I am a little surprised that you think SQL is a requirement for 1.0, since it's essentially an overlay, not core functionality. If we were debating today whether to go 1.0, I agree that we would not wait for SQL. But given that we aren't (at least I wouldn't vote for it now) and that SQL will be in soon, it will need to stabilize. What about the storage layer rewrite (or is that what you referred to with your first bullet-point)? To be clear, Zebra (the columnar store stuff) is not a rewrite of the storage layer. It is an additional storage option we want to support. We aren't changing current support for load and store. Also, the subject of making more (or all) operators nestable within a foreach comes up now and then.. would you consider this important for 1.0, or something that can wait? This would be an added feature, not a semantic change in Pig Latin. Integration with other languages (a la PyPig)? Again, this is a new feature, not a stability issue. The Roadmap on the Wiki is still as of Q3 2007, which makes it hard for an outside contributor to know where to jump in :-). Agreed. Olga has given me the task of updating this soon. I'm going to try to get to that over the next couple of weeks. This discussion will certainly provide input to that update. Alan.
Re: Is it a bug ?
It looks wrong to me, but I don't have a deep understanding of that code. Alan. On Jul 15, 2009, at 6:03 PM, zhang jianfeng wrote: Hi all, Today, when I read the source code, I found a piece of suspicious code (PigServer.java line 1047):

graph.ignoreNumStores = processedStores; // I think here should be: graph.ignoreNumStores = ignoreNumStores
graph.processedStores = processedStores;
graph.fileNameMap = fileNameMap;

I think this may be a typing mistake. Can anyone confirm it? Thank you. Jeff Zhang
Re: Pig 0.4.0 release
On Aug 18, 2009, at 10:05 AM, Dmitriy Ryaboy wrote: I am about to submit a cleaned up patch for 924. It works fine as a static patch (in fact I can attach it to 660 as well) -- compiling with -Dhadoop.version=XX works as proposed for the static shims. It does the necessary prep for the code to be able to switch based on what's in its classpath, but it does not require unbundling to work statically. Ok, we'll take a look. The hadoop20 jar attached to the zebra ticket is built in a different way than 18 and 19; it does not report its version (18 and 19 do). Right now I get around it by hard-coding a special case (Unknown = 20), but that's obviously suboptimal. Could someone rebuild hadoop20.jar the way Pig wants it, and with the proper version identification? If that happens, 924/660 can go in together with hadoop20.jar and users will at least be able to build against a static version of hadoop without requiring a patch. The hadoop 0.20 jar submitted with Zebra is not a standard jar. It has extra tfile functionality that was not in 0.20, but will be in 0.20.1. It isn't something we should publish. If we put a hadoop20.jar into pig's lib, it should be from 0.20 (or when available, 0.20.1). Alan. -Dmitriy On Tue, Aug 18, 2009 at 9:56 AM, Alan Gates ga...@yahoo-inc.com wrote: Non-committers certainly get a vote, it just isn't binding. I agree on PIG-925 as a blocker. I don't see PIG-859 as a blocker since there is a simple work around. If we want to release 0.4.0 within a week or so, dynamic shims won't be an option because we won't be able to solve the bundled hadoop lib problem in that amount of time. I agree that we are not making life easy enough for users who want to build with hadoop 0.20. Based on comments on the JIRA, I'm not sure the patch for the static shims is ready. What if instead we checked in a version of hadoop20.jar that will work for users who want to build with 0.20. This way users can still build this if they want and our release isn't blocked on the patch. Alan. On Aug 17, 2009, at 12:03 PM, Dmitriy Ryaboy wrote: Olga, Do non-commiters get a vote? Zebra is in trunk, but relies on 0.20, which is somewhat inconsistent even if it's in contrib/ Would love to see dynamic (or at least static) shims incorporated into the 0.4 release (see PIG-660, PIG-924) There are a couple of bugs still outstanding that I think would need to get fixed before a release: https://issues.apache.org/jira/browse/PIG-859 https://issues.apache.org/jira/browse/PIG-925 I think all of these can be solved within a week; assuming we are talking about a release after these go into trunk, +1. -D On Mon, Aug 17, 2009 at 11:46 AM, Olga Natkovich ol...@yahoo-inc.com wrote: Pig Developers, We have made several significant performance and other improvements over the last couple of months: (1) Added an optimizer with several rules (2) Introduced skew and merge joins (3) Cleaned COUNT and AVG semantics I think it is time for another release to make this functionality available to users. I propose that Pig 0.4.0 is released against Hadoop 18 since most users are still using this version. Once Hadoop 20.1 is released, we will roll Pig 0.5.0 based on Hadoop 20. Please, vote on the proposal by Thursday. Olga
Re: questions about integration of pig and HBase
See the JIRA PIG-6. See also the HbaseStorage unit test that tests the functionality. Alan. On Sep 9, 2009, at 5:31 AM, Vincent BARAT wrote: Thank you for the link. Anyway, what I was looking for is an example of Pig syntax for loading from an HBase table; is it something like: queries = LOAD 'HBaseTable' USING HBaseStorage(); ? Jeff Zhang wrote: Use HBaseStorage as your loadFunc; it uses a custom slicer, HBaseSlice. You can refer to this link for more information: http://hadoop.apache.org/pig/docs/r0.3.0/udf.html#Custom+Slicer 2009/9/9 Vincent BARAT vincent.ba...@ubikod.com Alan Gates wrote: Pig supports reading from Hbase (in Hadoop/Hbase 0.18 only). Hello, Do you have any link to the documentation about how to do that? I can't find any example... Thanks,
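[Editor's note: a minimal sketch of what such a load looks like, not from the original thread; the table name webcrawl and column names are hypothetical, and the exact argument format and any table-name prefix are version-dependent — the unit test Alan mentions is the authoritative example:

raw = LOAD 'webcrawl' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('content:url content:size') AS (url, size);
-- from here raw behaves like any other relation
queries = FILTER raw BY url matches '.*search.*';]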
Re: Request for feedback: cost-based optimizer
This is a good start at adding a cost-based optimizer to Pig. I have a number of comments: 1) Your argument for putting it in the physical layer rather than the logical is that the logical layer does not know physical statistics. This need not be true. You suggest adding a getStatistics call to the loader to give statistics. The logical layer can make this call and make decisions based on the results without understanding the underlying physical layer. It seems that the real reason you want to put the optimizer in the physical layer is that, rather than trying to do predictive statistics (such as guessing that a join will result in a 2x data explosion), you want to see the results of actual MR jobs and then make decisions. This seems like a reasonable choice for a couple of reasons: a) statistical guesses are hard to get right, and Pig has limited statistics to begin with; b) since Pig Latin scripts can be arbitrarily long, bad guesses at the beginning will have a worse ripple effect than bad guesses in a SQL optimizer. 2) The changes you propose in PigServer are quite complex. Would it be possible instead to put the changes in MapReduceLauncher? It could run the first MR job in a Pig Latin script, look at the results, and then rerun your CBO on the remaining physical plan, re-translate this to a new MR plan, and resubmit. This would require annotations to the MR plan to indicate where in a physical plan the MR boundaries fall, so that correct portions of the original physical plan could be used for reoptimization and recompilation. But it would contain the complexity of your changes in MapReduceLauncher instead of scattering them through the entire system. 3) On adding getStatistics: I am currently working on a proposal to make a number of changes to the load interface, including getStatistics. I hope to publish that proposal by next week. Similarly, I am working on a proposal for how Pig will interact with metadata systems (such as Owl), which I also hope to propose next week. We will be actively working in these areas because we need them for our SQL implementation. So, one, you'll get a lot of this for free; two, we should stay connected on these things so what we implement works for what you need. Alan. On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote: Whoops :-) Here's the Google doc: http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdAhl=en -Dmitriy On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan s...@yahoo-inc.com wrote: Dmitriy and Gang, The mailing list does not allow attachments. Can you post it on a website and just send the URL? Thanks, Santhosh -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Tuesday, September 01, 2009 9:48 AM To: pig-dev@hadoop.apache.org Subject: Request for feedback: cost-based optimizer Hi everyone, Attached is a (very) preliminary document outlining a rough design we are proposing for a cost-based optimizer for Pig. This is being done as a capstone project by three CMU Master's students (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not necessarily meant for immediate incorporation into the Pig codebase, although it would be nice if it, or parts of it, are found to be useful in the mainline. We would love to get some feedback from the developer community regarding the ideas expressed in the document, any concerns about the design, suggestions for improvement, etc. Thanks, Dmitriy, Ashutosh, Tejal
Re: [VOTE] Release Pig 0.4.0 (candidate 0)
When I run this against a Hadoop 0.18.3 instance I can do DFS operations, but MR operations fail with:

Error message from job controller - java.lang.AbstractMethodError: org.apache.xerces.dom.DocumentImpl.getXmlStandalone()Z
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.setDocumentInfo(DOM2TO.java:373)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:127)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:94)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:662)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:708)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:313)
        at org.apache.hadoop.conf.Configuration.write(Configuration.java:994)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:780)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)

Pig Stack Trace
---------------
ERROR 6015: During execution, encountered a Hadoop error.

org.apache.pig.backend.executionengine.ExecException: ERROR 6015: During execution, encountered a Hadoop error.
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.setDocumentInfo(DOM2TO.java:373)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:127)
        at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:94)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:662)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:708)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:313)
        at org.apache.hadoop.conf.Configuration.write(Configuration.java:994)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:780)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:370)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
Caused by: java.lang.AbstractMethodError: org.apache.xerces.dom.DocumentImpl.getXmlStandalone()Z
        ... 11 more
================================================================

This doesn't look good. Alan.

On Sep 14, 2009, at 2:05 PM, Olga Natkovich wrote: Hi, I created a candidate build for the Pig 0.4.0 release. The highlights of this release are:
- Performance improvements, especially in the area of JOIN support, where we introduced two new join types: skew join to deal with data skew and sort merge join to take advantage of sorted data sets.
- Support for outer join.
- Works with Hadoop 18.
I ran the release audit and the rat report looked fine. The relevant part is attached below. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download the release and try it out: http://people.apache.org/~olga/pig-0.4.0-candidate-0. Should we release this? Vote closes on Thursday, 9/17. Olga

[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/contrib/zebra/CHANGES.txt
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/broken-links.xml
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/cookbook.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/index.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/linkmap.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_reference.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/piglatin_users.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/setup.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/tutorial.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/udf.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/api/package-list
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/changes.html
[java] !? /home/olgan/src/pig-apache/trunk/build/pig-0.5.0-dev/docs/jdiff/missingSinces.txt
[java] !?
Re: [VOTE] Release Pig 0.4.0 (candidate 0)
When I run it as:

java -cp ./pig.jar:/home/y/conf/pig/piglet/released -Dhod.server= org.apache.pig.Main /d1/pig_harness/out/pigtest/gates/gates.1253134669/Checkin_2.pig

it works. When I run it as:

JAVA_HOME=/usr PIG_CONF_DIR=/home/y/conf/pig/piglet/released/ bin/pig ~/pig/scripts/Checkin_2.pig

it fails with the stack trace given earlier. Alan.

On Sep 16, 2009, at 12:46 PM, Olga Natkovich wrote: Alan, I tried the jar packaged in the release and I am able to successfully run tests. Could you give it another try? Thanks, Olga

-----Original Message----- From: Alan Gates [mailto:ga...@yahoo-inc.com] Sent: Wednesday, September 16, 2009 9:53 AM To: pig-dev@hadoop.apache.org Cc: priv...@hadoop.apache.org Subject: Re: [VOTE] Release Pig 0.4.0 (candidate 0)

When I run this against a Hadoop 0.18.3 instance I can do DFS operations, but MR operations fail with: [stack trace, candidate-0 announcement, and rat report quoted in full in the previous message]
Re: [VOTE] Release Pig 0.4.0 (candidate 1)
Now the code won't build because there's no Hadoop jar in the lib directory. Alan.

On Sep 17, 2009, at 12:09 PM, Olga Natkovich wrote: Hi, I have fixed the issue causing the failure that Alan reported. Please test the new release: http://people.apache.org/~olga/pig-0.4.0-candidate-1/. Vote closes on Tuesday, 9/22. Olga

-----Original Message----- From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Monday, September 14, 2009 2:06 PM To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org Subject: [VOTE] Release Pig 0.4.0 (candidate 0) [candidate-0 announcement and rat report quoted in full in the earlier messages]
Re: Revisit Pig Philosophy?
I agree with Milind that we should move to saying that Pig Latin is a data flow language independent of any particular platform, while the current implementation of Pig is tied to Hadoop. I'm not sure how thin that implementation will be, but I'm in favor of making it thin where possible (such as the recent proposal to shift LoadFunc to directly use InputFormat). I also strongly agree that we need to be more precise in our terminology between Pig (the platform) and Pig Latin (the language), especially as we're working on making Pig bilingual (with the addition of SQL). I am fine with saying that Pig SQL adheres as much as possible (given the underlying systems, etc.) to ANSI SQL semantics. And where there is shared functionality, such as UDFs, we again adhere to SQL semantics when it does not conflict with other Pig goals. So COUNT and SUM should handle nulls the way SQL does, for example. But we need to craft the statement carefully. To see why, consider Pig's data model. We would like our types to map nicely into SQL types, so that if Pig SQL users declare a column to be of type VARCHAR(32) or FLOAT(10) we can map those onto some Pig type. But we don't want to use SQL types directly inside Pig, as they aren't a good match for much of Pig processing. So any statement about using SQL semantics needs caveats.

I would also vote for modifying our Pigs Live Anywhere dictum to be: Pig Latin is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. The initial implementation of Pig is on Hadoop and seeks to leverage the power of Hadoop wherever possible. However, nothing Hadoop specific should be exposed in Pig Latin. We may also want to add a vocabulary section to the philosophy statement to clarify between Pig and Pig Latin. Alan.

On Sep 18, 2009, at 8:01 PM, Milind A Bhandarkar wrote: It's Friday evening, so I have some time to discuss philosophy ;-) Before we discuss any question about revisiting pig philosophy, the first question that needs to be answered is: what is pig? (This corresponds to the Hindu philosophy's basic argument that any deep personal philosophical investigation needs to start with the question koham? (in Sanskrit, it means 'who am I?')) So, coming back approximately 4000 years after the origin of that philosophy, we need to ask: what is pig? (Incidentally, pig, or varaaha in Sanskrit, was the second incarnation of lord Vishnu in hindu scriptures, but that's not relevant here.) What we need to decide is: is pig a dataflow language? I think not. Pig Latin is the language. Pig is referred to in countless slide decks (aka pig scriptures; btw, I own 50% of these scriptures) as a runtime system that interprets Pig Latin, kind of like java and the jvm. (Duality of nature, called dwaita philosophy in Sanskrit, is applicable here. But I won't go deeper than that.) So pig-Latin-the-language's stance could still be that it could be implemented on any runtime. But pig-the-runtime's philosophy could be that it is a thin layer on top of hadoop. And all the world could breathe a sigh of relief (mostly by not having to answer these philosophical questions). So, 'koham' is the 4000 year old question this project needs to answer. That's all. AUM... (it's Friday.) - (swami) Milind ;-)

On Sep 18, 2009, at 19:05, Jeff Hammerbacher ham...@cloudera.com wrote: Hey, 2. Local mode and other parallel frameworks

<snip> Pigs Live Anywhere: Pig is intended to be a language for parallel data processing. It is not tied to one particular parallel framework. It has been implemented first on hadoop, but we do not intend that to be only on hadoop. </snip>

Are we still holding onto this? What about local mode? Local mode is not being treated on equal footing with that of Hadoop for practical reasons. However, users expect things that work in local mode to work without any hitches on Hadoop. Are we still designing the system assuming that Pig will be stacked on top of other parallel frameworks? FWIW, I appreciate this philosophical stance from Pig. Allowing locally tested scripts to be migrated to the cluster without breakage is a noble goal, and keeping the option of (one day) developing an alternative execution environment for Pig that runs over HDFS but uses a richer physical set of operators than MapReduce would be great. Of course, those of you who are running Pig in production will have a much better sense of the feasibility, rather than desirability, of this philosophical stance. Later, Jeff
Re: [VOTE] Release Pig 0.4.0 (candidate 2)
private is the PMC list. Releases need PMC votes, hence we send to private. Alan.

On Sep 21, 2009, at 7:46 PM, Milind A Bhandarkar wrote: Unrelated to the message content: why is there a priv...@hadoop.apache.org on the cc here? Is this even a valid alias? An open source project needs to conduct its discussions in public, so even an email address named private makes me very nervous about the development process. - Milind

On Sep 21, 2009, at 18:56, Olga Natkovich ol...@yahoo-inc.com wrote: Hi, The new version is available at http://people.apache.org/~olga/pig-0.4.0-candidate-2/. I see one failure in a unit test in piggybank (contrib), but it is not related to the functions themselves; it seems to be an issue with MiniCluster and I don't feel we need to chase this down. I made sure that the same test runs ok with Hadoop 20. Please vote by end of day on Thursday, 9/24. Olga

-----Original Message----- From: Olga Natkovich [mailto:ol...@yahoo-inc.com] Sent: Thursday, September 17, 2009 12:09 PM To: pig-dev@hadoop.apache.org; priv...@hadoop.apache.org Subject: [VOTE] Release Pig 0.4.0 (candidate 1) [earlier candidate announcements and rat report quoted in full in the messages above]
Re: High(er) res Pig logo?
I have a couple of higher resolution pigs in overalls and a pig on the Hadoop elephant. I've checked them into src/docs/src/documentation/resources/images/ so all can use them. Also, we're working on cleaning up the Pig with Y! logo issue. Alan. On Sep 27, 2009, at 9:59 AM, Dmitriy Ryaboy wrote: Where can one find the Pig logo in a size/resolution suitable for presentations? Also, I went on the website and noticed that the Y! reappeared on Pig's chest. -D
Re: LocalRearrange out of bounds exception - tips for debugging?
Have you checked that each record in your input data has at least the number of fields you specify? Have you checked that the field separator in your data matches the default for PigPerformanceLoader (^A, I think)? Alan.

On Oct 13, 2009, at 10:28 AM, Dmitriy Ryaboy wrote: We ran into what looks like an edge-case bug in Pig, which causes it to throw an IndexOutOfBoundsException (stack trace below). The script just joins two relations; it looks like our data was generated incorrectly and the join is empty, which may be what's causing the failure. It also appears to only happen when at least one of the inputs is on the large side (at least a few hundred megs). Any ideas on what could be happening and how to zoom in on the underlying cause? We are running off unmodified trunk.

Script:

register datagen.jar;
E = load 'Employee' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (id, name, cc, dc);
D = load 'Department' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (dept_id, dept_nm);
P = load 'Project' using org.apache.pig.test.utils.datagen.PigPerformanceLoader() as (id, emp_id, role);
R1 = JOIN E by dc, D by dept_id;
R2 = JOIN R1 by E::id, P by emp_id;
store R2 into 'TestCase2Output';

The R2 join fails with the stack trace below. It also fails if we pre-calculate R1, store it, and load it directly (so: load R1, load P, join R1 by $0, P by emp_id). We've verified that the records in R1 and R2 have the expected fields, etc.

Stack Trace:

java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:143)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:148)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:226)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:260)
        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POUnion.getNext(POUnion.java:162)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
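Alan's first check is easy to automate outside of Pig. Here is a minimal standalone Java sketch (not part of Pig; the class name is made up, and it assumes PigPerformanceLoader's default Ctrl-A field separator) that flags records with too few fields:

import java.io.BufferedReader;
import java.io.FileReader;

// Scans a local copy of the input and flags records that have fewer
// ^A-separated fields than expected.
// Usage: java FieldCountCheck <file> <expectedFields>
public class FieldCountCheck {
    public static void main(String[] args) throws Exception {
        int expected = Integer.parseInt(args[1]);
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        long lineNo = 0;
        while ((line = in.readLine()) != null) {
            lineNo++;
            // Split on Ctrl-A (\u0001); limit -1 keeps trailing empty fields.
            int fields = line.split("\u0001", -1).length;
            if (fields < expected) {
                System.out.println("line " + lineNo + ": only " + fields + " fields");
            }
        }
        in.close();
    }
}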
Hudson testing of patches
We've had many questions on this, so I'm sending this to everyone on the dev list in hopes of clarifying the situation. Our Hudson setup for testing patches is falsely returning failures on all or most unit tests for all patches. So if you submit a patch and all the unit tests fail, don't worry. We are working on getting Hudson fixed. We committers are working through the patch queue manually, running the unit tests ourselves. As we don't work all night like Hudson and each run of the unit tests takes about 3 hours, this is going slowly. But please know we will get to your patches, even if it takes us a day or two. Alan.
Re: [VOTE] Release Pig 0.5.0 (candidate 0)
+1. On my laptop (Mac) I ran the tutorial in both local and hadoop modes, ran a join/group/sort/limit script in both local and hadoop modes, and did a build of Pig and contrib. On a Linux box I did a build of both Pig and contrib, and ran a join/group/sort/limit script in both local and hadoop modes. Alan.

On Oct 25, 2009, at 1:17 PM, Olga Natkovich wrote: Hi, I created a candidate build for Pig 0.5.0 release. It contains the same functionality as Pig 0.4.0 except it works with Hadoop 20.x releases. I ran the release audit and the rat report looked fine. The relevant part is attached below. Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup. Please download the release and try it out: http://people.apache.org/~olga/pig-0.5.0-candidate-0. Should we release this? Vote closes on Thursday, 10/29. Olga

[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/src/org/apache/pig/StoreConfig.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/src/org/apache/pig/impl/logicalLayer/schema/SchemaUtil.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/TestDataBagAccess.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/TestNullConstant.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/TestSchemaUtil.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParser.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParserConstants.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParserTokenManager.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/DOTParserTreeConstants.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/JJTDOTParserState.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/ParseException.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/SimpleCharStream.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/Token.java
[java] !? /home/olgan/src/pig-apache/branch-0.5/build/pig-0.5.0-dev/test/org/apache/pig/test/utils/dotGraph/parser/TokenMgrError.java
Re: LoadFunc.skipNext() function for faster sampling ?
We definitely want to avoid parsing every tuple when sampling. But do we need to implement a special function for it? Pig will have access to the InputFormat instance, correct? Can it not call next on the underlying record reader the desired number of times (which will not parse the tuple) and then call LoadFunc.getNext to get the next parsed tuple? Alan.

On Nov 3, 2009, at 4:28 PM, Thejas Nair wrote: In the new implementation of the SampleLoader subclasses (used by order-by, skew-join, ...) as part of the loader redesign, we are not only reading all the input records but also parsing them as Pig tuples. This is because the SampleLoaders are wrappers around the actual input loaders specified in the query. We can make things much faster by having a skipNext() function (or skipNext(int numSkip)) which will avoid parsing the record into a Pig tuple. LoadFunc could optionally implement this (easy to implement) function (which would be part of an interface) to improve the speed of queries such as order-by. -Thejas
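To make the proposal concrete, here is a minimal sketch of what such an optional interface might look like. The names (SkippableLoadFunc, skipNext) are illustrative only, not an agreed-on API:

import java.io.IOException;

// Hypothetical optional interface a LoadFunc implementation could add
// so that samplers can advance past records without materializing them
// as Pig tuples.
public interface SkippableLoadFunc {
    /**
     * Skip the next numSkip records without constructing tuples.
     * Returns the number of records actually skipped, which may be less
     * than numSkip if the end of the input is reached.
     */
    int skipNext(int numSkip) throws IOException;
}

A SampleLoader wrapper could then test its wrapped loader with instanceof, call skipNext(k - 1) between samples when the interface is present, and fall back to plain getNext() calls (discarding the parsed tuples) when it is not.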
Re: [VOTE] Branch for Pig 0.6.0 release
+1. In addition to the new features we've added, our change to use Hadoop's LineRecordReader brought Pig to parity with Hadoop in the PigMix tests, about a 30% average performance improvement. This should be huge for our users. Alan. On Nov 9, 2009, at 12:26 PM, Olga Natkovich wrote: Hi, I would like to propose to branch for Pig 0.6.0 release with the intent to have a release before the end of the year. We have done a lot of work since branching for Pig 0.5.0 that we would like to share with users. This includes changing how bags are spilled onto disk (PIG-975, PIG-1037), skewed and fragment-replicated outer join plus many other performance improvements and bug fixes. Please vote by Thursday. Thanks, Olga
Re: package org.apache.hadoop.zebra.parse missing
The parser package is generated as part of the build. Invoking ant in the contrib/zebra directory should result in the parser package being created at ./src-gen/org/apache/hadoop/zebra/parser. Alan.

On Nov 11, 2009, at 12:54 AM, Min Zhou wrote: Hi guys, I checked out pig from trunk, and found the package org.apache.hadoop.zebra.parse missing. Can you confirm this package has been committed? See this link: http://svn.apache.org/repos/asf/hadoop/pig/trunk/contrib/zebra/src/java/org/apache/hadoop/zebra/ Min -- My research interests are distributed systems, parallel computing and bytecode based virtual machines. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com
Re: FYI - forking TFile off Hadoop into Zebra
On Nov 11, 2009, at 4:13 PM, Ashutosh Chauhan wrote: On Wed, Nov 11, 2009 at 18:26, Chao Wang ch...@yahoo-inc.com wrote: Last, we would like to point out that this is a short term solution for Zebra and we plan to: 1) port all changes to Zebra TFile back into Hadoop TFile, and 2) in the long run have a single unified solution for this. Just for clarity: in the long run, as Zebra stabilizes and Pig adopts hadoop-0.22, Zebra will get rid of this fork? I think the promise is they'll get rid of the fork at some point, not necessarily at 0.22 though. Alan. Ashutosh
Re: optimizer hints in Pig
In general I think optimizer hints fit well with Pig's approach to data processing, as expressed in our philosophic statement that Pigs are domestic animals (see http://hadoop.apache.org/pig/philosophy.html). At least in the examples you give, I don't see 'with' as binding. The user is giving Pig information; it can choose how to use it, or not to use it at all. I would like 'using' to continue to be binding, as in that case the user is explicitly telling Pig to do something in a particular way. Alan.

On Nov 14, 2009, at 2:07 PM, Ashutosh Chauhan wrote: Hi All, We would like to know what Pig devs feel about optimizer hints. Traditionally, optimizer hints have been received with mixed reactions in the RDBMS world. Oracle provides lots of knobs[1][2] to turn and tune, while postgres[3][4] has tried to stay away from them. Mysql has a few of them (e.g., straight_join). Surajit Chaudhuri [5] (Microsoft) makes a case in favor of them. More specifically, I am talking of hints like the following:

a = filter 'mydata' by myudf($1) with selectivity 0.5;
(This lets the user tell Pig that myudf filters out nearly half of the tuples of 'mydata'.)

c = join a by $0, b by $0 with selectivity a.$0 = b.$0, 0.1;
(This lets the user tell Pig that only 10% of the keys in a will match those in b.)

The exact syntax isn't important; it could be adapted. But the question is: does this seem a useful enough idea to be added to Pig Latin? Pig's case is slightly different from other SQL engines in that while other systems treat them as hints and thus are free to ignore them, Pig treats hints as commands, in the sense that it will go ahead and fail even if it can figure out that the hint will result in failure of the query. Perhaps Pig can interpret using as a command and with as a hint. Thoughts? Ashutosh

[1] http://www.dba-oracle.com/art_otn_cbo_p7.htm
[2] http://www.dba-oracle.com/oracle11g/oracle_11g_extended_optimizer_statistics.htm
[3] http://archives.postgresql.org/pgsql-hackers/2006-10/msg00663.php
[4] http://archives.postgresql.org/pgsql-hackers/2006-08/msg00506.php
[5] portal.acm.org/ft_gateway.cfm?id=1559955&type=pdf
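As a rough illustration of how such a hint could feed an optimizer's decision, here is a hypothetical Java sketch. The names, the threshold, and the decision rule are all made up (as noted in an earlier thread, Pig has no costing today); the point is only that a user-supplied selectivity turns directly into a cardinality estimate:

// Hypothetical: use a user-supplied selectivity hint to estimate the
// output size of a filter and decide whether the result is small enough
// for a fragment-replicate (FR) join.
public class SelectivityDemo {
    // Assumed size limit for the replicated side of an FR join.
    static final long FR_JOIN_THRESHOLD_BYTES = 100L * 1024 * 1024;

    static long estimateFilterOutput(long inputBytes, double selectivity) {
        return (long) (inputBytes * selectivity);
    }

    public static void main(String[] args) {
        long input = 2L * 1024 * 1024 * 1024;  // a 2 GB relation
        double hint = 0.01;                    // "with selectivity 0.01"
        long estimated = estimateFilterOutput(input, hint);
        System.out.println(estimated <= FR_JOIN_THRESHOLD_BYTES
            ? "estimated " + estimated + " bytes: FR join is a candidate"
            : "estimated " + estimated + " bytes: use a regular hash join");
    }
}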
Welcome Jeff Zhang
All, I would like to welcome Jeff Zhang as our newest Pig committer. Jeff has been contributing to Pig for about nine months now. He's been active on the mailing lists, in contributing patches, and in helping other users with their patches. Congratulations Jeff, and thanks for your contributions to Pig. Alan.
Yahoo is hiring for Hadoop development
All, Yahoo has a number of Hadoop development positions open. There are engineering, architect, management, and QA positions all open. See http://developer.yahoo.net/blogs/hadoop/2009/11/updated_do_you_have_what_it_ta.html for details. Alan.
Re: TPC-H benchmark
I don't know of any. Officially Pig cannot publish a TPC-H number because it is not a transaction based store. But I still think it would be very interesting to see the results if someone took the time to translate the queries. Alan. On Nov 22, 2009, at 6:20 PM, RichardGUO Fei wrote: Hi, Apart from Pig Performance and Pig Mix, do you know any TPC-H benchmark rewritten for Pig? Thanks, Richard
Re: Why we name it zebra ?
On Nov 26, 2009, at 7:39 AM, Jeff Zhang wrote: Hi all, I'd like to know where the name zebra comes from. Does it convey the idea that this metadata system's columnar storage format is like the stripes on a zebra's skin? Pretty much, yes. We've fallen into the habit of giving animal names to projects. We discussed several animals but zebra won. Alan. Thank you, Jeff Zhang
Re: Pig reading hive columnar rc tables
On Nov 30, 2009, at 12:18 PM, Dmitriy Ryaboy wrote: That's awesome, I've been itching to do that but never got around to it. Gerrit, do you have any benchmarks on read speeds? I don't know about putting this in piggybank, as it carries with it pretty significant dependencies, increasing the size of the jar and making it difficult for users who don't need it to build piggybank in the first place. We might want to consider some other contrib for it -- maybe a misc contrib that would have individual ant targets for these kinds of compatibility submissions?

Does it have to increase the size of the piggybank jar? Instead of including Hive in our piggybank jar, which I agree would be bad, can we just say that if you want to use this function you need to provide the appropriate Hive jar yourself? This way we could use ivy to pull the jars and build piggybank. I'm not really wild about creating a new section of contrib just for functions that have heavier weight requirements. Alan.

-D

On Mon, Nov 30, 2009 at 3:09 PM, Olga Natkovich ol...@yahoo-inc.com wrote: Hi Gerrit, It would be great if you could contribute the code. The process is pretty simple: - Open a JIRA that describes what the loader does and says that you would like to contribute it to the Piggybank. - Submit a patch that contains the loader. Make sure it has unit tests and javadoc. Once this is done, one of the committers will review and commit the patch. More details on how to contribute are at http://wiki.apache.org/pig/PiggyBank. Olga

-----Original Message----- From: Gerrit van Vuuren [mailto:gvanvuu...@specificmedia.com] Sent: Friday, November 27, 2009 2:42 AM To: pig-dev@hadoop.apache.org Subject: Pig reading hive columnar rc tables

Hi, I've coded a LoadFunc implementation that can read from Hive Columnar RC tables; this is needed for a project that I'm working on because all our data is stored using the Hive thrift serialized Columnar RC format. I have looked at the piggybank but did not find any implementation that could do this. We've been running it on our cluster for the last week and have worked out most bugs. There are still some improvements I would like to make, such as setting the number of mappers based on date partitioning. It's been optimized to read only specific columns, and it can churn through a data set almost 8 times faster with this improvement because not all column data is read. I would like to contribute the class to the piggybank; can you guide me in what I need to do? I've used Hive specific classes to implement this; is it possible to add this to the piggybank build ivy for automatic download of the dependencies? Thanks, Gerrit Jansen van Vuuren
Re: SQL in Pig?
We are still actively working on adding SQL to Pig. We hope to have an updated patch posted to that JIRA in February or March. Alan. On Jan 18, 2010, at 4:15 PM, Michael Dalton wrote: Hi, What's the current status of SQL support in Pig? I looked at the JIRA (http://issues.apache.org/jira/browse/PIG-824) and it seems like there hasn't been any activity on adding SQL to Pig since August. I was just curious whether that's something that's still being actively developed, is of interest to the Pig development team, and will be integrated at some point. Thanks. Best regards, Mike
Backward compatibility
Over the last year the number of Pig users has grown, both in terms of absolute numbers and the number of different companies using it. However, it is going to be a little while yet before Pig reaches a maturity level where it can declare a 1.0 release and promise it won't break backward compatibility until 2.0. So I think we need to discuss how we intend to handle backward compatibility across releases. The scope of what I'm covering under backward compatibility is all the interfaces and classes in org.apache.pig, the Pig Latin language, and the data formats that Pig's bundled loaders read and write. I propose the following criteria for deciding when to break backward compatibility: 1) We shouldn't break it without a strong reason. A strong reason is a show stopping bug, a compelling new feature, large gains in performance, or a change in architecture that significantly eases Pig use or development. Examples would be things like the load/store redesign, which should make it much easier to write load and store functions. 2) Where possible we should bundle disruptions of an interface together rather than spread them across releases. This avoids the death by a thousand cuts of having interfaces change a little bit each release. Thoughts? Alan.
Re: reading/writing HBase in Pig
On Jan 18, 2010, at 10:14 PM, Michael Dalton wrote: I took a look at the load-store branch and that definitely seems like the right place to do this. So the right thing to do would be to just open up a JIRA and then post a patch against the load-store rewrite tree, correct? Yes. You should take a look at PIG-1200, which seems to be going part way towards doing what you want to do. Alan.
Begin a discussion about Pig as a top level project
You have probably heard by now that there is a discussion going on in the Hadoop PMC as to whether a number of the subprojects (Hbase, Avro, Zookeeper, Hive, and Pig) should move out from under the Hadoop umbrella and become top level Apache projects (TLP). This discussion has picked up recently since the Apache board has clearly communicated to the Hadoop PMC that it is concerned that Hadoop is acting as an umbrella project with many disjoint subprojects underneath it. They are concerned that this gives Apache little insight into the health and happenings of the subproject communities, which in turn means Apache cannot properly mentor those communities. The purpose of this email is to start a discussion within the Pig community about this topic. Let me first cover what becoming a TLP would mean for Pig, and then I'll go into what options I think we as a community have.

Becoming a TLP would mean that Pig would itself have a PMC that would report directly to the Apache board. Who would be on the PMC is something we as a community would need to decide. Common options would be to say all active committers are on the PMC, or all active committers who have been a committer for at least a year. We would also need to elect a chair of the PMC. This lucky person would have no additional power, but would have the additional responsibility of writing quarterly reports on Pig's status for Apache board meetings, as well as coordinating with Apache to get accounts for new committers, etc. For more information see http://www.apache.org/foundation/how-it-works.html#roles

Becoming a TLP would not mean that we are ostracized from the Hadoop community. We would continue to be invited to Hadoop Summits, HUGs, etc. Since all Pig developers and users are by definition Hadoop users, we would continue to be a strong presence in the Hadoop community.

I see three ways that we as a community can respond to this: 1) Say yes, we want to be a TLP now. 2) Say yes, we want to be a TLP, but not yet. We feel we need more time to mature. If we choose this option we need to be able to clearly articulate how much time we need and what we hope to see change in that time. 3) Say no, we feel the benefits of staying with Hadoop outweigh the drawbacks of being a disjoint subproject. If we choose this, we need to be able to say exactly what those benefits are and why we feel they will be compromised by leaving the Hadoop project. There may be other options that I haven't thought of. Please feel free to suggest any you think of. Questions? Thoughts? Let the discussion begin. Alan.
JIRA Fix Version
A reminder to Pig committers: When closing a JIRA issue as Resolved/Fixed please make sure to set the Fix Version field. This helps our users know what versions they need to use to get fixes for their issues. And it helps release managers when they build releases to know what is and isn't in the release they're building. There were ~170 issues in Pig's JIRA marked fixed but with no version. I've assigned most of them to the appropriate version. Alan.
Re: Begin a discussion about Pig as a top level project
So far I haven't seen any feedback on this. Apache has asked the Hadoop PMC to submit input in April on whether some subprojects should be promoted to TLPs. We, the Pig community, need to give feedback to the Hadoop PMC on how we feel about this. Please make your voice heard. So now I'll heed my own call and give my thoughts on it.

The biggest advantage I see to being a TLP is a direct connection to Apache. Right now all of the Pig team's interaction with Apache is through the Hadoop PMC. Being directly connected to Apache would benefit Pig team members, who would have a better view into Apache. It would also raise our profile in Apache and thus make other projects more aware of us.

However, I am concerned about losing Pig's explicit connection to Hadoop. This concern has a couple of dimensions. One, Hadoop and MapReduce are the current flavor of the month in computing. Given that Pig shares a name with the common farm animal, it's hard to be sure based on search statistics. But Google trends shows that hadoop is searched on much more frequently than hadoop pig or apache pig (see http://www.google.com/trends?q=hadoop%2Chadoop+pig). I am guessing that most Pig users come from Hadoop users who discover Pig via Hadoop's website. Losing that subproject tab on Hadoop's front page may radically lower the number of users coming to Pig to check out our project. I would argue that this benefits Hadoop as well, since high level languages like Pig Latin have the potential to greatly extend the user base and usability of Hadoop.

Two, being explicitly connected to Hadoop keeps our two communities aware of each other's needs. There are features proposed for MR that would greatly help Pig. By staying in the Hadoop community Pig is better positioned to advocate for and help implement and test those features. The response to this will be that Pig developers can still subscribe to Hadoop mailing lists, submit patches, etc. That is, they can still be part of the Hadoop community. Which reinforces my point that it makes more sense to leave Pig in the Hadoop community, since Pig developers will need to be part of that community anyway.

Finally, philosophically it makes sense to me that projects that are tightly connected belong together. It strikes me as strange to have Pig as a TLP completely dependent on another TLP. Hadoop was originally a subproject of Lucene. It moved out to be a TLP when it became obvious that Hadoop had become independent of and useful apart from Lucene. Pig is not in that position relative to Hadoop.

So, I'm -1 on Pig moving out. But this is a soft -1. I'm open to being persuaded that I'm wrong or that my concerns can be addressed while still having Pig as a TLP. Alan.

On Mar 19, 2010, at 10:59 AM, Alan Gates wrote: [original message quoted in full above]
Re: Begin a discussion about Pig as a top level project
Do we still intend to position Pig as a data flow language that is backend agnostic? If the answer is yes, then there is a strong case for making Pig a TLP. Are we influenced by Hadoop? A big YES! The reason Pig chose to become a Hadoop sub-project was to ride the Hadoop popularity wave. As a consequence, we chose to be heavily influenced by the Hadoop roadmap. Like a good lawyer, I also have rebuttals to Alan's questions :) 1. Search engine popularity - We can discuss this with the Hadoop team and still retain links to TLPs that are coupled (loosely or tightly). 2. Explicit connection to Hadoop - I see this as logical connection v/s physical connection. Today, we are physically connected as a sub-project. Becoming a TLP will not increase/decrease our influence on the Hadoop community (think Logical, Physical and MR Layers :) 3. Philosophy - I have already talked about this. The tight coupling is by choice. If Pig continues to be a data flow language with clear syntax and semantics then someone can implement Pig on top of a different backend. Do we intend to take this approach? I just wanted to offer a different opinion to this thread. I strongly believe that we should think about the original philosophy. Will we have a Pig standards committee that will decide on changes to the language (think C/C++) if there are multiple backend implementations? I will reserve my vote based on the outcome of the philosophy and backward compatibility discussions. If we decide that Pig will be treated and maintained like a true language with clear syntax and semantics then we have a strong case to make it into a TLP. If not, we should retain our existing ties to Hadoop and make Pig into a data flow language for Hadoop. Santhosh

-----Original Message----- From: Thejas Nair [mailto:te...@yahoo-inc.com] Sent: Friday, April 02, 2010 4:08 PM To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy Subject: Re: Begin a discussion about Pig as a top level project

I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and heavily influenced by its roadmap. I think it makes sense to continue as a sub-project of hadoop. -Thejas

On 3/31/10 4:04 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Over time, Pig is increasing its coupling to Hadoop (for good reasons), rather than decreasing it. If and when Pig becomes a viable entity without hadoop around, it might make sense as a TLP. As is, I think becoming a TLP will only introduce unnecessary administrative and bureaucratic headaches. So my vote is also -1. -Dmitriy

On Wed, Mar 31, 2010 at 2:38 PM, Alan Gates ga...@yahoo-inc.com wrote: [Alan's message quoted in full in the earlier post]
Re: Begin a discussion about Pig as a top level project
Prognostication is a difficult business. Of course I'd love it if someday there were an ISO Pig Latin committee (with meetings in cool exotic places) deciding the official standard for Pig Latin. But that seems like saying in your startup's business plan, "When we reach Google's size, then we'll do x." If there ever is an ISO Pig Latin standard, it will be years off. As others have noted, staying tight to Hadoop now has many advantages, both in technical and adoption terms. Hence my advocacy of keeping Pig Latin Hadoop agnostic while tightly integrating the backend. Which is to say that in my view, Pig is Hadoop specific now, but there may come a day when that is no longer true. Whether Pig will ever move past just running on Hadoop to running on other parallel systems won't be known for years to come. Given that, do you think it makes sense to say that Pig stays a subproject for now, but if it someday grows beyond Hadoop only, it becomes a TLP? I could agree to that stance. Alan.

On Apr 3, 2010, at 12:43 PM, Santhosh Srinivasan wrote: I see this as a multi-part question. Looking back at some of the significant roadmap/existential questions asked in the last 12 months, I see the following: 1. With the introduction of SQL, what is the philosophy of Pig? (I sent an email about this approximately 9 months ago.) 2. What is the approach to supporting backward compatibility in Pig? (Alan sent an email about this 3 months ago.) 3. Should Pig be a TLP? (the current email thread). Here is my take on answering the aforementioned questions. The initial philosophy of Pig was to be backend agnostic. It was designed as a data flow language. Whenever a new language is designed, the syntax and semantics of the language have to be laid out. The syntax is usually captured in the form of a BNF grammar. The semantics are defined by the language creators. Backward compatibility is then a question of holding true to the syntax and semantics. With Pig, in addition to the language, the Java APIs were exposed to customers to implement UDFs (load/store/filter/grouping/row transformation etc.), provision looping since the language does not support looping constructs, and also support a programmatic mode of access. Backward compatibility in this context is to support API versioning. Do we still intend to position Pig as a data flow language that is backend agnostic? If the answer is yes, then there is a strong case for making Pig a TLP. Are we influenced by Hadoop? A big YES! The reason Pig chose to become a Hadoop sub-project was to ride the Hadoop popularity wave. As a consequence, we chose to be heavily influenced by the Hadoop roadmap. Like a good lawyer, I also have rebuttals to Alan's questions :) 1. Search engine popularity - We can discuss this with the Hadoop team and still retain links to TLPs that are coupled (loosely or tightly). 2. Explicit connection to Hadoop - I see this as logical connection v/s physical connection. Today, we are physically connected as a sub-project. Becoming a TLP will not increase/decrease our influence on the Hadoop community (think Logical, Physical and MR Layers :) 3. Philosophy - I have already talked about this. The tight coupling is by choice. If Pig continues to be a data flow language with clear syntax and semantics then someone can implement Pig on top of a different backend. Do we intend to take this approach? I just wanted to offer a different opinion to this thread. I strongly believe that we should think about the original philosophy. Will we have a Pig standards committee that will decide on changes to the language (think C/C++) if there are multiple backend implementations? I will reserve my vote based on the outcome of the philosophy and backward compatibility discussions. If we decide that Pig will be treated and maintained like a true language with clear syntax and semantics then we have a strong case to make it into a TLP. If not, we should retain our existing ties to Hadoop and make Pig into a data flow language for Hadoop. Santhosh

[remainder of the thread, quoted in full in the earlier posts, snipped]
Re: TypeCheckingVisitor and casting to less precise numeric types
You are correct that all of these casts can be done. We omitted them deliberately because, as you said, we did not want to lose precision. We should be able to downcast when users ask for it explicitly, but we don't want to do it implicitly. Alan.

On Mar 24, 2010, at 2:47 PM, Anil Chawla wrote: Hi, I know that Pig has logic for casting inputs to the expected data types when invoking a UDF, and I understand that this logic resides in the TypeCheckingVisitor class. I am curious to know why certain casts have been omitted from the castLookup map. Specifically, I do not see any entries for casting a more precise numeric type (e.g. Double) to a less precise numeric type (e.g. Integer). Is there any reason why all down conversions of numeric types have been omitted? Is it because we do not want to perform any automatic casts that lead to a loss of precision (loss of data)? In my situation, we are trying to abstract all numeric data types into a single number type. If a UDF takes a numeric parameter, we want Pig to invoke that UDF with any numeric argument, regardless of whether the argument must be upconverted or downconverted. We are OK with the loss of precision in that circumstance. As a result, we added the following to the castLookup map:

castLookup.put(DataType.LONG, DataType.INTEGER);
castLookup.put(DataType.FLOAT, DataType.LONG);
castLookup.put(DataType.FLOAT, DataType.INTEGER);
castLookup.put(DataType.DOUBLE, DataType.FLOAT);
castLookup.put(DataType.DOUBLE, DataType.LONG);
castLookup.put(DataType.DOUBLE, DataType.INTEGER);

All of these casts seem to work fine in our tests. Other than loss of precision, is there any reason why adding these casts might be a bad idea? Thanks, -Anil
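For what it's worth, the precision loss in question is just Java's narrowing-conversion semantics. A minimal standalone illustration (not Pig code) of what an implicit downcast would silently allow:

// Demonstrates the data loss that implicit numeric downcasts would allow.
public class NarrowingDemo {
    public static void main(String[] args) {
        double d = 1234567890.987;
        long l = (long) d;                // fraction silently dropped: 1234567890
        int i = (int) 9999999999L;        // only low 32 bits kept: 1410065407
        float f = (float) 1.0000000001d;  // rounds to 1.0f
        System.out.println(l + " " + i + " " + f);
    }
}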
Re: Shouldn't hadoop18.jar be removed from lib of trunk?
It should be removed. I filed https://issues.apache.org/jira/browse/PIG-1388 so we'll remember to remove it in 0.8. Alan. On Apr 21, 2010, at 10:24 PM, chaitanya krishna wrote: Hi, Since pig-trunk now supports hadoop-0.20 and already has hadoop20.jar, shouldn't hadoop18.jar be removed from it? I think it is redundant now. Or am I missing something? Regards, V.V.Chaitanya.
Re: Consider cleaning up backend code
A couple of years ago we had this concept that Pig as is should be able to run on other backends (like, say, Dryad if it were open source). So we built this whole backend interface and (mostly) kept Hadoop specific objects out of the front end. Recently we have modified that stance and said that this implementation of Pig is Hadoop specific. Pig Latin itself will still stay Hadoop independent. So the ability to have multiple backends is fine. But the ability to have non-Hadoop backends is not really interesting now. So I at least see the proposal here as getting rid of generic code that tries to hide the fact that we are working on top of Hadoop (things like DataStorage and ExecutionEngine). Alan. On Apr 22, 2010, at 4:14 PM, Arun C Murthy wrote: I read it as getting rid of concepts parallel to hadoop in src/org/apache/pig/backend/hadoop/datastorage. Is that true? thanks, Arun On Apr 22, 2010, at 1:34 PM, Dmitriy Ryaboy wrote: I kind of dig the concept of being able to plug in a different backend, though I definitely think we should get rid of the dead local-mode code. Can you give an example of how this will simplify the codebase? Is it more than just GenericClass foo = new SpecificClass(), and the associated extra files? -D On Thu, Apr 22, 2010 at 1:25 PM, Arun C Murthy a...@yahoo-inc.com wrote: +1 Arun On Apr 22, 2010, at 11:35 AM, Richard Ding wrote: Pig has an abstraction layer (interfaces and abstract classes) to support multiple execution engines. After PIG-1053, Hadoop is the only execution engine supported by Pig. I wonder if we should remove this layer of code and make Hadoop THE execution engine for Pig. This would greatly simplify the backend code. Thanks, -Richard
Re: When is the pig-0.7.0 and pig-0.8.0 scheduled to be released?
We've already branched for 0.7, which means we're not putting any new features in there, just critical bug fixes. We're extensively testing it now and hope to release it soon. We don't have a date for 0.8 yet. Alan. On Apr 23, 2010, at 2:08 AM, chaitanya krishna wrote: Hi, Can someone please tell me when pig-0.7.0 is planned to be released, i.e., when is the code-freeze date? Also, can someone tell me the relevant dates for pig-0.8.0? Thanks, V.V.Chaitanya.
Re: [VOTE] Release Pig 0.7.0 (candidate 0)
+1. Ran the tutorial and some simple smoke tests on my mac and on linux. Checked that the signature keys are good. Alan. On May 5, 2010, at 11:44 AM, Daniel Dai wrote: Hi, I have created a candidate build for Pig 0.7.0. A description of what is new and different is included in the release notes: http://people.apache.org/~daijy/pig-0.7.0-candidate-0/RELEASE_NOTES.txt Keys used to sign the release are available at http://svn.apache.org/viewvc/hadoop/pig/trunk/KEYS?view=markup Please download, test, try it out and vote. The download link is: http://people.apache.org/~daijy/pig-0.7.0-candidate-0 Thanks Daniel
[Travel Assistance] - Applications Open for ApacheCon NA 2010
The Travel Assistance Committee is now accepting applications from those wanting to attend ApacheCon North America (NA) 2010, which is taking place between the 1st and 5th of November in Atlanta. The Travel Assistance Committee is looking for people who would like to attend ApacheCon but who need some financial support in order to get there. There are limited places available, and all applications will be scored on their individual merit. Financial assistance is available to cover travel to the event, either in part or in full, depending on circumstances. However, the support available for those attending only the barcamp is smaller than that for people attending the whole event. The Travel Assistance Committee aims to support all ApacheCons and cross-project events, so it may be prudent for those in Asia and the EU to wait for an event closer to them. More information can be found on the main Apache website at http://www.apache.org/travel/index.html - where you will also find a link to the online application and details for submitting. Applications for travel assistance are now being accepted and will close on the 7th of July 2010. Good luck to all who apply. You are welcome to tweet or blog as appropriate. Regards, The Travel Assistance Committee.
Re: Code Repository
http://wiki.apache.org/pig/HowToContribute Alan. On May 20, 2010, at 9:15 PM, Renato Marroquín Mogrovejo wrote: Hi, is there a Pig coding standard, or any type of documentation I could follow? Thanks. Renato M.
Re: About PigPen
The one on the JIRA is more up to date. However, be aware that PigPen has not been updated since Pig 0.2 and does not work with new versions of Pig. Alan. On May 23, 2010, at 11:25 PM, Renato Marroquín Mogrovejo wrote: Hi, does anybody know which of these is the current PigPen release? I found two links. The first one is from the wiki and the second one is from the jira. http://issues.apache.org/jira/secure/attachment/12393772/org.apache.pig.pigpen_0.0.1.jar https://issues.apache.org/jira/secure/attachment/12400858/PigPen.tgz Thanks in advance. Renato M.
Re: does EvalFunc generate the entire bag always ?
The default case is that UDFs that take bags (such as COUNT, etc.) are handed the entire bag at once. In the case where all UDFs in a foreach implement the Algebraic interface and the expression itself is algebraic, then the combiner will be used, thus significantly limiting the size of the bag handed to the UDF. The Accumulator interface does hand records to the UDF a few thousand at a time. Currently it has no way to turn off the flow of records. What you want might be accomplished by the LIMIT operator, which can be used inside a nested foreach. Something like:

C = foreach B {
    C1 = order A by $0;
    C2 = limit C1 5;
    generate myUDF(C2);
}

Alan. On May 26, 2010, at 11:59 AM, hc busy wrote: Hey, guys, how are bags passed to EvalFunc stored? I was looking at the Accumulator interface and it says that the reason why this is needed for COUNT and SUM is because EvalFunc always gives you the entire bag when the EvalFunc is run on a bag. I always thought that if I did COUNT(TABLE) or SUM(TABLE.FIELD), the code inside that does for(Tuple entry : inputDataBag){ stuff } was an actual iterator that iterated on the bag sequentially without necessarily having the entire bag in memory all at once. ?? Because it's an iterator, there's no way to do anything other than stream through it. I'm looking at this because Accumulator has no way of telling Pig "I've seen enough". It streams through the entire bag no matter what happens. (Like, hypothetically speaking, if I was writing a "5th item of a sorted bag" UDF: after I see the 5th item of a 5 million entry bag, I want to stop executing if possible.) Is there an easy way to make this happen?
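To make the combiner case above concrete, a minimal Pig Latin sketch (alias and file names made up): COUNT implements the Algebraic interface, so in a script like this the combiner can pre-aggregate each group's bag on the map side instead of handing the whole bag to the UDF in the reduce:

A = load 'myfile';
B = group A by $0;
C = foreach B generate group, COUNT(A);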
Hudson returning -1 on javadoc
Since its return from the hospital, Hudson has been returning -1 on all patches submitted, complaining about a broken javadoc tag. It turns out the bad tag snuck into the code whilst Hudson was away. I've checked in a fix, so Hudson should be happy again. Any patches that were flunked just for that one javadoc warning should be considered ok. Alan.
Re: does EvalFunc generate the entire bag always ?
I don't think it pushes limit yet in this case. Alan.

On Jun 1, 2010, at 1:44 PM, hc busy wrote:

Well, see, that's the thing: the 'order A by $0' is already n lg(n)... ahh, I see, my own example suffers from this problem. I guess I'm wondering how 'limit' works in conjunction with UDFs... A practical application escapes me right now, but if I do

C = foreach B {
    C1 = MyUdf(B.bag_on_b);
    C2 = limit C1 5;
}

does it know to push the limit in this case?
Re: algebraic optimization not invoked for filter following group?
For at least simple cases what's in the pseudocode should work. I hope someday soon we can start using the new logical optimizer work (in the experimental package) to build rules for the MR optimizer (like this combiner stuff) as well, which should be much easier to code. But it will be a while before we get there. I don't think this will automatically make it work for split, because I think it will see the split in the plan and that will make it choose not to optimize. Alan. On Jun 2, 2010, at 4:18 PM, Dmitriy Ryaboy wrote: It looks like right now, the combiner optimization does not kick in for a script like this:

data = load 'foo' using PigStorage() as (a, b, c);
grouped = group data by a;
filtered = filter grouped by COUNT(data) > 1000;

Looking at the code in CombinerOptimizer, it seems like the Filter bit is just pseudo-coded in comments. Are there complications there other than what is already noted, or is it just a matter of coding up the pseudocode? On that note -- assuming the optimization was implemented for Filter following group, would it automagically start working for Splits as well? -D
Re: SIZE() of relation
There have been several requests for this. I'm not a fan of it, because it makes it too easy to forget that you're forcing a single-reducer MR job to accomplish this. But I'm open to persuasion if everyone else disagrees. Alan. On Jun 11, 2010, at 7:27 PM, Russell Jurney wrote: This would be great. Save us from GROUP ALL/FOREACH, which is awkward. On Fri, Jun 11, 2010 at 7:14 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: It would be cool to just treat relations as bags in the general case. They kind of are, and kind of are not, which causes lots of user confusion. There are obvious users-doing-dumb-stuff scenarios that arise, though. I guess the Pig philosophy is that the user is the optimizer, though.. so maybe it's ok. -D On Fri, Jun 11, 2010 at 6:42 PM, Russell Jurney russell.jur...@gmail.com wrote: Would it be possible, and not a ton of work, to make the builtin SIZE() work on a relation? Reason being, I frequently do this:

B = GROUP A ALL;
C = FOREACH B GENERATE SIZE(A) AS total;
DUMP C;

And I would rather do this: DUMP SIZE(A); Russ
Re: the last job in the mapreduce plan
I've never seen a case where this happens. Is this a theoretical question or are you seeing this issue? Alan. On Jun 15, 2010, at 8:49 AM, Gang Luo wrote: Hi, is it possible that the last MapReduce job in the MR plan only loads something and stores it without any other processing in between? For example, when visiting some physical operator, we need to end the current MR operator after embedding the physical operator into the MR operator, and create a new MR operator for later physical operators. Unfortunately, the following physical operator is a store, the end of the entire query. In this case, the last MR operator only contains a load and a store without any meaningful work in between. This idle MapReduce job will degrade performance. Could this happen in Pig? Thanks, -Gang
Re: skew join in pig
On Jun 16, 2010, at 8:36 AM, Gang Luo wrote: Hi, there is something confusing me in the skew join (http://wiki.apache.org/pig/PigSkewedJoinSpec ). 1. Does the sampling job sample and build a histogram on both tables, or just one table (in this case, which one)? Just the left one. 2. The join job still takes the two tables as inputs, and shuffles tuples from the partitioned table to a particular reducer (one tuple to one reducer), and shuffles tuples from the streamed table to all reducers associated with one partition (one tuple to multiple reducers). Is that correct? Keys with small enough values to fit in memory are shuffled to reducers as normal. Keys that are too large are split between reducers on the left side, and replicated to all of those reducers that have the splits (not all reducers) on the right side. Does that answer your question? 3. Hot keys need more than one reducer. Are these reducers dedicated to this key only? Could they also take other keys at the same time? They take other keys at the same time. 4. For non-hot keys, my understanding is that they are shuffled to reducers based on the default hash partitioner. However, it could happen that all the keys shuffled to one reducer incur skew even if none of them is skewed individually. This is always the case in map reduce, though a good hash function should minimize the occurrences of this. Can someone give me some ideas on these? Thanks. -Gang Alan.
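For reference, the feature under discussion is requested in Pig Latin with the 'skewed' join keyword; a minimal sketch (alias and file names made up):

big = load 'big_table';
other = load 'other_table';
C = join big by $0, other by $0 using 'skewed';

The sampling job, histogram, and hot-key splitting described above all happen behind this one clause.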
Re: skew join in pig
Are you asking how many reducers are used to split a hot key? If so, the answer is as many as we estimate it will take to make the records for the key fit into memory. For example, if we have a key which we estimate has 10 million records, each record being about 100 bytes, and for each reduce task we have 400M available, then we will allocate 3 reducers for that hot key. We do not need to take into account any other keys sent to this reducer because reducers process rows one key at a time. Alan. On Jun 16, 2010, at 11:51 AM, Gang Luo wrote: Thanks for replying. It is much clearer now. One more thing to ask about the third question: how are reducers allocated to several hot keys? Hashing? Further, Pig doesn't divide the reducers into hot-key reducers and non-hot-key reducers, is that right? Thanks, -Gang
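Written out, the arithmetic behind Alan's example is: 10,000,000 records x 100 bytes = 1 GB of data for the hot key; 1 GB / 400 MB of memory per reduce task = 2.5, which rounds up to 3 reducers.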
Re: Avoiding serialization/de-serialization in pig
On Jun 28, 2010, at 5:51 PM, Dmitriy Ryaboy wrote: For what it's worth, I saw very significant speed improvements (an order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol buffer based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of reflecting every field, would also be a big win. Thoughts on just using Avro for the internal PigStorage? I've been trying to play with this in my spare time but haven't gotten far yet. We're certainly open to looking at it and seeing how it performs. Alan. -D On Mon, Jun 28, 2010 at 5:08 PM, Thejas Nair te...@yahoo-inc.com wrote: I have created a wiki page which puts together some ideas that can help in improving performance by avoiding/delaying serialization/de-serialization: http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functions. Your feedback is welcome. Thanks, Thejas
Notes from Pig contributor workshop
On June 30th Yahoo hosted a Pig contributor workshop. Pig contributors from Yahoo, Twitter, LinkedIn, and Cloudera were present. The slides used for the presentations that day have been uploaded to http://wiki.apache.org/pig/PigTalksPapers . Here's a digest of what was discussed there. For those who were there, if I forgot anything please feel free to add it in.

Thejas Nair discussed his work on performance. In particular he has been looking into how to more efficiently de/serialize complex data types and when Pig can make use of lazy deserialization. Dmitriy Ryaboy brought up the question of whether Pig would be open to using Avro for de/serialization between Map and Reduce and between MR jobs. We concluded that we are open to using whatever is fast.

Richard Ding discussed the work he has been doing to make Pig run statistics available to users via the logs, to applications running Pig (such as workflow systems) via a new PigRunner API, and to developers via Hadoop job history files. Russell Jurney brought up that it would be nice if this API also included record input and output on a per-MR-job level so that users diagnosing issues with their Pig Latin scripts would have a better idea in which MR job things went wrong.

Ashutosh Chauhan gave an overview of the work that has been going on to add UDFs in scripting languages to Pig (PIG-928).

Daniel Dai talked about the rewrite of the logical optimizer that he has been doing, including an overview of the major rules being implemented in the new optimizer framework. Dmitriy indicated that he would really like to see pushing of limits into the RecordReader (so that we can terminate reading early) added to the list of rules. This would involve making use of the new optimizer framework in the MR optimizer. Alan Gates indicated that while he does not believe we should translate the entire set of MR optimizer visitors into the new framework until we've further tested the framework, this might be a good first test for the new optimizer in the MR optimizer.

Aniket Mokashi showed the work he's been doing to add a custom partitioner to Pig. He also covered his work to add the ability to reuse a relation that contains a single record with a single field as a scalar. Dmitriy pointed out that we need to make sure this uses the distributed cache to minimize strain on the namenode.

Pradeep Kamath gave a short presentation on Howl, the work he is leading to create a shared metadata system between Pig, Hive, and Map Reduce. Dmitriy noted that we need to get this work more in the open so others can participate and contribute.

Russell Jurney talked about his work on adding datetime types to Pig. He indicated he was interested in using Jodatime as the basis for this. There were some questions on how these types would be serialized in text files where the type information might be lost.

Olga Natkovich talked about areas the Yahoo Pig team would like to work on in the future, mostly focused on usability. These included changing our parser to one that will allow us to give better error messages; Dmitriy indicated he strongly preferred Antlr. They also included resurrecting support for the illustrate command, which we have let lapse. Richard and Ashutosh noted that how illustrate works internally needs some redesign, because currently it requires special code inside each physical operator. This makes it hard to maintain illustrate in the face of new operators, and pollutes the main code path during execution.
Instead it should be done via callbacks or some other solution. After these presentations the group took on a couple of topics for discussion. The first was how Pig should grow to become Turing complete. For this Dmitriy and Ning Liang presented Piglet, a Ruby library they use at Twitter to wrap Pig and provide branching, looping, functions, and modules. Several people in the group expressed concerns that growing Pig Latin itself to be Turing complete will result in a poorly thought out language with insufficient tools and too much maintenance in the future. One suggestion that was made was to create a Java interface that allowed users to directly construct Pig data flows. That is, this interface would (roughly) have a method for each Pig operator. Users could then construct Pig data flows directly in Java. Users who wished to use scripting languages could still access this with no additional work via Jython, JRuby, Groovy, etc. The second discussion centered on Pig's support for workflow systems such as Oozie and Azkaban. There have been proposals in the past that Pig switch to generate Oozie workflows instead of MR jobs. Alan indicated that he does not see the value of this. There have been proposals that Pig Latin be extended to include workflow controls. Dmitriy and Russell both
Announcing Howl development list
On Jul 14, 2010, at 2:11 AM, Jeff Hammerbacher wrote: Hey, Thanks for writing up these notes, they're very useful. Pradeep Kamath gave a short presentation on Howl, the work he is leading to create a shared metadata system between Pig, Hive, and Map Reduce. Dmitriy noted that we need to get this work more in the open so others can participate and contribute. Is there a public JIRA where one could follow this work? Any chance we can break it up into incremental milestones rather than have a single code drop as with previous large features in Pig? I understand it may be difficult to coordinate internal development with external user groups, but I hope the feedback from third parties might make such a process worthwhile. A wiki page outlining Howl is at http://wiki.apache.org/pig/Howl A howldev mailing list has been set up on Yahoo! groups for discussions on Howl. You can subscribe by sending mail to howldev-subscr...@yahoogroups.com . We plan on putting the code on github in a read only repository. It will be a few more days before we get there. It will be announced on the list when it is. Alan.
Restarting discussion on Pig as a TLP
Five months ago I started a discussion on whether Pig should become a top level project (TLP) at Apache instead of remaining a subproject of Hadoop (http://mail-archives.apache.org/mod_mbox/hadoop-pig-dev/201003.mbox/%3c006aea7c-8829-4788-ad7b-822396fa2...@yahoo-inc.com%3e ). At the time I voted against it (http://mail-archives.apache.org/mod_mbox/hadoop-pig-dev/201003.mbox/%3cf1484964-e774-48b7-9d45-6e57c7b09...@yahoo-inc.com%3e ), as did many others. However, I would like to restart that discussion now.

I gave several reasons for voting against it. First, I was worried that by losing our connection to Hadoop, Pig would lose its source of new users. I have since been assured by Hadoop members that Pig would be free to keep our tab on their page (as HBase has). Also, obviously we would still be welcomed at Hadoop get-togethers such as the various HUGs, Hadoop Summits, etc. So our connection does not seem in danger.

Second, I was concerned that by not being members of the Hadoop community we would lose influence with Hadoop. It is true that Pig developers will have to stay active in the Hadoop community, which will put a slight extra burden on them. But they are already bearing this burden, and whether or not the communities are governed by the same or separate PMCs will not affect this.

Finally, I said that philosophically it makes sense to me that all Hadoop related projects should stay under one umbrella. This still makes sense to me, and I do see this as a downside of Pig moving out of Hadoop.

In addition to the above, a few other things have happened over the intervening months to cause me to reconsider. Most importantly, it has become clear to me that Pig is operating as if it were a TLP inside Hadoop. We have four members on the Hadoop PMC, which means we have sufficient votes to elect our committers and release our products. Also, several Hadoop PMC members who have long experience in Apache projects have made clear to me that they believe Pig is ready to be a TLP.

I was also concerned about diversity in our PMC, since our project is Yahoo heavy. Given that 10 out of 12 committers are Yahoo employees, we need to work on this. But we do have experienced committers in three different organizations, and I think this gives us a sufficient base to work on it as a TLP.

So, in summary, I have switched my view on this from "not yet" to "now is a good time". I think Pig is ready to be a TLP. We have a community of contributors and users that is growing both in numbers and in diversity. We have a strong group of committers who I believe are ready to take on leadership of the project and who will benefit from being mentored by the larger Apache community. Thoughts? Alan.
August Pig contributor workshop
All, We will be holding the next Pig contributor workshop at Twitter on Wednesday, August 25 from 4-6. The tentative agenda is to discuss: Making Piggybank better Pig and Azkaban integration Plans for features in 0.9 An update on the Howl project Anyone contributing to or interested in contributing to Pig development is welcome to attend. Please RSVP by Friday, August 20th. Twitter is located at 795 Folsom St., Suite 600 in San Francisco. Alan.
[VOTE] Pig to become a top level Apache project
Earlier this week I began a discussion on Pig becoming a TLP (http://bit.ly/byD7L8 ). All of the received feedback was positive. So, let's have a formal vote. I propose we move Pig to a top level Apache project. I propose that the initial PMC of this project be the list of all currently active Pig committers (http://hadoop.apache.org/pig/whoweare.html ) as of 18 August 2010. I nominate Olga Natkovich as the chair of the PMC. (PMC chairs have no more power than other PMC members, but they are responsible for writing regular reports for the Apache board, assigning rights to new committers, etc.) I propose that as part of the resolution that will be forwarded to the Apache board we include that one of the first tasks of the new Pig PMC will be to adopt bylaws for the governance of the project. Alan. P.S. If this vote passes, the next step is that the proposal will be forwarded to the Hadoop PMC for discussion and vote. If the Hadoop PMC vote passes, a formal resolution is then drafted (see http://bit.ly/bvOTRq for an example resolution) and sent to the Apache board. The Apache board will then vote on whether to make Pig a TLP.
Re: August Pig contributor workshop
Confirming Olga and I will be there. Alan. On Aug 18, 2010, at 4:45 PM, Dmitriy Ryaboy wrote: Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy
Re: release notes in JIRA
+1 Backloading documentation is error prone and leads to not getting documentation done. Alan. On Aug 20, 2010, at 4:11 PM, Olga Natkovich wrote: Guys, After spending the last couple of days collecting information for Pig 0.8.0 documentation, I would like to propose a change for our patch process that would make my life easier :). I would like to ask developers working on patches with new customer facing features or user visible modifications to the existing features to fill in the Release Notes part of JIRA as part of their patch submission process. The Release Notes section should contain all the information that would be needed to create user documentation including - Feature definition - Cases in which feature is applicable - Notes indicating if this feature/changes to the feature breaks backward compatibility - Usage examples. (Please, make sure you actually run all the examples.) - Anything else that would assist users in using the feature. I would like to ask the reviewers to review the Release Notes as part of their patch review process. Please, let me know if you have any questions or concerns. Thanks, Olga
Re: [VOTE] Pig to become a top level Apache project
With 9 +1 votes and no -1s the vote passes. I will begin a vote on Hadoop general. Alan.
Re: Caster interface and byte conversion
This seems fine. Is the Pig engine at any point testing to see if the interface is implemented and if so calling toBytes, or is this totally for use inside the store functions themselves to serialize Pig data types? Alan. On Aug 22, 2010, at 1:40 AM, Dmitriy Ryaboy wrote: The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :) -D On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I just noticed that even though Utf8StorageConverter implements the various byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface -- and therefore can't be relied on when using modular Casters, like I am trying to do for the HBaseLoader. Since we don't want to introduce backwards-incompatible changes, I propose adding a ByteCaster interface that defines these methods, and extending Utf8StorageConverter to implement them (without actually changing the implementation at all). That way StoreFuncs that need to convert to bytes can use pluggable converters. Objections? -D
Re: is Hudson awol?
Yes, our friend Hudson is ill again. Giri, Hudson's doctor, should get a chance to look at it in a few days. Alan. On Aug 23, 2010, at 3:31 PM, Dmitriy Ryaboy wrote: Haven't heard anything from Hudson in a while... -D
Re: Caster interface and byte conversion
One other comment. By making this part of an interface that extends LoadCaster you are assuming the implementing class is both a load and store function. It makes more sense to have a separate StoreCaster interface rather than extending LoadCaster. Alan.
Re: Caster interface and byte conversion
On Aug 24, 2010, at 1:22 PM, Dmitriy Ryaboy wrote: As far as the toBytes methods -- I am not sure what they were originally for. They aren't actually called anywhere that I can find, except my new HBase stuff. You are right, I could make it two interfaces, but I consolidated them for simplicity of use/implementation. Now that I think about it, I can put all the methods into StoreCaster and just have a unioning interface for simplicity:

@InterfaceAudience.Public
@InterfaceStability.Evolving
public interface LoadStoreCaster extends LoadCaster, StoreCaster {
}

Does that seem ok? Yeah, makes sense. Alan. -D
Fwd: hudson patch test jobs : hadoop pig and zookeeper
Begin forwarded message: From: Giridharan Kesavan gkesa...@yahoo-inc.com Date: August 24, 2010 4:38:46 PM PDT To: gene...@hadoop.apache.org Subject: hudson patch test jobs : hadoop pig and zookeeper Reply-To: gene...@hadoop.apache.org Hi, We have a new Hudson master, hudson.apache.org; hudson.zones.apache.org is retired. This means that we need to port all our patch test admin jobs for hadoop (common, hdfs, mapred), pig and zookeeper to the new Hudson master. I'm working on configuring the patch admin jobs with the new Hudson master, hudson.apache.org. (This is exactly the reason why the patch test builds are not running at the moment.) Thanks Giri
Re: Pig Contributor meeting notes
On Aug 26, 2010, at 12:55 AM, Jeff Zhang wrote: Wonderful, Dmitriy. It's a pity that I missed the contributor meeting. Were any slides shared? Jeff, we don't want to exclude our contributors who don't happen to live in the San Francisco Bay Area. If we could include you via Skype or some other technology we'd be happy to set it up on our end. Do you think something like that would work for you? Alan.
Re: Does Pig Re-Use FileInputLoadFuncs Objects?
I'm not 100% sure I understand the question. Are you asking if it reuses instances of a given load or store function? It should not. Alan. On Aug 31, 2010, at 7:28 PM, Russell Jurney wrote: Pardon the cross-post: does Pig ever re-use FileInputLoadFunc objects? We suspect state is being retained between different stores, but we don't actually know this. Figured I'd ask to verify the hunch. Our load func for our in-house format works fine with Pig scripts normally... but I have a pig script that looks like this:

LOAD thing1
SPLIT thing1 INTO thing2, thing3
STORE thing2 INTO thing2
STORE thing3 INTO thing3
LOAD thing4
SPLIT thing4 INTO thing5, thing6
STORE thing5 INTO thing5
STORE thing6 INTO thing6

And it works via PigStorage, but not via our FileInputLoadFunc. Russ
Re: help : error run pig
Pig is failing to connect to your namenode. Is the address Pig is trying to use (hdfs://master:54310/) correct? Can you connect using that string from the same machine using bin/hadoop (for example, bin/hadoop fs -ls hdfs://master:54310/)? Alan. On Sep 27, 2010, at 8:45 AM, Ngô Văn Vĩ wrote: I run Pig in Hadoop mode (Pig-0.7.0 and hadoop-0.20.2) and get this error:

ng...@master:~/pig-0.7.0$ bin/pig
10/09/27 08:39:40 INFO pig.Main: Logging error messages to: /home/ngovi/pig-0.7.0/pig_1285601980268.log
2010-09-27 08:39:40,538 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://master:54310/
2010-09-27 08:39:41,760 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 0 time(s).
2010-09-27 08:39:42,762 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 1 time(s).
2010-09-27 08:39:43,763 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 2 time(s).
2010-09-27 08:39:44,765 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 3 time(s).
2010-09-27 08:39:45,766 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 4 time(s).
2010-09-27 08:39:46,767 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 5 time(s).
2010-09-27 08:39:47,768 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 6 time(s).
2010-09-27 08:39:48,769 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 7 time(s).
2010-09-27 08:39:49,770 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 8 time(s).
2010-09-27 08:39:50,771 [main] INFO org.apache.hadoop.ipc.Client - Retrying connect to server: master/192.168.230.130:54310. Already tried 9 time(s).
2010-09-27 08:39:50,780 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage

Help me?? Thanks -- Ngô Văn Vĩ Công Nghệ Phần Mềm Phone: 01695893851
[jira] Updated: (PIG-519) allow for '#' to signify a comment in a PIG script
[ https://issues.apache.org/jira/browse/PIG-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-519: --- Resolution: Fixed Fix Version/s: types_branch Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Checked in modified version of the patch that just supported #!. Thanks Ian for the contribution. allow for '#' to signify a comment in a PIG script -- Key: PIG-519 URL: https://issues.apache.org/jira/browse/PIG-519 Project: Pig Issue Type: Wish Components: grunt Environment: linux/unix Reporter: Ian Holsman Priority: Trivial Fix For: types_branch Attachments: comment.patch, pig.pig in unix type operating systems, it is common to just run scripts directly from the shell. In order to do this scripts need to have the command to run them on the first line similar to #!/usr/bin/env pig - this patch allows you to just run scripts without specifying pig -f XXX -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
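With this change, a script can name the interpreter on its first line and be run directly from the shell; a minimal sketch (file contents and name made up):

#!/usr/bin/env pig
A = load 'myfile';
dump A;

Making the file executable then lets you invoke it as ./myscript.pig rather than pig -f myscript.pig, which is the convenience the issue asked for.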
[jira] Commented: (PIG-512) Expressions in foreach lead to errors
[ https://issues.apache.org/jira/browse/PIG-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12647700#action_12647700 ] Alan Gates commented on PIG-512: In LogicalPlanCloneHelper, why do you need this:
{code}
protected void visit(LOCross cs) throws VisitorException {
    super.visit(cs);
}
{code}
Won't Java do that for you? What is the significance of the changes in TypeCheckingVisitor? Neither of these issues is big enough to require a new patch. The current one looks good (and big :) ).

Expressions in foreach lead to errors - Key: PIG-512 URL: https://issues.apache.org/jira/browse/PIG-512 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Santhosh Srinivasan Assignee: Santhosh Srinivasan Fix For: types_branch Attachments: PIG-512.patch, PIG-512_1.patch

Use of expressions that use the same sub-expressions in foreach leads to translation errors. This issue is caused by sharing operators across nested plans. To remedy this issue, logical operators should be cloned and not shared across plans.
{code}
grunt> a = load 'a' as (x, y, z);
grunt> b = foreach a { exp1 = x + y; exp2 = exp1 + x; generate exp1, exp2; }
grunt> explain b;
2008-10-30 15:38:40,257 [main] WARN org.apache.pig.PigServer - bytearray is implicitly casted to double under LOAdd Operator
2008-10-30 15:38:40,258 [main] WARN org.apache.pig.PigServer - bytearray is implicitly casted to double under LOAdd Operator
2008-10-30 15:38:40,258 [main] WARN org.apache.pig.PigServer - bytearray is implicitly casted to double under LOAdd Operator
Logical Plan:
Store sms-Thu Oct 30 11:27:27 PDT 2008-2609 Schema: {double,double} Type: Unknown
|
|---ForEach sms-Thu Oct 30 11:27:27 PDT 2008-2605 Schema: {double,double} Type: bag
    |
    | Add sms-Thu Oct 30 11:27:27 PDT 2008-2600 FieldSchema: double Type: double
    | |
    | |---Cast sms-Thu Oct 30 11:27:27 PDT 2008-2606 FieldSchema: double Type: double
    | |   |
    | |   |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2598 Projections: [0] Overloaded: false FieldSchema: x: bytearray Type: bytearray
    | |       Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    | |
    | |---Cast sms-Thu Oct 30 11:27:27 PDT 2008-2607 FieldSchema: double Type: double
    |     |
    |     |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2599 Projections: [1] Overloaded: false FieldSchema: y: bytearray Type: bytearray
    |         Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    |
    | Add sms-Thu Oct 30 11:27:27 PDT 2008-2603 FieldSchema: double Type: double
    | |
    | |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2601 Projections: [*] Overloaded: false FieldSchema: double Type: double
    | |   Input: Add sms-Thu Oct 30 11:27:27 PDT 2008-2600
    | |
    | |---Add sms-Thu Oct 30 11:27:27 PDT 2008-2600 FieldSchema: double Type: double
    | |   |
    | |   |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2598 Projections: [0] Overloaded: false FieldSchema: x: bytearray Type: bytearray
    | |   |   Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    | |   |
    | |   |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2599 Projections: [1] Overloaded: false FieldSchema: y: bytearray Type: bytearray
    | |       Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    | |
    | |---Cast sms-Thu Oct 30 11:27:27 PDT 2008-2608 FieldSchema: double Type: double
    |     |
    |     |---Project sms-Thu Oct 30 11:27:27 PDT 2008-2602 Projections: [0] Overloaded: false FieldSchema: x: bytearray Type: bytearray
    |         Input: Load sms-Thu Oct 30 11:27:27 PDT 2008-2597
    |
    |---Load sms-Thu Oct 30 11:27:27 PDT 2008-2597 Schema: {x: bytearray,y: bytearray,z: bytearray} Type: bag
2008-10-30 15:38:40,272 [main] ERROR org.apache.pig.impl.plan.OperatorPlan - Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject multiple outputs. This operator does not support multiple outputs.
2008-10-30 15:38:40,272 [main] ERROR org.apache.pig.backend.hadoop.executionengine.physicalLayer.LogToPhyTranslationVisitor - Invalid physical operators in the physical plan. Attempt to give operator of type org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject multiple outputs. This operator does not support multiple outputs.
2008-10-30 15:38:40,273 [main] ERROR org.apache.pig.tools.grunt.GruntParser - java.io.IOException: Unable to explain alias b [org.apache.pig.impl.plan.VisitorException]
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile
{code}
[jira] Commented: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12652642#action_12652642 ] Alan Gates commented on PIG-460: Here's a quick write-up of what will need to be done to change order by from being a 3 MR job process to 2. Currently sampling is done via org.apache.pig.impl.builtin.RandomSampleLoader. Since this loader extends BinStorage, the first MR job reads the data in whatever format and then stores it again using BinStorage. It is then read in the second job using RandomSampleLoader. The tuples that are selected by RandomSampleLoader are grouped into a single reducer and then fed to org.apache.pig.impl.builtin.FindQuantiles, which builds a side file containing partitioning information. The third MR job again reads the data and uses the side file in the SortPartitioner. (It may be helpful to do an explain on a simple order by query to see all this.) What needs to change is that RandomSampleLoader should instead become an EvalFunc, RandomSampler. The logic inside can remain the same. The MRCompiler will need to change to create two MR jobs for the sort instead of 3. The first job should contain a ForEach operator with the new RandomSampler function in the map. Its reduce should look just like the reduce of the second MR job in the current system (that is, singular and having a ForEach operator that calls FindQuantiles). The second job should remain exactly the same as the third job in the current system. Take a look at MRCompiler.visitSort() for an idea of how sort jobs are constructed now. It's this function and the functions it calls that you'll be changing in MRCompiler.

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2 Key: PIG-460 URL: https://issues.apache.org/jira/browse/PIG-460 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: Alan Gates Assignee: Alan Gates Fix For: types_branch

Currently order by is done in three MR jobs:
job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort
It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it. If the logic in RandomSampleLoader was made into an operator instead of being in a loader then jobs 1 and 2 could be merged. On average job 1 takes about 15% of the time of an order by script. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
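As the comment suggests, the three-job structure is easy to see by running explain on a trivial sort; a minimal sketch (file and alias names made up):
{code}
A = load 'myfile';
B = order A by $0;
explain B;
{code}
The plan printed at the end should show the three MapReduce jobs described above; after this change it would show two.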
[jira] Resolved: (PIG-6) Addition of Hbase Storage Option In Load/Store Statement
[ https://issues.apache.org/jira/browse/PIG-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates resolved PIG-6. -- Resolution: Fixed Fix Version/s: types_branch Hadoop Flags: [Reviewed] V01 patch checked in. Thanks Sam for stepping up and taking on this issue that many people had requested.

Addition of Hbase Storage Option In Load/Store Statement Key: PIG-6 URL: https://issues.apache.org/jira/browse/PIG-6 Project: Pig Issue Type: New Feature Environment: all environments Reporter: Edward J. Yoon Fix For: types_branch Attachments: hbase-0.18.1-test.jar, hbase-0.18.1.jar, PIG-6.patch, PIG-6_V01.patch

It needs to be able to load a full table in hbase. (maybe ... difficult? i'm not sure yet.) Also, as described below, it needs to compose an abstract 2d-table only with certain data filtered from the hbase array structure using an arbitrary query delimiter.
{code}
A = LOAD table('hbase_table');
or
B = LOAD table('hbase_table') Using HbaseQuery('Query-delimited by attributes timestamp') as (f1, f2[, f3]);
{code}
Once tests are done on my local machines, I will clarify the grammar and give you more examples to help explain more storage options. Any advice welcome. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658573#action_12658573 ] Alan Gates commented on PIG-554: A couple of questions: 1) I'm still not clear on why the additional maps are needed to load the replicated inputs into files. Those inputs are already in files. Are you somehow transforming them? Isn't this exactly where we should be using the DistributedCache? Rather than having map jobs that transform them, I think the best thing would be to have the MRCompiler set a flag for the JobControlCompiler to load those files into the DC for this job. 2) You are using POLocalRearrange both in setting up the hash table and in reading the fragmented table before the join. What benefit is being derived from this? LR adds a lot of extra weight to the tuple that I don't think is needed. I suspect we could fit more tuples into memory if we loaded them directly rather than using LR.

Fragment Replicate Join --- Key: PIG-554 URL: https://issues.apache.org/jira/browse/PIG-554 Project: Pig Issue Type: New Feature Affects Versions: types_branch Reporter: Shravan Matthur Narayanamurthy Assignee: Shravan Matthur Narayanamurthy Fix For: types_branch Attachments: frjofflat.patch, frjofflat1.patch

Fragment Replicate Join (FRJ) is useful when we want a join between a huge table and a very small table (fitting-in-memory small) and the join doesn't expand the data by much. The idea is to distribute the processing of the huge file by fragmenting it and replicating the small file to all machines receiving a fragment of the huge file. Because the entire small file is available, the join becomes a trivial task without needing any break in the pipeline. Exhaustive tests have been done to determine the improvement we get out of FRJ. Here are the details: http://wiki.apache.org/pig/PigFRJoin The patch makes changes to parts of the code where new operators are introduced. Currently, when a new operator is introduced, its alias is not set. For schema computation I have modified this behaviour to set the alias of the new operator to that of its predecessor. The logical side of the patch mimics the cogroup behavior, as join syntax closely resembles that of cogroup. Currently, this patch doesn't have support for joins other than inner joins. The rest of the code has been documented. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
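For readers placing this feature: once FRJ landed, a script asks for it with the 'replicated' keyword on a join; a minimal sketch (alias and file names made up):
{code}
big = load 'big_table';
tiny = load 'small_table';
C = join big by $0, tiny by $0 using 'replicated';
{code}
The right-hand input is the one that must fit in memory on each map task, matching the fragment/replicate split described in the issue.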
[jira] Commented: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
[ https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660148#action_12660148 ] Alan Gates commented on PIG-572: The code in the patch looks fine. I have a couple of questions: # What's the use case driving this? If a user has their pig script in a file, why do we expect them to be using PigServer directly instead of grunt? # Why does the logical plan need to be serializable?

A PigServer.registerScript() method, which lets a client programmatically register a Pig Script. Key: PIG-572 URL: https://issues.apache.org/jira/browse/PIG-572 Project: Pig Issue Type: New Feature Affects Versions: types_branch Reporter: Shubham Chopra Priority: Minor Fix For: types_branch Attachments: registerScript.patch

A PigServer.registerScript() method, which lets a client programmatically register a Pig script. For example, say there's a script my_script.pig with the following content:

a = load '/data/my_data.txt';
b = filter a by $0 > '0';

The function lets you use something like the following:

pigServer.registerScript("my_script.pig");
pigServer.registerQuery("c = foreach b generate $2, $3;");
pigServer.store("c");

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-596) Anonymous tuples in bags create ParseExceptions
[ https://issues.apache.org/jira/browse/PIG-596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660351#action_12660351 ] Alan Gates commented on PIG-596: Flattening a bag gets rid of two layers of containment, both the bag and the tuple. So the result of FLATTEN(bag(tuple(x, y, z))) is x, y, z, not tuple(x, y, z). At this point I believe tuples must be named in the LOAD statement as well as in foreach. I'm not necessarily voting against anonymous tuples. But I do believe Pig Latin is consistent in requiring names for tuples at the moment.

Anonymous tuples in bags create ParseExceptions --- Key: PIG-596 URL: https://issues.apache.org/jira/browse/PIG-596 Project: Pig Issue Type: Bug Affects Versions: types_branch Reporter: David Ciemiewicz

{code}
One = load 'one.txt' using PigStorage() as ( one: int );
LabelledTupleInBag = foreach One generate { ( 1, 2 ) } as mybag { tuplelabel: tuple ( a, b ) };
AnonymousTupleInBag = foreach One generate { ( 2, 3 ) } as mybag { tuple ( a, b ) }; -- Anonymous tuple creates bug
Tuples = union LabelledTupleInBag, AnonymousTupleInBag;
dump Tuples;
{code}

java.io.IOException: Encountered { tuple at line 6, column 66. Was expecting one of: parallel ... ; ... , ... : ... ( ... { IDENTIFIER ... { } ... [ ...
    at org.apache.pig.PigServer.parseQuery(PigServer.java:298)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:263)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:439)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:249)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:84)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:64)
    at org.apache.pig.Main.main(Main.java:306)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered { tuple at line 6, column 66.

Why can't there be an anonymous tuple at the top level of a bag? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
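A minimal sketch of the two-layer unwrapping the comment describes (file name and schema made up):
{code}
A = load 'myfile' as (mybag: bag { t: tuple (x: int, y: int, z: int) });
B = foreach A generate flatten(mybag); -- B's schema is (x, y, z), not a bag or a tuple
{code}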
[jira] Commented: (PIG-572) A PigServer.registerScript() method, which lets a client programmatically register a Pig Script.
[ https://issues.apache.org/jira/browse/PIG-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660388#action_12660388 ] Alan Gates commented on PIG-572: Passes all the tests. I'd like to wait until the Christmas vacation is over to give other committers a chance to comment before checking it in. If I don't see any comments after a few days I'll check it in.

A PigServer.registerScript() method, which lets a client programmatically register a Pig Script. Key: PIG-572 URL: https://issues.apache.org/jira/browse/PIG-572 Project: Pig Issue Type: New Feature Affects Versions: types_branch Reporter: Shubham Chopra Priority: Minor Fix For: types_branch Attachments: registerScript.patch

A PigServer.registerScript() method, which lets a client programmatically register a Pig script. For example, say there's a script my_script.pig with the following content:

a = load '/data/my_data.txt';
b = filter a by $0 > '0';

The function lets you use something like the following:

pigServer.registerScript("my_script.pig");
pigServer.registerQuery("c = foreach b generate $2, $3;");
pigServer.store("c");

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
[ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660944#action_12660944 ] alangates edited comment on PIG-580 at 1/5/09 1:51 PM:

In CombinerOptimizer.visitDistinct you have:

{code}
+                if (sawDistinctAgg) {
+                    // We want to combine only in the case where there is
+                    // only one PODistinct, which is the only input to an agg.
+                    // We apparently have seen a PODistinct before, so let's
+                    // not combine.
+                    sawNonAlgebraic = true;
+                }
{code}

but I can envision a case where you want to count multiple distinct things:

{code}
A = load ...
B = group A by $0;
C = foreach B {
    Aa = A.$1;
    Ab = distinct Aa;
    Ba = A.$2;
    Bb = distinct Ba;
    generate group, COUNT(Ab), COUNT(Bb);
}
{code}

Is there a reason we need to not use the combiner with multiple distincts?

PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach

Key: PIG-580
URL: https://issues.apache.org/jira/browse/PIG-580
Project: Pig
Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: types_branch
Attachments: PIG-580-v2.patch, PIG-580.patch

Currently Pig uses the combiner only when there is a foreach following a group and the elements in the foreach's generate have the following characteristics:
1) simple project of the group column
2) Algebraic UDF

The above conditions exclude use of the combiner for distinct aggregates - the distinct operation itself is combinable (irrespective of whether it feeds an algebraic or non-algebraic udf). So the following foreach should also be combinable:

{code}
..
b = group a by $0;
c = foreach b {
    x = distinct a;
    generate group, COUNT(x), SUM(x.$1);
}
{code}

The combiner optimizer should cause the distinct to be combined and the final combine output should feed the COUNT() and SUM() in the reduce.
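For context, "algebraic" here refers to Pig's org.apache.pig.Algebraic interface: a UDF the combiner can evaluate in stages. A rough, simplified skeleton of a combinable count (modeled on, but not identical to, the built-in COUNT) might look like this:

{code}
import java.io.IOException;
import java.util.Iterator;
import org.apache.pig.Algebraic;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MyCount extends EvalFunc<Long> implements Algebraic {

    public Long exec(Tuple input) throws IOException {
        // Non-combined path: count the whole bag directly.
        return ((DataBag) input.get(0)).size();
    }

    // Algebraic hands back the *class names* of the three stage functions.
    public String getInitial() { return Initial.class.getName(); }
    public String getIntermed() { return Intermed.class.getName(); }
    public String getFinal() { return Final.class.getName(); }

    static public class Initial extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            // Map side: emit a partial count for this chunk of the bag.
            long count = ((DataBag) input.get(0)).size();
            return TupleFactory.getInstance().newTuple(Long.valueOf(count));
        }
    }

    static public class Intermed extends EvalFunc<Tuple> {
        public Tuple exec(Tuple input) throws IOException {
            // Combiner: sum the partial counts.
            return TupleFactory.getInstance().newTuple(Long.valueOf(sum(input)));
        }
    }

    static public class Final extends EvalFunc<Long> {
        public Long exec(Tuple input) throws IOException {
            // Reduce: sum the combined partial counts into the answer.
            return Long.valueOf(sum(input));
        }
    }

    static protected long sum(Tuple input) throws IOException {
        DataBag values = (DataBag) input.get(0);
        long total = 0;
        for (Iterator<Tuple> it = values.iterator(); it.hasNext();) {
            total += (Long) it.next().get(0);
        }
        return total;
    }
}
{code}

The optimizer places Initial in the map, Intermed in the combiner, and Final in the reduce. A distinct is combinable for the same reason: duplicate elimination can be applied to each partial set before the final pass.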
[jira] Commented: (PIG-580) PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach
[ https://issues.apache.org/jira/browse/PIG-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660944#action_12660944 ] Alan Gates commented on PIG-580:

In CombinerOptimizer.visitDistinct you have:

{code}
+                if (sawDistinctAgg) {
+                    // We want to combine only in the case where there is
+                    // only one PODistinct, which is the only input to an agg.
+                    // We apparently have seen a PODistinct before, so let's
+                    // not combine.
+                    sawNonAlgebraic = true;
+                }
{code}

but I can envision a case where you want to count multiple distinct things:

{code}
A = load ...
B = group A by $0;
C = foreach B {
    Aa = A.$1;
    Ab = distinct Aa;
    Ba = A.$2;
    Bb = distinct Ba;
    generate group, COUNT(Ab), COUNT(Bb);
}
{code}

Is there a reason we need to not use the combiner with multiple distincts?

PERFORMANCE: Combiner should also be used when there are distinct aggregates in a foreach following a group provided there are no non-algebraics in the foreach

Key: PIG-580
URL: https://issues.apache.org/jira/browse/PIG-580
Project: Pig
Issue Type: Improvement
Affects Versions: types_branch
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
Fix For: types_branch
Attachments: PIG-580-v2.patch, PIG-580.patch

Currently Pig uses the combiner only when there is a foreach following a group and the elements in the foreach's generate have the following characteristics:
1) simple project of the group column
2) Algebraic UDF

The above conditions exclude use of the combiner for distinct aggregates - the distinct operation itself is combinable (irrespective of whether it feeds an algebraic or non-algebraic udf). So the following foreach should also be combinable:

{code}
..
b = group a by $0;
c = foreach b {
    x = distinct a;
    generate group, COUNT(x), SUM(x.$1);
}
{code}

The combiner optimizer should cause the distinct to be combined and the final combine output should feed the COUNT() and SUM() in the reduce.
[jira] Created: (PIG-599) BufferedPositionedInputStream isn't buffered
BufferedPositionedInputStream isn't buffered

Key: PIG-599
URL: https://issues.apache.org/jira/browse/PIG-599
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch

org.apache.pig.impl.io.BufferedPositionedInputStream is not actually buffered. This is because it sits atop an FSDataInputStream (somewhere down the stack), which is buffered. So, to avoid double buffering, which can be bad, BufferedPositionedInputStream was written without buffering. But the FSDataInputStream is far enough down the stack that it is still quite costly to call read() individually for each byte. A run through a profiler shows that a fair amount of time is being spent in BufferedPositionedInputStream.read().
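A minimal sketch of the kind of fix this implies (illustrative only, not the attached patch; class name and buffer size are made up): keep an in-memory buffer inside the stream, serve single-byte read() calls from it, and refill from the underlying stream in bulk while tracking the logical position.

{code}
import java.io.IOException;
import java.io.InputStream;

// Wraps an underlying stream (e.g. an FSDataInputStream), serving
// read() calls from a local buffer while tracking logical position.
public class SimpleBufferedPositionedStream extends InputStream {
    private final InputStream in;
    private final byte[] buf = new byte[64 * 1024];
    private int pos = 0;          // next byte to hand out
    private int limit = 0;        // bytes currently in buf
    private long streamPos = 0;   // logical position in the stream

    public SimpleBufferedPositionedStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        if (pos >= limit) {
            // One bulk read instead of one underlying call per byte.
            limit = in.read(buf, 0, buf.length);
            pos = 0;
            if (limit <= 0) return -1;  // EOF
        }
        streamPos++;
        return buf[pos++] & 0xff;
    }

    public long getPosition() {
        return streamPos;
    }
}
{code}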
[jira] Updated: (PIG-599) BufferedPositionedInputStream isn't buffered
[ https://issues.apache.org/jira/browse/PIG-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates updated PIG-599:

Status: Patch Available (was: Open)

BufferedPositionedInputStream isn't buffered

Key: PIG-599
URL: https://issues.apache.org/jira/browse/PIG-599
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Alan Gates
Fix For: types_branch
Attachments: loadperf.patch

org.apache.pig.impl.io.BufferedPositionedInputStream is not actually buffered. This is because it sits atop an FSDataInputStream (somewhere down the stack), which is buffered. So, to avoid double buffering, which can be bad, BufferedPositionedInputStream was written without buffering. But the FSDataInputStream is far enough down the stack that it is still quite costly to call read() individually for each byte. A run through a profiler shows that a fair amount of time is being spent in BufferedPositionedInputStream.read().