Re: Added Pig to the list of projects on Cloudera's public ReviewBoard instance

2010-08-26 Thread Dmitriy Ryaboy
Thanks Carl! On Thu, Aug 26, 2010 at 1:08 AM, Carl Steinbach c...@cloudera.com wrote: Hi, I added Pig to the list of projects that can be reviewed on Cloudera's public ReviewBoard instance, located at http://review.cloudera.org (AKA review.hbase.org). Review requests and comments are

Pig Contributor meeting notes

2010-08-25 Thread Dmitriy Ryaboy
functions some time after 0.8 is branched. The initial list of committers: Kevin Weil (Twitter), Dmitriy Ryaboy (Twitter), Carl Steinbach (Cloudera), and Russell Jurney (LinkedIn). Yahoo will also nominate someone. Please send us any thoughts you might have on this subject. It was suggested that a lot

Re: Caster interface and byte conversion

2010-08-24 Thread Dmitriy Ryaboy
Ryaboy wrote: The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :) -D On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I just noticed that even though

Re: August Pig contributor workshop

2010-08-23 Thread Dmitriy Ryaboy
. On Aug 18, 2010, at 4:45 PM, Dmitriy Ryaboy wrote: Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy On Tue, Aug 17, 2010 at 4:04 PM, Alan Gates ga...@yahoo-inc.com wrote: All, We will be holding the next Pig contributor workshop

is Hudson awol?

2010-08-23 Thread Dmitriy Ryaboy
Haven't heard anything from Hudson in a while... -D

Re: Caster interface and byte conversion

2010-08-22 Thread Dmitriy Ryaboy
The current HBase patch on PIG-1205 (patch 7) includes this refactoring. Please take a look if you have concerns. Or just if you feel like reviewing the code... :) -D On Sat, Aug 21, 2010 at 5:22 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: I just noticed that even though Utf8StorageConverter

Caster interface and byte conversion

2010-08-21 Thread Dmitriy Ryaboy
I just noticed that even though Utf8StorageConverter implements the various byte[] toBytes(Obj o) methods, they are not part of the LoadCaster interface -- and therefore can't be relied on when using modular Casters, like I am trying to do for the HBaseLoader. Since we don't want to introduce

Re: [VOTE] Pig to become a top level Apache project

2010-08-18 Thread Dmitriy Ryaboy
+1 for TLP +1 for Olga as PMC On Wed, Aug 18, 2010 at 10:34 AM, Alan Gates ga...@yahoo-inc.com wrote: Earlier this week I began a discussion on Pig becoming a TLP ( http://bit.ly/byD7L8 ). All of the received feedback was positive. So, let's have a formal vote. I propose we move Pig to a

Re: August Pig contributor workshop

2010-08-18 Thread Dmitriy Ryaboy
Hi folks, Please do RSVP so that we know how many people are coming. Thanks, -Dmitriy On Tue, Aug 17, 2010 at 4:04 PM, Alan Gates ga...@yahoo-inc.com wrote: All, We will be holding the next Pig contributor workshop at Twitter on Wednesday, August 25 from 4-6. The tentative agenda is to

Re: Restarting discussion on Pig as a TLP

2010-08-16 Thread Dmitriy Ryaboy
This sounds reasonable. +1. -D On Mon, Aug 16, 2010 at 1:46 PM, Alan Gates ga...@yahoo-inc.com wrote: Five months ago I started a discussion on whether Pig should become a top level project (TLP) at Apache instead of remaining a subproject of Hadoop (

preferred way to handle errors in LoadFunc constructor?

2010-08-15 Thread Dmitriy Ryaboy
Is there a preferred way to handle errors in LoadFunc initialization? I suspect that if I throw an exception in the constructor, the Pig process might die, which is not friendly, esp. to people working in the shell; but just printing out an error can obviously lead to trouble later on, as well.

Re: FW:

2010-07-07 Thread Dmitriy Ryaboy
It does -- lack of existence of a directory during planning does not imply the directory will be missing when you run. Sounds like the sort of thing one might want to put into PigUnit. On Wed, Jul 7, 2010 at 2:19 PM, Russell Jurney russell.jur...@gmail.com wrote: This is my most common error as

Re: Bug in new logical optimizer framework?

2010-07-01 Thread Dmitriy Ryaboy
Renato, I just want to make sure folks know -- Pig already has a number of such optimizations. Daniel's work is aimed at making it (much) easier to write such rules and to add a couple new ones. But some of the classic optimizations like projection and filter push-down already exist in the

Re: Avoiding serialization/de-serialization in pig

2010-06-28 Thread Dmitriy Ryaboy
For what it's worth, I saw very significant speed improvements (order of magnitude for wide tables with few projected columns) when I implemented (2) for our protocol buffer - based loaders. I have a feeling that propagating schemas when known, and using them for (de)serialization instead of

Another reason to switch to ANTLR

2010-06-26 Thread Dmitriy Ryaboy
http://www.eclipse.org/Xtext/documentation/ Wow. That would be huge.

Re: skew join in pig

2010-06-21 Thread Dmitriy Ryaboy
It's just whatever the hash function happens to do. By the time the hot keys are slotted to be spread among multiple reducers, they are no longer hot, so it doesn't matter if you put a few of the partitions in the same reducer. Remember, we mostly care about things we have to keep in memory. Since
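
The partitioning idea discussed in this thread can be illustrated with a toy sketch. This is not Pig's skewed-join code: the class name `SkewAwarePartitioner`, the `hotKeyFanOut` map, and the round-robin slot assignment are all assumptions made for illustration. It only shows the split the messages describe, where non-hot keys go wherever the hash function sends them and hot keys are spread over several reducers.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a toy partitioner that spreads hot keys over several reducers. */
final class SkewAwarePartitioner {
    private final int numReducers;
    // Hypothetical input: hot key -> number of reducers its records should be spread over.
    private final Map<String, Integer> hotKeyFanOut;
    // Next round-robin slot to use for each hot key.
    private final Map<String, Integer> nextSlot = new HashMap<>();

    SkewAwarePartitioner(int numReducers, Map<String, Integer> hotKeyFanOut) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        this.numReducers = numReducers;
        this.hotKeyFanOut = new HashMap<>(hotKeyFanOut);
    }

    /** Plain hash partitioning: whatever the hash function happens to do. */
    private int hashPartition(String key) {
        return Math.floorMod(key.hashCode(), numReducers);
    }

    /** Non-hot keys use the hash partition; hot keys rotate over fanOut adjacent reducers. */
    int partition(String key) {
        Integer requested = hotKeyFanOut.get(key);
        if (requested == null || requested <= 1) {
            return hashPartition(key);
        }
        int fanOut = Math.min(requested, numReducers);
        int slot = nextSlot.getOrDefault(key, 0);
        nextSlot.put(key, (slot + 1) % fanOut);
        return (hashPartition(key) + slot) % numReducers;
    }

    public static void main(String[] args) {
        Map<String, Integer> hot = new HashMap<>();
        hot.put("popular", 3);
        SkewAwarePartitioner p = new SkewAwarePartitioner(10, hot);
        for (int i = 0; i < 4; i++) {
            System.out.println("popular -> reducer " + p.partition("popular"));
        }
        System.out.println("rare -> reducer " + p.partition("rare"));
    }
}
```

In this sketch two slices of different hot keys may land on the same reducer, which matches the point made above: once a hot key has been split, each slice is no longer hot. A real skewed join would also have to send the matching records from the other input to every reducer that received part of a hot key; that step is left out here.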

Re: skew join in pig

2010-06-16 Thread Dmitriy Ryaboy
On Wed, Jun 16, 2010 at 9:16 AM, Alan Gates ga...@yahoo-inc.com wrote: 4. for non-hot keys, my understanding is that they are shuffled to reducers based on the default hash partitioner. However, it could happen that all the keys shuffled to one reducer incur skew even if none of them is skewed

Re: SIZE() of relation

2010-06-15 Thread Dmitriy Ryaboy
MR job to accomplish this. But I'm open to persuasion if everyone else disagrees. Alan. On Jun 11, 2010, at 7:27 PM, Russell Jurney wrote: This would be great. Save us from GROUP ALL/FOREACH, which is awkward. On Fri, Jun 11, 2010 at 7:14 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote

Re: SIZE() of relation

2010-06-11 Thread Dmitriy Ryaboy
It would be cool to just treat relations as bags in the general case. They kind of are, and kind of are not. Causes lots of user confusion. There are obvious users-doing-dumb-stuff scenarios that arise though. I guess the Pig philosophy is that the user is the optimizer, though.. so maybe it's ok.

algebraic optimization not invoked for filter following group?

2010-06-02 Thread Dmitriy Ryaboy
It looks like right now, the combiner optimization does not kick in for a script like this: data = load 'foo' using PigStorage() as (a, b, c); grouped = group data by a; filtered = filter grouped by COUNT(data) > 1000; Looking at the code in CombinerOptimizer, seems like the Filter bit is just

Re: [jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-05-27 Thread Dmitriy Ryaboy
No, that one is Hudson. While it was on, uh, medical leave, a bunch of patches got committed, so now I am guessing it's trying to apply a patch to a tree that already has said patch in it. Don't worry about it, this patch is already in trunk. -D On Thu, May 27, 2010 at 11:22 AM, Russell Jurney

pig 0.6

2010-04-29 Thread Dmitriy Ryaboy
Is anyone running unpatched 0.6 anywhere? I am in the process of putting together a jar for us, and getting worried about all the PruneColumns optimization fixes that came after the 0.6 release. -D

Re: Steps to get pig source code in Eclipse environment

2010-04-22 Thread Dmitriy Ryaboy
At some point you need to run ant so that it pulls down various dependencies and autogenerates some code -- this is probably the step that was missing when you used the subclipse plugin. I know people have used subclipse successfully before (me, I'm more of a command-line type). An ant target that

Re: PIG perfomance on join

2010-04-16 Thread Dmitriy Ryaboy
Still PIG-200 -D On Fri, Apr 16, 2010 at 1:37 PM, Radhikadevi Parvathaneni rparv...@acad.umass.edu wrote: hi Pig development team, Can you please provide me some skewed and non-skewed data sets for checking the performance of different join types in PIG. Thank you in advance Radhika

Please change your Jira passwords

2010-04-13 Thread Dmitriy Ryaboy
Apache systems were attacked earlier this month; details here: https://blogs.apache.org/infra/entry/apache_org_04_09_2010 Particularly important bit: Password Security *If you are a user of the Apache hosted JIRA, Bugzilla, or Confluence, a hashed copy of your password has been compromised.*

passing initialization parameters to algebraic functions

2010-04-08 Thread Dmitriy Ryaboy
If you define a UDF like this: DEFINE foo my.Udf('param1', 'param2'); data = foreach other_data generate foo(field); and my.Udf is an algebraic function, the Initial, Intermediate, and Final classes do not get initialized with the arguments passed into my.Udf in the DEFINE. Am I missing

Re: Begin a discussion about Pig as a top level project

2010-04-05 Thread Dmitriy Ryaboy
- From: Thejas Nair [mailto:te...@yahoo-inc.com] Sent: Friday, April 02, 2010 4:08 PM To: pig-dev@hadoop.apache.org; Dmitriy Ryaboy Subject: Re: Begin a discussion about Pig as a top level project I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and heavily

Re: What should FLATTEN do?

2010-04-02 Thread Dmitriy Ryaboy
CDH2 or CDH3? CDH2 is basically 0.{4,5}. CDH3 is in between 5 and 6. I expect the first result -- a flattened bag of tuples results in multiple rows, each containing the (not-flattened) tuple. Btw, Pig 0.6 is out. -D On Fri, Apr 2, 2010 at 11:32 AM, hc busy hc.b...@gmail.com wrote: doh

Re: Begin a discussion about Pig as a top level project

2010-03-31 Thread Dmitriy Ryaboy
Over time, Pig is increasing its coupling to Hadoop (for good reasons), rather than decreasing it. If and when Pig becomes a viable entity without hadoop around, it might make sense as a TLP. As is, I think becoming a TLP will only introduce unnecessary administrative and bureaucratic headaches.

Broken build

2010-03-15 Thread Dmitriy Ryaboy
Hi guys, Trunk has been broken for a while. A bunch of tests in the test-commit target fail, mostly due to "The import org.apache.pig.experimental.logical.optimizer.PlanPrinter cannot be resolved." Could someone check in the missing file? -D

Re: Operating on Cogroups and Iterations in Pig Re: more bagging fun

2010-03-12 Thread Dmitriy Ryaboy
hc, Good stuff. I was thinking along very similar lines with regards to allowing mapping a function over a bag. I suspect a MAP can actually be written as a udf. We'd just have to pass the name of the function to be mapped and call InstantiateFuncFromSpec on it. We may want a different name for

Re: Will Pig support SQL?

2010-02-08 Thread Dmitriy Ryaboy
Jian, If what you are looking for is something that will let you deal with skewed data and forget about how the underlying distributed system works, both Pig and Hive will help you do that to some extent. If you are looking for something that will let you exercise fine-grained control over

Re: Will Pig support SQL?

2010-02-08 Thread Dmitriy Ryaboy
that will be fed into a total of 200 reducers. -D On Mon, Feb 8, 2010 at 7:16 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Jian, If what you are looking for is something that will let you deal with skewed data and forget about how the underlying distributed system works, both Pig and Hive will help

Re: Private variables are not eco-friendly

2010-02-03 Thread Dmitriy Ryaboy
for inheritance arises rather than begin as protected? -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Tuesday, February 02, 2010 7:35 PM To: pig-dev@hadoop.apache.org Subject: Private variables are not eco-friendly Hi all, I keep running into problems trying

Re: reading/writing HBase in Pig

2010-01-14 Thread Dmitriy Ryaboy
Hi Mike, It would be great to have a StoreFunc for HBase! There is a rewrite underway for the Load/Store stuff that will make that a lot easier -- see https://issues.apache.org/jira/browse/PIG-966 . You may want to consider writing it for the load-store redesign branch. This is what's probably

Re: pig processing bzip2 data

2010-01-11 Thread Dmitriy Ryaboy
Both are caused by you running in local mode by default. On Mon, Jan 11, 2010 at 5:36 PM, felix gao gre1...@gmail.com wrote: Following up on the previous email. I have noticed the following: I have a pig script called Overlap that reads in a bunch of *.bz2 files. If I run the following command

Re: time to release Pig 0.6.0

2010-01-08 Thread Dmitriy Ryaboy
. Olga -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Thursday, January 07, 2010 2:20 PM To: pig-dev@hadoop.apache.org Subject: Re: time to release Pig 0.6.0 Having just been hit by this -- any chance we can put http://issues.apache.org/jira/browse/PIG

Re: time to release Pig 0.6.0

2010-01-07 Thread Dmitriy Ryaboy
Olga, Are there any changes in 0.6 that are not backwards-compatible, or is all that only in trunk? -Dmitriy On Thu, Jan 7, 2010 at 10:33 AM, Olga Natkovich ol...@yahoo-inc.com wrote: Pig Developers, Since we have branched for the release, we have fixed a lot of bugs and stabilized the

Re: time to release Pig 0.6.0

2010-01-07 Thread Dmitriy Ryaboy
in UDFs. The only modification that changes things a bit is moving local mode from native to Hadoop's. Olga -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Thursday, January 07, 2010 10:44 AM To: pig-dev@hadoop.apache.org Subject: Re: time to release Pig 0.6.0

Re: Pig reading hive columnar rc tables

2009-11-30 Thread Dmitriy Ryaboy
That's awesome, I've been itching to do that but never got around to it.. Garrit, do you have any benchmarks on read speeds? I don't know about putting this in piggybank, as it carries with it pretty significant dependencies, increasing the size of the jar and making it difficult for users to

Re: Pig reading hive columnar rc tables

2009-11-30 Thread Dmitriy Ryaboy
Sorry I misspelled your name, Gerrit. -D On Mon, Nov 30, 2009 at 3:18 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: That's awesome, I've been itching to do that but never got around to it.. Garrit, do you have any benchmarks on read speeds? I don't know about putting this in piggybank

Re: Pig reading hive columnar rc tables

2009-11-30 Thread Dmitriy Ryaboy
, at 12:18 PM, Dmitriy Ryaboy wrote: That's awesome, I've been itching to do that but never got around to it.. Garrit, do you have any benchmarks on read speeds? I don't know about putting this in piggybank, as it carries with it pretty significant dependencies, increasing the size

Re: Is Pig dropping records?

2009-11-21 Thread Dmitriy Ryaboy
Rash s...@ning.com On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote: Zaki, Glad to hear it wasn't Pig's fault! Can you post a description of what was going on with S3, or at least how you fixed it? -D On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman zaki.raha...@gmail.com wrote: Okay

Re: Welcome Jeff Zhang

2009-11-19 Thread Dmitriy Ryaboy
Congrats Jeff! On Thu, Nov 19, 2009 at 7:47 PM, Jeff Zhang zjf...@gmail.com wrote: I am very glad to join the pig family. I have grown and learned a lot with others' help in the last nine months. I will continue to contribute to pig and learn from others. Jeff Zhang On Thu, Nov 19, 2009 at

RequiredFields contents

2009-11-05 Thread Dmitriy Ryaboy
Hi all, I am looking at the RequiredFields class and it has this explanation of what getFields() returns: /** List of fields required from the input. This includes fields that are transformed, and thus are no longer the same fields. Using the example 'B = foreach A

Re: How to clone a logical plan ?

2009-11-05 Thread Dmitriy Ryaboy
Richard, The Load/Store redesign proposal has an interface that defines how stats get represented; a loader that implements ResourceLoader will pass statistics up into Pig, which will then take care of doing whatever it needs to do with them. The specifics of how the stats get loaded in by the

Re: two-level access problem?

2009-11-03 Thread Dmitriy Ryaboy
= false; -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Monday, November 02, 2009 5:33 PM To: pig-dev@hadoop.apache.org Subject: two-level access problem? Could someone explain the nature of the two-level access problem referred to in the Load/Store redesign

two-level access problem?

2009-11-02 Thread Dmitriy Ryaboy
Could someone explain the nature of the two-level access problem referred to in the Load/Store redesign wiki and in the DataType code? Thanks, -D

Re: Custom Loadfunc problem!

2009-10-28 Thread Dmitriy Ryaboy
problem! Date: Tue, 27 Oct 2009 23:40:43 -0800 I mean hadoop's local mode not pig's own local mode -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: October 26, 2009 6:33 To: pig-dev@hadoop.apache.org; pig-dev@hadoop.apache.org Subject: RE: Custom Loadfunc problem

RE: Custom Loadfunc problem!

2009-10-26 Thread Dmitriy Ryaboy
Jeff, Slicers don't work in local mode, there is an ancient ticket for that on the Jira. Richard -- hard to say what's going on without more code. Think you can come up with a simplified version of your loadfunc that fails in a similar manner, and share it? -Original Message- From:

Re: Custom Loadfunc problem!

2009-10-26 Thread Dmitriy Ryaboy
Do you get any of your Log messages to come out, or none at all? -D 2009/10/26 RichardGUO Fei gladiato...@hotmail.com: Hi, This is the rough source code of the slicer/loadfunc: public class HadoopStoreStorage extends Utf8StorageConverter implements LoadFunc, Slicer { private

LocalRearrange out of bounds exception - tips for debugging?

2009-10-13 Thread Dmitriy Ryaboy
We ran into what looks like some edge case bug in Pig, which causes it to throw an IndexOutOfBoundsException (stack trace below). The script just joins two relations; it looks like our data was generated incorrectly, and the join is empty, which may be what's causing the failure. It also appears

Re: High(er) res Pig logo?

2009-09-28 Thread Dmitriy Ryaboy
. Also, we're working on cleaning up the Pig with Y! logo issue. Alan. On Sep 27, 2009, at 9:59 AM, Dmitriy Ryaboy wrote: Where can one find the Pig logo in a size/resolution suitable for presentations? Also, I went on the website and noticed that the Y! reappeared on Pig's chest. -D

High(er) res Pig logo?

2009-09-27 Thread Dmitriy Ryaboy
Where can one find the Pig logo in a size/resolution suitable for presentations? Also, I went on the website and noticed that the Y! reappeared on Pig's chest. -D

Re: [VOTE] Release Pig 0.4.0 (candidate 2)

2009-09-22 Thread Dmitriy Ryaboy
Olga, which test failed? If it's one of the ones I contributed, I'll fix it. -D On Mon, Sep 21, 2009 at 8:54 PM, Olga Natkovich ol...@yahoo-inc.com wrote: Hi, The new version is available in http://people.apache.org/~olga/pig-0.4.0-candidate-2/. I see one failure in a unit test in

Did Sybase just invent Pig Latin?

2009-09-19 Thread Dmitriy Ryaboy
http://iablog.sybase.com/paulley/2009/08/is-sql-a-failed-abstraction/ Gosh that looks familiar. -D

Re: Request for feedback: cost-based optimizer

2009-09-11 Thread Dmitriy Ryaboy
so what we implement works for what you need. Alan. On Sep 1, 2009, at 9:54 AM, Dmitriy Ryaboy wrote: Whoops :-) Here's the Google doc: http://docs.google.com/Doc?docid=0Adqb7pZsloe6ZGM4Z3o1OG1fMjFrZjViZ21jdA&hl=en -Dmitriy On Tue, Sep 1, 2009 at 12:51 PM, Santhosh Srinivasan s...@yahoo

Re: Request for feedback: cost-based optimizer

2009-09-03 Thread Dmitriy Ryaboy
. But now Pig is a subproject of hadoop and almost all Pig users are using hadoop, I think it is fine to optimize things towards hadoop. Dmitriy Ryaboy wrote: Our initial survey of related literature showed that the usual place for a CBO tends to be between the physical and logical layer (in fact

Request for feedback: cost-based optimizer

2009-09-01 Thread Dmitriy Ryaboy
Hi everyone, Attached is a (very) preliminary document outlining a rough design we are proposing for a cost-based optimizer for Pig. This is being done as a capstone project by three CMU Master's students (myself, Ashutosh Chauhan, and Tejal Desai). As such, it is not necessarily meant for

Re: Request for feedback: cost-based optimizer

2009-09-01 Thread Dmitriy Ryaboy
and just send the URL ? Thanks, Santhosh -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Tuesday, September 01, 2009 9:48 AM To: pig-dev@hadoop.apache.org Subject: Request for feedback: cost-based optimizer Hi everyone, Attached is a (very) preliminary

Re: Request for feedback: cost-based optimizer

2009-09-01 Thread Dmitriy Ryaboy
in the design, to be honest). But we feel that the implementations have to be execution mode specific. -Dmitriy On Tue, Sep 1, 2009 at 6:26 PM, Jianyong Dai jiany...@yahoo-inc.com wrote: I am still reading but one interesting question is why you decide to put CBO in physical layer? Dmitriy Ryaboy wrote

Re: Pig 0.4.0 release

2009-08-18 Thread Dmitriy Ryaboy
in a version of hadoop20.jar that will work for users who want to build with 0.20. This way users can still build this if they want and our release isn't blocked on the patch. Alan. On Aug 17, 2009, at 12:03 PM, Dmitriy Ryaboy wrote: Olga, Do non-committers get a vote? Zebra is in trunk

Re: Pig 0.4.0 release

2009-08-17 Thread Dmitriy Ryaboy
Olga, Do non-committers get a vote? Zebra is in trunk, but relies on 0.20, which is somewhat inconsistent even if it's in contrib/. Would love to see dynamic (or at least static) shims incorporated into the 0.4 release (see PIG-660, PIG-924). There are a couple of bugs still outstanding that I

Re: Is there any document about the JobControlCompiler

2009-07-08 Thread Dmitriy Ryaboy
Jeff, Chris Olston answered this a while back: http://markmail.org/thread/xnwutstlftnyycxs (by the way, MarkMail is awesome for searching mailing list archives. Highly recommended.) There are some changes that have to do with sampling and multi-store, but that email will give you the general

Re: COUNT, AVG and nulls

2009-07-06 Thread Dmitriy Ryaboy
+1 for standard semantics. We need a COALESCE function to go along with this. -D On Mon, Jul 6, 2009 at 10:46 AM, Olga Natkovich ol...@yahoo-inc.com wrote: Hi, The current implementation of COUNT and AVG in Pig counts null values. This is inconsistent with SQL semantics and also with
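
The "standard semantics" referred to here are the SQL rules: COUNT over a column and AVG both skip null values, and COALESCE returns its first non-null argument. The following is a minimal sketch of those rules in plain Java. It is not Pig code; the class and method names are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalDouble;

/** Illustrative only: SQL-style null handling for COUNT, AVG and COALESCE. */
final class NullAwareAggregates {

    private NullAwareAggregates() {}

    /** COUNT(col): counts only the non-null values. */
    static long count(List<Integer> values) {
        return values.stream().filter(v -> v != null).count();
    }

    /** AVG(col): averages the non-null values; empty when there are none. */
    static OptionalDouble avg(List<Integer> values) {
        return values.stream()
                .filter(v -> v != null)
                .mapToInt(Integer::intValue)
                .average();
    }

    /** COALESCE(a, b, ...): the first non-null argument, or null if all are null. */
    @SafeVarargs
    static <T> T coalesce(T... candidates) {
        for (T candidate : candidates) {
            if (candidate != null) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Integer> column = Arrays.asList(10, null, 20, null);
        System.out.println(count(column));      // 2
        System.out.println(avg(column));        // OptionalDouble[15.0]
        System.out.println(coalesce(null, 0));  // 0
    }
}
```

With these rules, a caller who wants nulls treated as zeros in an average has to say so explicitly, for example by applying coalesce(value, 0) to each value first, which is presumably why a COALESCE function is requested alongside the change.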

Re: requirements for Pig 1.0?

2009-06-24 Thread Dmitriy Ryaboy
we're ready to consider 1.0. It would be nice to be 1.0 not too long after Hadoop is, which still gives us at least 6-9 months. Alan. On Jun 22, 2009, at 10:58 AM, Dmitriy Ryaboy wrote: I know there was some discussion of making the types release (0.2) a Pig 1 release, but that got

requirements for Pig 1.0?

2009-06-22 Thread Dmitriy Ryaboy
I know there was some discussion of making the types release (0.2) a Pig 1 release, but that got nixed. There wasn't a similar discussion on 0.3. Has the list of want-to-haves for Pig 1.0 been discussed since?

Re: apache parsing and pig-830

2009-06-09 Thread Dmitriy Ryaboy
Thanks for doing the work in the first place :-) Sorry about the lack of attribution. No @author tags... -D On Mon, Jun 8, 2009 at 11:13 PM, Earl Cahill cahi...@yahoo.com wrote: I am planning on coming to the hadoop stuff out near san fran, wednesday and thursday, thought I would get the