Re: Avoiding serialization/de-serialization in pig

2010-06-30 Thread Thejas Nair
On 6/28/10 5:51 PM, "Dmitriy Ryaboy" wrote: > > I have a feeling that propagating schemas when known, and using them for > (de)serialization instead of reflecting every field, would also be a big > win. > > Thoughts on just using Avro for the internal PigStorage? When I profiled pig quer

Avoiding serialization/de-serialization in pig

2010-06-28 Thread Thejas Nair
I have created a wiki page that puts together some ideas that can help improve performance by avoiding or delaying serialization/de-serialization. http://wiki.apache.org/pig/AvoidingSedes These are ideas that don't involve changes to the optimizer. Most of them involve changes in the load/store functi

Re: Begin a discussion about Pig as a top level project

2010-04-02 Thread Thejas Nair
I agree with Alan and Dmitriy - Pig is tightly coupled with hadoop, and heavily influenced by its roadmap. I think it makes sense to continue as a sub-project of hadoop. -Thejas On 3/31/10 4:04 PM, "Dmitriy Ryaboy" wrote: > Over time, Pig is increasing its coupling to Hadoop (for good reasons

LoadFunc.skipNext() function for faster sampling ?

2009-11-03 Thread Thejas Nair
In the new implementation of SampleLoader subclasses (used by order-by, skew-join ..) as part of the loader redesign, we are not only reading all the input records but also parsing them as pig tuples. This is because the SampleLoaders are wrappers around the actual input loaders specified in the q

Re: LoadFunc.skipNext() function for faster sampling ?

2009-11-03 Thread Thejas Nair
? Pig will have access > to the InputFormat instance, correct? Can it not call > InputFormat.getNext the desired number of times (which will not parse > the tuple) and then call LoadFunc.getNext to get the next parsed tuple? > > Alan. > > On Nov 3, 2009, at 4:28 PM, Thejas Nair w
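
The idea in the quoted reply (advance the underlying reader without parsing, then parse only the record that is actually sampled) can be sketched roughly as below. This is a standalone illustration, not Pig or Hadoop code: `parse_tuple` and `sample_every_nth` are hypothetical names, and tab-delimited text lines are assumed as the raw record format.

```python
def parse_tuple(line):
    """Hypothetical parser: turn one tab-delimited text line into a tuple of strings."""
    return tuple(line.rstrip("\n").split("\t"))


def sample_every_nth(raw_records, parse, n):
    """Yield the parsed form of every n-th raw record.

    Records that are not sampled are read but never passed to `parse`,
    so the parsing cost is paid only for the sampled records.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    for index, raw in enumerate(raw_records, start=1):
        if index % n == 0:
            yield parse(raw)


if __name__ == "__main__":
    lines = ["1\ta\n", "2\tb\n", "3\tc\n", "4\td\n"]
    for record in sample_every_nth(lines, parse_tuple, 2):
        print(record)
```

Whether skipping pays off depends on how much of the per-record cost is parsing rather than reading, which this sketch does not measure.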

Re: Definition of equality of bags

2009-11-02 Thread Thejas Nair
fix it, I am not filing a jira. -Thejas On 11/2/09 9:19 AM, "Thejas Nair" wrote: > I could not find any documentation (in piglatin manual) on what the > definition of equality of bags is (or what it should be), does the order of > tuples in the bag matter ? But the definitio

Definition of equality of bags

2009-11-02 Thread Thejas Nair
I could not find any documentation (in piglatin manual) on what the definition of equality of bags is (or what it should be). Does the order of tuples in the bag matter? The definition of a bag does not imply any ordering. This has implications for the definition of join/cogroup/group on bags.
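
One possible reading of the question above is that two bags are equal when they contain the same tuples with the same multiplicities, regardless of order. A minimal sketch of that definition follows; it is an assumption for illustration, not Pig's actual behaviour. `bags_equal` and `bag_group_key` are hypothetical names, and tuples are modelled as hashable Python tuples (so nested bags are not covered).

```python
from collections import Counter


def bags_equal(bag_a, bag_b):
    """True when both bags hold the same tuples the same number of times, in any order."""
    return Counter(bag_a) == Counter(bag_b)


def bag_group_key(bag):
    """Order-independent, hashable key, so bags that are equal fall into the same group."""
    return frozenset(Counter(bag).items())


if __name__ == "__main__":
    left = [(1, "a"), (2, "b")]
    right = [(2, "b"), (1, "a")]
    print(bags_equal(left, right))
    print(bag_group_key(left) == bag_group_key(right))
```

Under this definition a group or join on a bag-valued key would need an order-independent key such as `bag_group_key`; comparing the tuples positionally would treat the two bags above as different.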

Re: [VOTE] Release Pig 0.5.0 (candidate 0)

2009-10-29 Thread Thejas Nair
I think we should include the fix for PIG-1048 (skew join incorrect results) in the release. There is already a patch for it. -Thejas On 10/29/09 1:54 PM, "Olga Natkovich" wrote: > With 3 +1s from Hadoop PMC (Alan Gates, Chris Douglas, and Olga > Natkovich) and no -1s, the release passed the vote

Re: switching to different parser in Pig

2009-08-25 Thread Thejas Nair
Jflex is covered by GPL, but code generated by it is not. Only the code that is generated by Jflex goes into pig.jar. We can't check Jflex.jar into svn; ivy will be set up to download it from the maven repository. -Thejas On 8/25/09 11:57 AM, "Dmitriy Ryaboy" wrote: > Santosh, > Am I missing some

Re: Proposal to create a branch for contrib project Zebra

2009-08-18 Thread Thejas Nair
I think we are creating unnecessary bureaucratic hurdles here by preventing a contrib project from having a branch. I don't see why zebra has to use the pig release branch, as the new pig release does not include it. The decisions are supposed to help keep things open, but this seems to be forcing Ra

Re: A proposal for changing pig's memory management

2009-05-15 Thread Thejas Nair
With a constraint that all scalar values in a tuple should fit into a single buffer, the values will always have to be copied whenever a tuple's contents need to be copied to a new tuple after a relational operation. The overhead of copying is not large for numeric types compared to the existing imp

Re: [Pig Wiki] Update of "ProposedProjects" by AlanGates

2009-04-16 Thread Thejas Nair
This paper seems very relevant to the proposal - "Compiled Query Execution Engine using JVM" http://www2.computer.org/portal/web/csdl/doi/10.1109/ICDE.2006.40 From the abstract - "Our experimental results on the TPC-H data set show that, despite both engines benefiting from JIT, the compiled engi

Re: [jira] Created: (PIG-729) Use of default parallelism

2009-03-23 Thread Thejas Nair
Pig users might not know enough to decide on a good default parallelism, especially when running ad hoc queries. Instead of defaulting to 1 when a user does not specify the parallelism, we should default to a higher number that does not have a negative impact on the throughput of the system. Ha

FW: scope string in OperatorKey

2009-03-19 Thread Thejas Nair
I will create a JIRA for this change. -Thejas -- Forwarded Message From: Alan Gates Date: Mon, 16 Mar 2009 07:56:32 -0700 To: Thejas Nair Subject: Re: scope string in OperatorKey +1. Alan. On Mar 11, 2009, at 11:53 AM, Thejas Nair wrote: > The id in OperatorKey helps distingu

Re: scope string in OperatorKey

2009-03-11 Thread Thejas Nair
easy to > distinguish operators without it? IIRC the OperatorKey includes an > operator number. When looking at the explain plans this is useful for > cases where there is more than one of a given type of operator and you > want to be able to distinguish between them. > > Alan.

scope string in OperatorKey

2009-03-06 Thread Thejas Nair
What is the purpose of the scope string in org.apache.pig.impl.plan.OperatorKey? Is it meant to be used if we have a pig daemon process? Is it OK to stop printing the scope part in explain output? It does not seem to add value and makes the output more verbose. Thanks, Thejas