Re: Pig Data type question

Prashant Kommireddi Fri, 25 Nov 2011 13:28:42 -0800

> Yeah, your use case makes sense, though have you done any benchmarking to
> see how significantly eliminating the TOTUPLE call will benefit
> performance? I'd be curious if it was so significant.


Yes, it's an O(n) vs. O(1) operation: TOTUPLE is O(n), whereas UDF(*, 'arg')
is O(1). Basically, the UDF takes the String argument 'arg' and looks up the
matching field in the Tuple through a HashMap that stores the 'arg'-to-index
mapping.
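
A minimal sketch of the lookup described above, written against plain Java collections rather than the Pig API so it runs on its own. The class name, the field names, and the use of a List<Object> in place of Pig's Tuple are assumptions made for illustration; only the idea of a HashMap from 'arg' to field index comes from the message.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the UDF's lookup logic; a List<Object> plays the role of a Tuple.
final class FieldLookupSketch {
    // Hypothetical field-name-to-position mapping, built once up front.
    private final Map<String, Integer> indexByName = new HashMap<>();

    FieldLookupSketch(List<String> fieldNames) {
        for (int i = 0; i < fieldNames.size(); i++) {
            indexByName.put(fieldNames.get(i), i);
        }
    }

    // Models UDF(*, 'arg'): the last element is the name, the rest are the fields of *.
    Object lookup(List<Object> input) {
        String arg = (String) input.get(input.size() - 1);
        Integer idx = indexByName.get(arg);   // constant-time on average
        return idx == null ? null : input.get(idx);
    }

    public static void main(String[] args) {
        FieldLookupSketch udf =
                new FieldLookupSketch(Arrays.asList("id", "name", "score"));
        List<Object> input = Arrays.<Object>asList(7, "alice", 0.5, "name");
        System.out.println(udf.lookup(input));
    }
}
```

The lookup itself never copies or walks the fields; whether that saves anything measurable over TOTUPLE in a real Pig job is exactly what the benchmarking question is about.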

> Also, if the string you're passing is just a static argument, it's probably
> cleaner to put it in the constructor, and then use a DEFINE statement to
> instantiate it.

Unfortunately, the String argument is not static.

> But yeah, I mean, even if Pig supported this functionality more cleanly,
> there is a problem matching TOTUPLE(*) and * because * could just be a
> simple Tuple, and there would be ambiguity there. I would test to see if
> there is actually a material benefit to doing this.

If * were a Tuple, Pig should invoke UDF(tuple, chararray). Notice that Pig
treats BETTERUDF(*, 'arg') as a single argument: a Tuple of fields
containing the values from *, followed by 'arg' as the last value. If * were
a Tuple itself, Pig should treat that as UDF(tuple, 'arg').
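
To make the two argument shapes concrete, here is a small sketch that tells them apart at runtime. It again uses a List<Object> as a stand-in for Pig's Tuple, and the class and method names are hypothetical; it is not how Pig resolves getArgToFuncMapping(), only an illustration of what the EvalFunc would see in each case.

```java
import java.util.Arrays;
import java.util.List;

final class ArgShapeSketch {
    // Returns "wrapped" for a (tuple, chararray) input, "flat" otherwise.
    static String classify(List<Object> input) {
        boolean wrapped = input.size() == 2
                && input.get(0) instanceof List
                && input.get(1) instanceof String;
        return wrapped ? "wrapped" : "flat";
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.<Object>asList(1, "x", 2.0);

        // UDF(TOTUPLE(*), 'arg'): two elements, the first is itself a tuple.
        System.out.println(classify(Arrays.<Object>asList(row, "arg")));

        // UDF(*, 'arg'): the fields of * followed by 'arg'.
        System.out.println(classify(Arrays.<Object>asList(1, "x", 2.0, "arg")));

        // If * is a single tuple-valued field, UDF(*, 'arg') has the same shape
        // as the wrapped call, so a shape check alone cannot separate them.
        System.out.println(classify(Arrays.<Object>asList(row, "arg")));
    }
}
```

The third call is the ambiguous case raised earlier in the thread: it is classified as wrapped, which matches the position taken here that such input should be handled as UDF(tuple, 'arg').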

On Fri, Nov 25, 2011 at 1:15 PM, Jonathan Coveney <[email protected]> wrote:

> Yeah, your use case makes sense, though have you done any benchmarking to
> see how significantly eliminating the TOTUPLE call will benefit
> performance? I'd be curious if it was so significant.
>
> Also, if the string you're passing is just a static argument, it's probably
> cleaner to put it in the constructor, and then use a DEFINE statement to
> instantiate it.
>
> But yeah, I mean, even if Pig supported this functionality more cleanly,
> there is a problem matching TOTUPLE(*) and * because * could just be a
> simple Tuple, and there would be ambiguity there. I would test to see if
> there is actually a material benefit to doing this.
>
> 2011/11/25 Prashant Kommireddi <[email protected]>
>
> > In the case where arguments are UDF(TOTUPLE(*), 'arg'), the EvalFunc
> > actually receives a single Tuple with 2 elements - first one being a
> > Tuple and the 2nd a chararray. In case the arguments were UDF(*, 'arg')
> > the EvalFunc receives a Tuple with multiple fields (* and 'arg' being
> > the last element in that Tuple). I feel Pig should be able to
> > distinguish between the 2 cases here.
> >
> > To answer your question,
> >
> > *what if * is in fact just a Tuple of something? So you have
> >
> > TOTUPLE(tuple), 'chararray'
> > tuple, 'chararray'
> >
> > which one should they match? The one intended for TOTUPLE(*), or the one
> > intended for just *? Because both would match just a tuple.*
> >
> > It should match UDF(Tuple, chararray). It's up to the UDF to handle the
> > inner elements of the Tuple, but getArgToFuncMapping() should be able to
> > invoke the right UDF, at least.
> >
> > The reason I am trying to overload the function is because I have already
> > exposed UDF(TOTUPLE(*), chararray) to my users. I have now come up with a
> > better UDF - BETTERUDF(*, 'arg') in terms of performance (avoiding a
> > TOTUPLE call) and want users to be able to just change the arguments they
> > pass to their UDF and be able to use the new one.
> >
> > The FloatAbs function is intended for scalar values, so it makes sense not
> > to wrap it in a TOTUPLE.
> >
> > On Fri, Nov 25, 2011 at 11:48 AM, Jonathan Coveney <[email protected]> wrote:
> >
> > > I believe that this is a current limitation of Pig: you can't have a
> > > function that uses both getArgToFuncMapping and a variable number of
> > > arguments. In this case, it kind of makes sense that you can't, though.
> > > For example:
> > >
> > > what if * is in fact just a Tuple of something? So you have
> > >
> > > TOTUPLE(tuple), 'chararray'
> > > tuple, 'chararray'
> > >
> > > which one should they match? The one intended for TOTUPLE(*), or the one
> > > intended for just *? Because both would match just a tuple.
> > >
> > > Hmm, one more thing, though, which also is important: you're re-wrapping
> > > the argument in a Tuple. It is implicit that the input to your EvalFunc
> > > will come in the form of a Tuple. In the UDF example, note that they
> > > don't rewrap in a tuple:
> > >
> > > funcList.add(new FuncSpec(FloatAbs.class.getName(),
> > >     new Schema(new Schema.FieldSchema(null, DataType.FLOAT))));
> > >
> > > So unless your argument will be explicitly rewrapped in a tuple, you
> > > don't need that piece.
> > >
> > > But yeah, someone else can chime in with whether getArgToFuncMapping can
> > > do what you want it to do, but I don't think it can. My suggestion would
> > > be to a) choose one form of input and stick to that, instead of trying to
> > > support two forms and b) you could have an initializer in your EvalFunc
> > > that, on the first input, inspects the types and figures out which
> > > function to use to process the input.
> > >
> > > We do need to make funcspecs play nice with variable numbers of
> > > arguments, though, especially now that more schema info is available.
> > >
> > > 2011/11/25 Prashant Kommireddi <[email protected]>
> > >
> > > > Thanks Jonathan.
> > > >
> > > > What do I check for as the input type? DataType.TUPLE does not seem
> > > > to work. I would like to use "getArgToFuncMapping()" to be able to
> > > > invoke different functions based on input type, and I am not sure how
> > > > to check for Case 2.
> > > >
> > > > In my implementation, Case 1 could be checked for (DataType.TUPLE,
> > > > DataType.CHARARRAY) but for Case 2 I would assume it should be
> > > > (DataType.TUPLE) but that does not work. Pig UDF cannot infer a
> > > > matching function.
> > > >
> > > >    @Override
> > > >    public List<FuncSpec> getArgToFuncMapping()
> > > >            throws FrontendException {
> > > >        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
> > > >        Schema s = new Schema();
> > > >        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
> > > >        s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
> > > >        funcList.add(new FuncSpec(this.getClass().getName(), s));
> > > >
> > > >        s = new Schema();
> > > >        s.add(new Schema.FieldSchema(null, DataType.TUPLE));
> > > >        funcList.add(new FuncSpec(CustomUDF.class.getName(), s));
> > > >
> > > >        return funcList;
> > > >    }
> > > >
> > > >
> > > >
> > > > On Fri, Nov 25, 2011 at 12:52 AM, Jonathan Coveney <[email protected]> wrote:
> > > >
> > > > > The first case will give you a tuple which contains, as its first
> > > > > element, a tuple of all of the stuff in *, and as its second element,
> > > > > 'input'.
> > > > >
> > > > > The second will give you a tuple which contains all of the elements
> > > > > of *, and then as its last element, 'input'.
> > > > >
> > > > > This is what I thought, but to be sure I ran this UDF:
> > > > >
> > > > > import org.apache.pig.EvalFunc;
> > > > > import java.io.IOException;
> > > > > import org.apache.pig.data.Tuple;
> > > > >
> > > > > public class ATHING extends EvalFunc<String> {
> > > > >  public String exec(Tuple input) throws IOException {
> > > > >    System.out.println(input.toString());
> > > > >    return null;
> > > > >   }
> > > > > }
> > > > >
> > > > > 2011/11/24 Prashant Kommireddi <[email protected]>
> > > > >
> > > > > > I have a question regarding the pig data types.
> > > > > >
> > > > > > If I have a UDF, say 'CustomUDF' and I do something like this:
> > > > > >
> > > > > > REGISTER 'foo.jar';
> > > > > >
> > > > > > A = LOAD '/shared/a.dat';
> > > > > >
> > > > > > What would be the difference in the data types for UDF arguments
> > > > > > between -->
> > > > > >
> > > > > > Case 1 : B = FOREACH A GENERATE CustomUDF(TOTUPLE(*), 'input');
> > > > > > AND
> > > > > > Case 2 : B = FOREACH A GENERATE CustomUDF(*, 'input');
> > > > > >
> > > > > > I am sure Case 1 is (tuple, chararray). Can anyone let me know the
> > > > > > data type for Case 2 arguments?
> > > > > >
> > > > > > Thanks,
> > > > > > Prashant
> > > > > >
> > > > >
> > > >
> > >
> >
>
