Re: Pig Data type question

Prashant Kommireddi Fri, 25 Nov 2011 12:06:06 -0800

When the call is UDF(TOTUPLE(*), 'arg'), the EvalFunc actually receives a
single Tuple with 2 elements: the first is a Tuple and the second a
chararray. When the call is UDF(*, 'arg'), the EvalFunc receives a Tuple
with multiple fields (the fields of * followed by 'arg' as the last
element). I feel Pig should be able to distinguish between the two cases
here.
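
To make the two shapes concrete, here is a minimal sketch in plain Java. It
is only a model: List<Object> stands in for Pig's Tuple so it runs without
Pig on the classpath, and the names ArgShapes, wrapped and flattened are
made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the two argument shapes. List<Object> is an assumed
// stand-in for org.apache.pig.data.Tuple; nothing here is Pig API.
final class ArgShapes {

    // Models UDF(TOTUPLE(*), 'arg'): the record is nested as the first field.
    static List<Object> wrapped(List<Object> record, String arg) {
        List<Object> input = new ArrayList<>();
        input.add(new ArrayList<>(record));
        input.add(arg);
        return input;
    }

    // Models UDF(*, 'arg'): the record's fields are flattened, 'arg' is last.
    static List<Object> flattened(List<Object> record, String arg) {
        List<Object> input = new ArrayList<>(record);
        input.add(arg);
        return input;
    }

    public static void main(String[] args) {
        List<Object> record = Arrays.<Object>asList(1, 2.5f, "x");
        System.out.println(wrapped(record, "arg"));   // [[1, 2.5, x], arg]
        System.out.println(flattened(record, "arg")); // [1, 2.5, x, arg]
    }
}
```

The wrapped form always has exactly two fields, while the size of the
flattened form depends on how many columns the relation has.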


To answer your question,

> what if * is in fact just a Tuple of something? So you have
>
> TOTUPLE(tuple), 'chararray'
> tuple, 'chararray'
>
> which one should they match? The one intended for TOTUPLE(*), or the one
> intended for just *? Because both would match just a tuple.

It should match UDF(Tuple, chararray). It's up to the UDF to handle the
inner elements of the Tuple, but getArgToFuncMapping() should at least be
able to invoke the right UDF.

The reason I am trying to overload the function is that I have already
exposed UDF(TOTUPLE(*), chararray) to my users. I have now come up with a
UDF that performs better, BETTERUDF(*, 'arg'), because it avoids a TOTUPLE
call, and I want users to be able to switch to the new one just by changing
the arguments they pass to their UDF.
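
As a rough illustration of the runtime inspection idea from the quoted reply
below (look at the input and decide how to process it), here is a sketch
that again models Tuple as List<Object>. ShapeDispatch and recordFields are
hypothetical names, not Pig API; in a real EvalFunc the same check would
presumably be applied to the Tuple passed to exec().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: List<Object> is an assumed stand-in for Pig's Tuple.
final class ShapeDispatch {

    // Returns the record's fields whichever call form produced the input.
    // The last element of the input is always the chararray argument.
    static List<Object> recordFields(List<Object> input) {
        if (input.isEmpty()) {
            throw new IllegalArgumentException("expected at least the chararray argument");
        }
        // Wrapped form: exactly two fields, and the first is itself a tuple.
        // Caveat from the thread: a relation with a single tuple-typed column
        // passed as (*, 'arg') looks identical and is treated as wrapped here.
        if (input.size() == 2 && input.get(0) instanceof List) {
            List<?> inner = (List<?>) input.get(0);
            return new ArrayList<Object>(inner);
        }
        // Flattened form: everything except the trailing argument.
        return new ArrayList<>(input.subList(0, input.size() - 1));
    }

    public static void main(String[] args) {
        List<Object> wrapped =
            Arrays.<Object>asList(Arrays.<Object>asList(1, "x"), "arg");
        List<Object> flattened = Arrays.<Object>asList(1, "x", "arg");
        System.out.println(recordFields(wrapped));   // [1, x]
        System.out.println(recordFields(flattened)); // [1, x]
    }
}
```

This keeps a single class for both call forms, at the cost of the ambiguity
noted in the comment.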

The FloatAbs function is intended for scalar values, so it makes sense not
to wrap its argument in a TOTUPLE.

On Fri, Nov 25, 2011 at 11:48 AM, Jonathan Coveney <[email protected]> wrote:

> I believe that this is a current limitation of Pig: you can't have a
> function that uses both getArgToFuncMapping and a variable number of
> arguments. In this case it kind of makes sense that you can't, though.
> For example:
>
> what if * is in fact just a Tuple of something? So you have
>
> TOTUPLE(tuple), 'chararray'
> tuple, 'chararray'
>
> which one should they match? The one intended for TOTUPLE(*), or the one
> intended for just *? Because both would match just a tuple.
>
> Hmm, one more thing, though, which also is important: you're re-wrapping
> the argument in a Tuple. It is implicit that the input to your evalfunc
> will come in the form of a Tuple. In the UDF example, note that they don't
> rewrap in a tuple:
>
> funcList.add(new FuncSpec(FloatAbs.class.getName(),
>     new Schema(new Schema.FieldSchema(null, DataType.FLOAT))));
>
> So unless your argument will be explicitly rewrapped in a tuple, you don't
> need that piece.
>
> But yeah, someone else can chime in on whether getArgToFuncMapping can do
> what you want it to do, but I don't think it can. My suggestion would be
> to a) choose one form of input and stick to that, instead of trying to
> support two forms, and b) have an initializer in your EvalFunc that, on
> the first input, inspects the types and figures out which function to use
> to process the input.
>
> We do need to make funcspecs play nice with variable numbers of arguments,
> though, especially now that more schema info is available.
>
> 2011/11/25 Prashant Kommireddi <[email protected]>
>
> > Thanks Jonathan.
> >
> > What do I check for as the input type? DataType.TUPLE does not seem to
> > work. I would like to use "getArgToFuncMapping()" to be able to invoke
> > different functions based on input type, and I am not sure how to check
> > for Case 2.
> >
> > In my implementation, Case 1 could be checked for with (DataType.TUPLE,
> > DataType.CHARARRAY), but for Case 2 I would assume it should be
> > (DataType.TUPLE), and that does not work. Pig cannot infer a matching
> > function.
> >
> > @Override
> > public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
> >     List<FuncSpec> funcList = new ArrayList<FuncSpec>();
> >
> >     // Case 1: (tuple, chararray)
> >     Schema s = new Schema();
> >     s.add(new Schema.FieldSchema(null, DataType.TUPLE));
> >     s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
> >     funcList.add(new FuncSpec(this.getClass().getName(), s));
> >
> >     // Case 2 attempt: a single tuple (this is the mapping that fails to match)
> >     s = new Schema();
> >     s.add(new Schema.FieldSchema(null, DataType.TUPLE));
> >     funcList.add(new FuncSpec(CustomUDF.class.getName(), s));
> >
> >     return funcList;
> > }
> >
> >
> >
> > On Fri, Nov 25, 2011 at 12:52 AM, Jonathan Coveney <[email protected]> wrote:
> >
> > > The first case will give you a tuple which contains, as its first
> > > element, a tuple of all of the stuff in *, and as its second element,
> > > 'input'.
> > >
> > > The second will give you a tuple which contains all of the elements
> > > of *, and then as its last element, 'input'.
> > >
> > > This is what I thought, but to be sure I ran this UDF:
> > >
> > > import java.io.IOException;
> > > import org.apache.pig.EvalFunc;
> > > import org.apache.pig.data.Tuple;
> > >
> > > // Debugging UDF: prints the input tuple so its shape can be inspected.
> > > public class ATHING extends EvalFunc<String> {
> > >   @Override
> > >   public String exec(Tuple input) throws IOException {
> > >     System.out.println(input.toString());
> > >     return null;
> > >   }
> > > }
> > >
> > > 2011/11/24 Prashant Kommireddi <[email protected]>
> > >
> > > > I have a question regarding the pig data types.
> > > >
> > > > If I have a UDF, say 'CustomUDF' and I do something like this:
> > > >
> > > > REGISTER 'foo.jar';
> > > >
> > > > A = LOAD '/shared/a.dat';
> > > >
> > > > What would be the difference in the data types for UDF arguments
> > > > between:
> > > >
> > > > Case 1 : B = FOREACH A GENERATE CustomUDF(TOTUPLE(*), 'input'); AND
> > > > Case 2 : B = FOREACH A GENERATE CustomUDF(*, 'input');
> > > >
> > > > I am sure Case 1 is (tuple, chararray). Can anyone let me know the
> > > > data type for Case 2 arguments?
> > > >
> > > > Thanks,
> > > > Prashant
> > > >
> > >
> >
>
