By extending an abstract class, you can reuse the generic validation of
Pig's input Tuple and get a consistent hook for your DataBag parsing
logic. Consider the following abstract class ParseBagAsBag. Your own
class, say MyDatabagParserToDataBag, extends it, overrides the method
parser_logic(), and writes its results to the inherited output super.bag:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public abstract class ParseBagAsBag extends EvalFunc<DataBag> {
    public TupleFactory tuple_factory = TupleFactory.getInstance();
    public BagFactory bag_factory = BagFactory.getInstance();
    public DataBag bag;

    /**
     * Wrapper for deconstructing the input Tuple to extract its DataBag
     * component.
     * @param input Tuple containing DataBag.
     * @return DataBag of parser logic, null iff bag is empty.
     * @throws IOException
     */
    @Override
    public DataBag exec(Tuple input) throws IOException {
        // start every call with a fresh, empty output bag from the factory
        this.bag = this.bag_factory.newDefaultBag();
        if (input != null) {
            // @precondition check
            if ((!input.isNull()) && (input.size() > 0)) {
                // @precondition check; tuple is non-empty and interesting
                // DataBag wrapped in a one-element Tuple
                Object oBag = input.get(0);
                if (oBag instanceof DataBag) {
                    // @precondition check; type pig.DataBag
                    DataBag databag = (DataBag) oBag;
                    parser_logic(databag);
                }
            }
        }
        // return the bag iff parser_logic() added to it, otherwise null
        return (this.bag.size() > 0) ? this.bag : null;
    }

    // subclasses read from databag and add their output Tuples to this.bag
    public abstract void parser_logic(DataBag databag) throws IOException;
}
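If you want to try the shape of this pattern without Pig on the
classpath, here is a rough, library-free sketch. Everything in it is a
stand-in I made up for illustration: List<Object> plays the role of a
Tuple, List<List<Object>> plays the role of a DataBag, and the class names
ParseListAsList and KeepNonNullFirstField are hypothetical. It is not
Pig API, just the same validate-then-delegate structure:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for the pattern above (assumption: no Pig types here).
// List<Object> stands in for a Tuple, List<List<Object>> for a DataBag.
abstract class ParseListAsList {
    // Output collected by the subclass hook; reset on every call.
    protected List<List<Object>> out;

    // Validates the input "tuple", then hands its first field to the hook.
    public final List<List<Object>> exec(List<Object> input) {
        out = new ArrayList<>();
        if (input != null && !input.isEmpty() && input.get(0) instanceof List) {
            @SuppressWarnings("unchecked")
            List<List<Object>> bag = (List<List<Object>>) input.get(0);
            parserLogic(bag);
        }
        // Mirror the contract above: null when the hook produced nothing.
        return out.isEmpty() ? null : out;
    }

    // Hook for subclasses: read rows from bag, add results to out.
    protected abstract void parserLogic(List<List<Object>> bag);
}

// Hypothetical subclass: keeps only the rows whose first field is non-null.
class KeepNonNullFirstField extends ParseListAsList {
    @Override
    protected void parserLogic(List<List<Object>> bag) {
        for (List<Object> row : bag) {
            if (!row.isEmpty() && row.get(0) != null) {
                out.add(row);
            }
        }
    }
}
```

The subclass only ever touches the hook and the output collection. All
the null and type checks stay in one place in the base class, which is
the point of the abstract class above.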
Hope this helps.
-Dan
On Mon, Mar 18, 2013 at 11:01 AM, Jonathan Coveney <[email protected]> wrote:
> Ah, I suppose I was just proving it could be done.
>
> To make a new one, you'd do:
>
> public class MyUdf extends EvalFunc<DataBag> {
> private static final BagFactory mBagFactory = BagFactory.getInstance();
> public DataBag exec(Tuple input) throws IOException {
> DataBag output = mBagFactory.newDefaultBag();
> for (Tuple t : (DataBag)input.get(0)) {
> output.add(t);
> }
> return output;
> }
> }
>
>
>
>
> 2013/3/18 Kris Coward <[email protected]>
>
> >
> > But he asked for a function that returns *another* bag ;)
> >
> > Snark aside, when returning bags or tuples, it's also worthwhile to at
> > least consider also defining the output schema, which for your example
> > code would probably mean
> >
> > public Schema outputSchema(Schema input){
> > Schema output = new Schema();
> > output.add(input.getField(0));
> > return output;
> > }
> >
> > (possibly with some omitted exception handling)
> >
> > -Kris
> >
> > On Mon, Mar 18, 2013 at 11:19:17AM +0100, Jonathan Coveney wrote:
> > > Absolutely.
> > >
> > > public class MyUdf extends EvalFunc<DataBag> {
> > > public DataBag exec(Tuple input) throws IOException {
> > > return (DataBag)input.get(0);
> > > }
> > > }
> > >
> > >
> > > A dummy example, but there you go. DataBag is a valid pig type like any
> > > other, so you just return it like you would normally.
> > >
> > >
> > > 2013/3/18 pranjal rajput <[email protected]>
> > >
> > > > Hi,
> > > > Can we define a UDF in pig that takes a bag as an input and returns
> > > > another bag as output?
> > > > How can this be done?
> > > > Thanks,
> > > > --
> > > > regards
> > > > Pranjal
> > > >
> >
> > --
> > Kris Coward http://unripe.melon.org/
> > GPG Fingerprint: 2BF3 957D 310A FEEC 4733 830E 21A4 05C7 1FEB 12B3
> >
>