I knew it would sound naive :P I didn't even know a schema parser existed!
`it can only return a tuple, which you then flatten into columns.`
Isn't this bad, though? For example, see this (for simplicity I'm using
TOTUPLE instead of my UDF):
A = load 'one.txt' as (a:int, b:int);
B = load 'two.txt' as (a:int, b:int);
A_1 = foreach A generate flatten(TOTUPLE(a,b));
B_1 = foreach B generate flatten(TOTUPLE(a,b));
C = join A_1 by a full, B_1 by a;
describe C;
The schema description is like this.
C: {A_1::org.apache.pig.builtin.totuple_b_18::a: int,
    A_1::org.apache.pig.builtin.totuple_b_18::b: int,
    B_1::org.apache.pig.builtin.totuple_b_19::a: int,
    B_1::org.apache.pig.builtin.totuple_b_19::b: int}
The totuple_b_** part of the description changes every time I run
describe, because it is based on a counter.
So how do I disambiguate between A_1's a, b and B_1's a, b?
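
One thing that might help here, assuming flatten accepts an explicit `as`
clause for the fields it produces (I believe it does): naming the flattened
fields should drop the generated totuple_b_NN alias, so that after the join
each column is qualified only by its relation. A rough sketch reusing the
relations above; D, left_a and right_a are just names made up for
illustration:

```pig
A = load 'one.txt' as (a:int, b:int);
B = load 'two.txt' as (a:int, b:int);

-- Name the flattened fields explicitly instead of relying on the
-- auto-generated tuple alias (assumption: this replaces the prefix).
A_1 = foreach A generate flatten(TOTUPLE(a,b)) as (a:int, b:int);
B_1 = foreach B generate flatten(TOTUPLE(a,b)) as (a:int, b:int);

C = join A_1 by a full, B_1 by a;
describe C;

-- Refer to each side of the join through its relation prefix.
D = foreach C generate A_1::a as left_a, B_1::a as right_a;
```

If that works, describe C should list the fields as A_1::a, A_1::b, B_1::a
and B_1::b, though I have not checked this on every Pig version.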
On Thu, Apr 19, 2012 at 12:07 PM, Jonathan Coveney <[email protected]> wrote:
> Dmitriy's suggestion is spot on, but just to be pedantic, you'd do:
>
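> public Schema outputSchema(Schema input) {
>     try {
>         List<FieldSchema> list = new ArrayList<FieldSchema>();
>         list.add(new FieldSchema("one", DataType.CHARARRAY));
>         list.add(new FieldSchema("two", DataType.CHARARRAY));
>         // Wrap both fields in a named tuple; this constructor can throw FrontendException.
>         return new Schema(new Schema.FieldSchema("t", new Schema(list), DataType.TUPLE));
>     } catch (FrontendException e) {
>         throw new RuntimeException(e);
>     }
> }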
>
> That said, in your question you asked: "how can you get it without the
> parentheses." Short answer is that you can't. A UDF can't return multiple
> columns -- it can only return a tuple, which you then flatten into columns.
>
> 2012/4/18 Dmitriy Ryaboy <[email protected]>
>
> > It's messy. Easier to use the schema parser:
> >
> >
> > org.apache.pig.impl.util.Utils.getSchemaFromString("t:tuple(len:int,word:chararray)");
> >
> > Even easier to use the @OutputSchema annotation (coming in 0.11 I believe)
> >
> > -D
> >
> >
> > On Wed, Apr 18, 2012 at 7:02 PM, Rajgopal Vaithiyanathan
> > <[email protected]> wrote:
> > > Hey all,
> > >
> > > Sorry if I sound naive, but how should one implement the outputSchema of an
> > > EvalFunc that returns a tuple?
> > > The way I do it is:
> > >
> > > public Schema outputSchema(Schema input) {
> > > List<FieldSchema> list = new ArrayList<FieldSchema>();
> > > list.add(new FieldSchema("one", DataType.CHARARRAY));
> > > list.add(new FieldSchema("two", DataType.CHARARRAY));
> > >
> > > return new Schema(list);
> > > }
> > >
> > > but in the front end, if I use
> > > B = foreach A generate FUNC();
> > > describe B;
> > > I get a schema like this:
> > > { ( one:chararray, two:chararray ) }
> > > Now if I use flatten on this, like:
> > > B = foreach A generate flatten(FUNC());
> > > I get { null::one : chararray, null::two : chararray }
> > >
> > > The question is:
> > > How should I implement outputSchema so that I get a schema like
> > > { one : chararray, two : chararray } // NOTE: without the parentheses
> >
>
Raj :)