Awesome :) Thanks a lot..

On Thu, Apr 19, 2012 at 10:26 PM, Jonathan Coveney wrote:
> Haha, when I say naive I don't mean bad... plenty of my scripts use that
> approach, and often it's unavoidable, so it's good to understand.
>
> as far as the naming issue, when you flatten it is usually a good idea to
> give the resultant columns a name. so your example would become:
>
> A = load 'one.txt' as (a:int, b:int);
> B = load 'two.txt' as (a:int, b:int);
> A_1 = foreach A generate flatten(TOTUPLE(a,b)) as (a,b);
> B_1 = foreach B generate flatten(TOTUPLE(a,b)) as (x,y);
> C = join A_1 by a full, B_1 by x;
> describe C
>
> That will get rid of the org.apache.pig.builtin.totuple_b etc. But let's
> say that you still want them to have the same name, you can do that:
>
> A = load 'one.txt' as (a:int, b:int);
> B = load 'two.txt' as (a:int, b:int);
> A_1 = foreach A generate flatten(TOTUPLE(a,b)) as (a,b);
> B_1 = foreach B generate flatten(TOTUPLE(a,b)) as (a,b);
> C = join A_1 by a full, B_1 by a;
> describe C
>
> And in the join result, you can disambiguate A_1::a and B_1::a, and so on.
>
> 2012/4/19 Rajgopal Vaithiyanathan
>
> > I knew it would sound naive :P I didn't even know a schema parser exists.!
> >
> > `it can only return a tuple, which you then flatten into columns.`
> >
> > Isn't this bad..? For example see this, ( for simplicity i'm using
> > TOTUPLE instead of my UDF.. )
> >
> > A = load 'one.txt' as (a:int, b:int);
> > B = load 'two.txt' as (a:int, b:int);
> > A_1 = foreach A generate flatten(TOTUPLE(a,b));
> > B_1 = foreach B generate flatten(TOTUPLE(a,b));
> > C = join A_1 by a full, B_1 by a;
> > describe C
> >
> > The schema description is like this.
> >
> > C: {A_1::org.apache.pig.builtin.totuple_b_18::a: int,
> >     A_1::org.apache.pig.builtin.totuple_b_18::b: int,
> >     B_1::org.apache.pig.builtin.totuple_b_19::a: int,
> >     B_1::org.apache.pig.builtin.totuple_b_19::b: int}
> >
> > and totuple_b_** in the description obviously changes every time i
> > describe because it is based on a counter....
> > Now how do I disambiguate between the A_1's a,b and B_1's a,b?
> >
> > On Thu, Apr 19, 2012 at 12:07 PM, Jonathan Coveney wrote:
> >
> > > Dmitriy's suggestion is spot on, but just to be pedantic, you'd do:
> > >
> > > public Schema outputSchema(Schema input) {
> > >     List<FieldSchema> list = new ArrayList<FieldSchema>();
> > >     list.add(new FieldSchema("one", DataType.CHARARRAY));
> > >     list.add(new FieldSchema("two", DataType.CHARARRAY));
> > >
> > >     return new Schema(new Schema.FieldSchema("t", new Schema(list),
> > >         DataType.TUPLE));
> > > }
> > >
> > > That said, in your question you asked: "how can you get it without the
> > > parenthesis." Short answer is that you can't. A UDF can't return multiple
> > > columns -- it can only return a tuple, which you then flatten into columns.
> > >
> > > 2012/4/18 Dmitriy Ryaboy
> > >
> > > > It's messy. Easier to use the schema parser:
> > > >
> > > > org.apache.pig.impl.util.Utils.getSchemaFromString("t:tuple(len:int,word:chararray)");
> > > >
> > > > Even easier to use the @OutputSchema annotation (coming in 0.11 I believe)
> > > >
> > > > -D
> > > >
> > > > On Wed, Apr 18, 2012 at 7:02 PM, Rajgopal Vaithiyanathan wrote:
> > > > > Hey all,
> > > > >
> > > > > Sorry if I sound naive, but how should one implement outputSchema of an
> > > > > EvalFunc that returns a tuple?
> > > > > The way I do it is,
> > > > >
> > > > > public Schema outputSchema(Schema input) {
> > > > >     List<FieldSchema> list = new ArrayList<FieldSchema>();
> > > > >     list.add(new FieldSchema("one", DataType.CHARARRAY));
> > > > >     list.add(new FieldSchema("two", DataType.CHARARRAY));
> > > > >
> > > > >     return new Schema(list);
> > > > > }
> > > > >
> > > > > but in the front end, if I use
> > > > > B = foreach A generate flatten(FUNC());
> > > > > describe B
> > > > > I get the schema like this:
> > > > > { ( one:chararray, two:chararray ) }
> > > > > Now I use a flatten on this like:
> > > > > B = foreach A generate flatten(FUNC());
> > > > > and I get { null::one : chararray, null::two : chararray }
> > > > >
> > > > > The question is,
> > > > > How should I implement the outputSchema so that I get the schema like
> > > > > { one : chararray, two : chararray } // NOTE: without the parenthesis
> > > > >
> >
> > Raj :)

--
Thanks and Regards,
Rajgopal Vaithiyanathan.
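
For anyone reading this thread later: the A_1::a / B_1::a naming that Jonathan mentions can be pictured with a small plain-Java sketch. This is only a toy model, not Pig code and not how Pig implements it. The class and method names (JoinNamingSketch, qualify, joinedNames) are made up for illustration. It assumes both join inputs were flattened with an explicit "as (a,b)", so each field has a stable name that the join can prefix with the relation alias.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: this is NOT Pig's implementation. It just mimics the
// alias::field naming that `describe` prints for a join result.
final class JoinNamingSketch {

    // Prefix every field of one join input with that relation's alias.
    static List<String> qualify(String alias, List<String> fields) {
        List<String> out = new ArrayList<String>();
        for (String field : fields) {
            out.add(alias + "::" + field);
        }
        return out;
    }

    // Column names of a two-way join: left input's fields, then the right's.
    static List<String> joinedNames(String leftAlias, List<String> leftFields,
                                    String rightAlias, List<String> rightFields) {
        List<String> out = qualify(leftAlias, leftFields);
        out.addAll(qualify(rightAlias, rightFields));
        return out;
    }

    public static void main(String[] args) {
        // Both inputs use the field names (a, b), as in the second example.
        List<String> ab = Arrays.asList("a", "b");
        // prints [A_1::a, A_1::b, B_1::a, B_1::b]
        System.out.println(joinedNames("A_1", ab, "B_1", ab));
    }
}
```

The point of the sketch is that same-named fields stay distinguishable after the join because the alias prefix differs, which is why naming the flattened columns yourself avoids the counter-based totuple_b_NN prefix.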
