Re: outputSchema for UDF EvalFunc returning a DataBag

Andrew Clegg Tue, 04 Oct 2011 06:02:22 -0700

Yep, getSchemaFromString is what I was looking for, but I can't get it
to generate a schema (for unit test purposes) that matches what I get
inside my script during a real run.


As an example, say I have a file like this:

foo\t2
bar\t3
baz\t3
marge\t4
homer\t4

and I load it like this:

infile = load 'test.txt' as (name:chararray, weight:int);
grouped = group infile all;
bucketed = foreach grouped generate flatten(Buckets(infile));

the outputSchema method of my UDF (Buckets) gets called with a schema
that stringifies like so:

{infile: {name: chararray,weight: int}}

i.e. it has a single field, which is a bag, containing the two fields
directly (no wrapping tuple, presumably because this is Pig 0.8.1?).

(sidenote, I guess the outermost {}s are a display convention, as
there's only one bag there)

When I'm unit-testing the UDF's outputSchema method, I'd like to
generate exactly that schema.

But if I call getSchemaFromString like this:

Utils.getSchemaFromString("B: {f1: chararray, f2: int}")

It throws a parser error:

Encountered " "{" "{ "" at line 1, column 4.
Was expecting one of:
    "int" ...
    "long" ...
    "float" ...
    "double" ...
    "chararray" ...
    "bytearray" ...
    "int" ...
    "long" ...
    "float" ...
    "double" ...
    "chararray" ...
    "bytearray" ...

Two questions, I guess.

(1) Is there a way of generating a schema like that via Utils?

(2) ... or is this schema actually wrong, and I'm looking at a symptom
of https://issues.apache.org/jira/browse/PIG-767 that would behave
differently if I were on Pig 0.9.0?
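
One possible stopgap for the unit test, sketched under assumptions: parse
only the flat field list with getSchemaFromString, then wrap it in the bag
by hand with the same Schema/FieldSchema constructors quoted further down
this thread. The class name BucketsSchemaFixture and its method names are
made up for illustration. The sketch assumes the flat string
"name: chararray, weight: int" parses (the error above suggests simple
types are accepted), and it has not been checked on both 0.8.1 and 0.9.0.

```java
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;
import org.apache.pig.impl.util.Utils;

// Hypothetical test fixture; not part of Pig or of the Buckets UDF.
public class BucketsSchemaFixture {

    // Flat field list parsed from text. Assumption: the parser accepts
    // simple types here, as the "Was expecting" list above suggests.
    // Declared "throws Exception" because the parser's exception class
    // may differ between Pig versions.
    static Schema fields() throws Exception {
        return Utils.getSchemaFromString("name: chararray, weight: int");
    }

    // Bag whose inner schema holds the fields directly (no tuple layer).
    public static Schema bagWithoutTuple() throws Exception {
        return new Schema(new FieldSchema("infile", fields(), DataType.BAG));
    }

    // Same bag with an explicit, unnamed tuple between bag and fields.
    public static Schema bagWithTuple() throws Exception {
        Schema tuple =
            new Schema(new FieldSchema(null, fields(), DataType.TUPLE));
        return new Schema(new FieldSchema("infile", tuple, DataType.BAG));
    }
}
```

In a test, either result can be passed to the UDF directly, e.g.
new Buckets().outputSchema(BucketsSchemaFixture.bagWithoutTuple()), and
printing both fixtures shows which one matches the string seen in the
real run.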

Many thanks,

Andrew.


On 4 October 2011 00:14, Raghu Angadi <[email protected]> wrote:
> Utils.getSchemaFromString() seems like exactly what you want (
> from org.apache.pig.impl.util ).
>
> Raghu.
>
> [btw. my two previous attempts to send to the list got rejected as spam ]
>
> On Mon, Oct 3, 2011 at 3:41 PM, Andrew Clegg
> <[email protected]> wrote:
>
>> Thanks Raghu (and Dmitry).
>>
>> Could this maybe get added to the docs page on UDFs? (Apologies if
>> it's there already and I missed it.)
>>
>> Also -- it's a bit cumbersome writing all these nested Schema and
>> FieldSchema constructors, especially when you're writing tests for
>> UDFs with flexible schema support.
>>
>> I was wondering if it would be practical to reuse whatever code the
>> front-end uses to parse schema descriptions from load statements in
>> scripts. Is this a silly idea? If it isn't silly, does anyone know
>> where I need to look for that code?
>>
>>
>> On 3 October 2011 22:56, Raghu Angadi <[email protected]> wrote:
>> > my understanding is that Pig 0.8 expects the first form and Pig 0.9
>> > requires the second.
>> >
>> > Raghu.
>> >
>> > On Mon, Oct 3, 2011 at 8:27 AM, Andrew Clegg
>> > <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> When you have a UDF that returns a bag, and you're writing the
>> >> outputSchema method, do you have to explicitly include the mandatory
>> >> 'container' tuple within the bag, or is this implicit?
>> >>
>> >> i.e. if I'm returning a bag of ints, can I just do:
>> >>
>> >> return new Schema(
>> >>  new FieldSchema(null,
>> >>    new Schema(
>> >>      new FieldSchema(null, DataType.INTEGER)), DataType.BAG));
>> >>
>> >> Or do I have to explicitly define a tuple like so:
>> >>
>> >> return new Schema(
>> >>  new FieldSchema(null,
>> >>    new Schema(
>> >>      new FieldSchema(null,
>> >>        new Schema(
>> >>          new FieldSchema(null, DataType.INTEGER)), DataType.TUPLE)),
>> >>    DataType.BAG));
>> >>
>> >> The docs seem pretty vague on this, and you're allowed to do either.
>> >> My feeling would be that if the first form was illegal, you wouldn't
>> >> be allowed to create a schema like that, but this may be wishful
>> >> thinking.
>> >>
>> >> Thanks,
>> >>
>> >> Andrew.
>> >>
>> >> --
>> >>
>> >> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>> >>
>> >
>>
>>
>>
>> --
>>
>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>
>



-- 

http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
