Yep, getSchemaFromString is what I was looking for, but I can't get it
to generate a schema (for unit test purposes) that matches what I get
inside my script during a real run.
As an example, say I have a file like this:
foo\t2
bar\t3
baz\t3
marge\t4
homer\t4
and I load it like this:
infile = load 'test.txt' as (name:chararray, weight:int);
grouped = group infile all;
bucketed = foreach grouped generate flatten(Buckets(infile));
the outputSchema method of my UDF (Buckets) gets called with a schema
that stringifies like so:
{infile: {name: chararray,weight: int}}
i.e. it has a single field, which is a bag, containing two elements
directly (no wrapping tuple, presumably because this is Pig 0.8.1?).
(sidenote, I guess the outermost {}s are a display convention, as
there's only one bag there)
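To make the two shapes concrete, here is a small self-contained Java sketch of a bag that holds its fields directly versus one that wraps them in a tuple. It deliberately does not use Pig's Schema or FieldSchema classes: ToyField and SchemaShapes are made-up names, and the parenthesised rendering of the tuple layer is an assumption about how the wrapped form might be displayed, not output taken from Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a schema field; NOT org.apache.pig's FieldSchema.
final class ToyField {
    final String alias;           // may be null for an unnamed field
    final String type;            // "chararray", "int", "bag" or "tuple"
    final List<ToyField> inner;   // nested fields for bag/tuple, empty otherwise

    ToyField(String alias, String type, ToyField... inner) {
        this.alias = alias;
        this.type = type;
        this.inner = Arrays.asList(inner);
    }

    // Render the field: {...} for a bag, (...) for a tuple, the type name otherwise.
    String render() {
        String prefix = (alias == null) ? "" : alias + ": ";
        if (type.equals("bag")) return prefix + "{" + renderAll(inner) + "}";
        if (type.equals("tuple")) return prefix + "(" + renderAll(inner) + ")";
        return prefix + type;
    }

    // Join the rendered fields with commas.
    static String renderAll(List<ToyField> fields) {
        List<String> parts = new ArrayList<>();
        for (ToyField f : fields) parts.add(f.render());
        return String.join(",", parts);
    }
}

final class SchemaShapes {
    // Bag whose fields sit directly inside it (the shape reported above).
    static ToyField flatBag() {
        return new ToyField("infile", "bag",
                new ToyField("name", "chararray"),
                new ToyField("weight", "int"));
    }

    // Same bag with an explicit, unnamed tuple layer between the bag and its fields.
    static ToyField wrappedBag() {
        return new ToyField("infile", "bag",
                new ToyField(null, "tuple",
                        new ToyField("name", "chararray"),
                        new ToyField("weight", "int")));
    }

    // The outer braces stand for the enclosing one-field schema.
    static String asSchemaString(ToyField only) {
        return "{" + only.render() + "}";
    }

    public static void main(String[] args) {
        System.out.println(asSchemaString(flatBag()));
        System.out.println(asSchemaString(wrappedBag()));
    }
}
```

The only difference between the two builders is the extra "tuple" node, which is the layer in question here; the flat variant renders to the same string as the schema shown above.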
When I'm unit-testing the UDF's outputSchema method, I'd like to
generate exactly that schema.
But if I call getSchemaFromString like this:
Utils.getSchemaFromString("B: {f1: chararray, f2: int}")
It throws a parser error:
Encountered " "{" "{ "" at line 1, column 4.
Was expecting one of:
"int" ...
"long" ...
"float" ...
"double" ...
"chararray" ...
"bytearray" ...
"int" ...
"long" ...
"float" ...
"double" ...
"chararray" ...
"bytearray" ...
Two questions, I guess.
(1) Is there a way of generating a schema like that via Utils?
(2) ... or is this schema actually wrong, and am I looking at a symptom
of https://issues.apache.org/jira/browse/PIG-767 that would behave
differently if I were on Pig 0.9.0?
Many thanks,
Andrew.
On 4 October 2011 00:14, Raghu Angadi <[email protected]> wrote:
> Utils.getSchemaFromString() seems like exactly what you want
> (from org.apache.pig.impl.util).
>
> Raghu.
>
> [btw, my two previous attempts to send to the list got rejected as spam]
>
> On Mon, Oct 3, 2011 at 3:41 PM, Andrew Clegg
> <[email protected]> wrote:
>
>> Thanks Raghu (and Dmitry).
>>
>> Could this maybe get added to the docs page on UDFs? (Apologies if
>> it's there already and I missed it.)
>>
>> Also -- it's a bit cumbersome writing all these nested Schema and
>> FieldSchema constructors, especially when you're writing tests for
>> UDFs with flexible schema support.
>>
>> I was wondering if it would be practical to reuse whatever code the
>> front-end uses to parse schema descriptions from load statements in
>> scripts. Is this a silly idea? If it isn't silly, does anyone know
>> where I need to look for that code?
>>
>>
>> On 3 October 2011 22:56, Raghu Angadi <[email protected]> wrote:
>> > My understanding is that Pig 0.8 expects the first form and Pig 0.9
>> > requires the second.
>> >
>> > Raghu.
>> >
>> > On Mon, Oct 3, 2011 at 8:27 AM, Andrew Clegg
>> > <[email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> When you have a UDF that returns a bag, and you're writing the
>> >> outputSchema method, do you have to explicitly include the mandatory
>> >> 'container' tuple within the bag, or is this implicit?
>> >>
>> >> i.e. if I'm returning a bag of ints, do I have to do:
>> >>
>> >> return new Schema(
>> >>     new FieldSchema(null,
>> >>         new Schema(
>> >>             new FieldSchema(null, DataType.INTEGER)), DataType.BAG));
>> >>
>> >> Or do I have to explicitly define a tuple like so:
>> >>
>> >> return new Schema(
>> >>     new FieldSchema(null,
>> >>         new Schema(
>> >>             new FieldSchema(null,
>> >>                 new Schema(
>> >>                     new FieldSchema(null, DataType.INTEGER)), DataType.TUPLE)),
>> >>         DataType.BAG));
>> >>
>> >> The docs seem pretty vague on this, and you're allowed to do either.
>> >> My feeling would be that if the first form was illegal, you wouldn't
>> >> be allowed to create a schema like that, but this may be wishful
>> >> thinking.
>> >>
>> >> Thanks,
>> >>
>> >> Andrew.
>> >>
>> >> --
>> >>
>> >> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>> >>
>> >
>>
>>
>>
>> --
>>
>> http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg
>>
>
--
http://tinyurl.com/andrew-clegg-linkedin | http://twitter.com/andrew_clegg