[Pig Wiki] Update of "PigLatinSchemas" by OlgaN

Apache Wiki Wed, 07 Nov 2007 11:50:58 -0800

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/PigLatinSchemas

New page:
[[Anchor(Pig_Latin_Schemas)]]
=== Pig Latin Schemas ===

[[Anchor(Defining_a_schema_in_a_LOAD_statement)]]
==== Defining a schema in a LOAD statement ====
The basic grammar for schema definitions is borrowed from the JSON/Python 
tuple/list/map notation, and is as follows:
 * `field1` = Atom alias
 * `name : (f1, f2, ...)` = Tuple alias and schema
 * `name : [f1, f2, ...]` = Bag alias and schema

So the schema:
{{{
(time, query : (display, normalized), results :  [url, title, summary])
}}}

would define a Tuple where the first field is an Atom called "time",  
the second field is a Tuple called "query" with the Atom fields  
"display" and "normalized", and the third field is a Bag called  
"results", which contains tuples that have three Atom fields  
"url", "title" and "summary".

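The notation above is regular enough to be read by a small recursive-descent parser. The following is only a sketch in Python of how such a schema string could be turned into a nested structure; the `Field` class, `parse_schema` and the other names are made up for this illustration and are not part of Pig, and error handling for malformed input is minimal.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str   # alias of the field ("" for the anonymous top-level tuple)
    kind: str   # "atom", "tuple" or "bag"
    children: list = field(default_factory=list)


# An identifier or one of the punctuation characters of the notation.
TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|[()\[\],:])")

# Opening bracket -> (closing bracket, kind of the nested field).
BRACKETS = {"(": (")", "tuple"), "[": ("]", "bag")}


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_fields(tokens, i, closer):
    """Parse comma-separated fields up to `closer`; return (fields, next index)."""
    fields = []
    while True:
        name = tokens[i]
        i += 1
        if tokens[i] == ":":
            # "name : ( ... )" is a Tuple, "name : [ ... ]" is a Bag.
            close, kind = BRACKETS[tokens[i + 1]]
            children, i = parse_fields(tokens, i + 2, close)
            fields.append(Field(name, kind, children))
        else:
            fields.append(Field(name, "atom"))
        if tokens[i] == closer:
            return fields, i + 1
        if tokens[i] != ",":
            raise ValueError(f"expected ',' or {closer!r}, got {tokens[i]!r}")
        i += 1


def parse_schema(text):
    tokens = tokenize(text)
    if not tokens or tokens[0] != "(":
        raise ValueError("a schema must start with '('")
    fields, end = parse_fields(tokens, 1, ")")
    if end != len(tokens):
        raise ValueError("trailing input after the closing ')'")
    return Field("", "tuple", fields)


schema = parse_schema(
    "(time, query : (display, normalized), results :  [url, title, summary])"
)
for f in schema.children:
    print(f.name, f.kind, [c.name for c in f.children])
```

Run on the example schema, the loop lists `time` as an atom, `query` as a tuple with two atom fields, and `results` as a bag with three.
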
The "AS" keyword on a LOAD statement allows you to define a schema for a 
particular alias. For example,

{{{
A = load 'input1' as (tstamp, cookie, query);
B = load 'input2' as (query, url, rank);
}}}

associates schemas with A and B.

[[Anchor(Schema_Propagation)]]
==== Schema Propagation ====

The system will do its best to infer the schema for a derived alias based on 
the schemas of the input aliases. 

Continuing with our running example, suppose we have
{{{
C = cogroup A by query, B by query;
}}}
Then C will be assigned the schema `(group, A: [tstamp, cookie, query], B: [query, url, rank])`.
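
The inference rule used in this example can be modelled in a few lines of Python. This is only a sketch of the rule as described here (a `group` field followed by one bag per input alias, each bag carrying the schema of its input); `cogroup_schema` and `render` are hypothetical helper names, not Pig functions.

```python
def cogroup_schema(inputs):
    """Model the schema inferred for a cogroup.

    `inputs` maps each input alias to its list of field names, in the
    order the aliases appear in the cogroup statement.
    """
    schema = ["group"]  # the first field holds the grouping key
    for alias, fields in inputs.items():
        # each input becomes a bag named after its alias
        schema.append((alias, list(fields)))
    return schema


def render(schema):
    """Print a modelled schema in the notation used on this page."""
    parts = []
    for entry in schema:
        if isinstance(entry, str):
            parts.append(entry)
        else:
            alias, fields = entry
            parts.append(f"{alias}: [{', '.join(fields)}]")
    return "(" + ", ".join(parts) + ")"


C = cogroup_schema({
    "A": ["tstamp", "cookie", "query"],
    "B": ["query", "url", "rank"],
})
print(render(C))  # (group, A: [tstamp, cookie, query], B: [query, url, rank])
```
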

[[Anchor(Referring_to_Nested_Fields,_i.e.,_Nested_Projection)]]
==== Referring to Nested Fields, i.e., Nested Projection ====
You can refer to fields up to one level down in the nesting. Thus, in the above 
example, you can write:
{{{
foreach C generate group, A.cookie
}}}
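
To make the one-level limit concrete, here is a small Python sketch that resolves a reference such as `A.cookie` against the schema of C. The dictionary layout and the `project` function are assumptions made for this illustration; they do not mirror Pig's internal data structures.

```python
def project(schema, ref):
    """Resolve `name` or `outer.inner` to a tuple of field positions.

    `schema` maps each top-level field name to None (an atom) or to the
    list of field names nested inside it.
    """
    parts = ref.split(".")
    if len(parts) > 2:
        raise ValueError("only one level of nesting can be referenced")
    if parts[0] not in schema:
        raise KeyError(f"no field named {parts[0]!r}")
    outer = list(schema).index(parts[0])
    if len(parts) == 1:
        return (outer,)
    nested = schema[parts[0]]
    if nested is None:
        raise ValueError(f"{parts[0]!r} is an atom and has no nested fields")
    if parts[1] not in nested:
        raise KeyError(f"no field named {parts[1]!r} inside {parts[0]!r}")
    return (outer, nested.index(parts[1]))


C = {
    "group": None,
    "A": ["tstamp", "cookie", "query"],
    "B": ["query", "url", "rank"],
}
print(project(C, "group"))     # (0,)
print(project(C, "A.cookie"))  # (1, 1)
```
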

[[Anchor(Name_Ambiguity_Resolution)]]
==== Name Ambiguity Resolution ====
Sometimes, when using FLATTEN, there might be name ambiguities in schemas from 
two different inputs. Thus, if in the above example, we write

{{{
D = foreach C generate flatten(A), flatten(B);
}}}

there will be a name ambiguity, since both flatten(A) and flatten(B) contain the 
field `query`. To avoid ambiguity in such cases, fields can be referred to as 
`<outer-alias>::fieldName`. Thus for D, we can refer to either `A::query` or 
`B::query`, but not to plain `query`. 

Unambiguous fields, however, can be accessed either by their plain names or 
by `<outer-alias>::fieldName`. Thus for D, both `url` and `B::url` access 
the same field.
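
The resolution rule can be sketched in Python as follows. This models only the behaviour described in this section (qualified names always work, bare names work only when unique) for the flattened result of the example; `flatten_schema` and `resolve` are hypothetical names chosen for the illustration.

```python
def flatten_schema(bags):
    """Return the fully qualified field names produced by flattening.

    `bags` is a list of (alias, field names) pairs, one per flattened bag.
    """
    return [f"{alias}::{name}" for alias, fields in bags for name in fields]


def resolve(flat, ref):
    """Return the position of `ref`, which may be qualified or bare."""
    if "::" in ref:
        if ref not in flat:
            raise KeyError(f"no field named {ref!r}")
        return flat.index(ref)
    # A bare name matches every qualified name that ends with it.
    matches = [i for i, q in enumerate(flat) if q.split("::", 1)[1] == ref]
    if not matches:
        raise KeyError(f"no field named {ref!r}")
    if len(matches) > 1:
        candidates = ", ".join(flat[i] for i in matches)
        raise ValueError(f"{ref!r} is ambiguous; use one of: {candidates}")
    return matches[0]


flat = flatten_schema([
    ("A", ["tstamp", "cookie", "query"]),
    ("B", ["query", "url", "rank"]),
])
print(resolve(flat, "url"))       # 4
print(resolve(flat, "B::url"))    # 4
print(resolve(flat, "A::query"))  # 2
# resolve(flat, "query") raises ValueError, since A::query and B::query both match
```
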

[[Anchor(Assigning_Names_to_Individual_Items_in_GENERATE)]]
==== Assigning Names to Individual Items in GENERATE ====
Just like in SQL where you can give names to individual items in the select 
list, we can name individual items in the generate clause using AS. Thus, in 
our example,
{{{
E = foreach D generate (cookie eq 'null' ? 'null' : url) as nullifiedUrl, rank as myRank;
}}}
This will assign a schema `(nullifiedUrl, myRank)` to E.


[[Anchor(Schemas_of_Functions)]]
==== Schemas of Functions ====
Eval functions can specify their own output schema by overriding the 
outputSchema() method. The builtin function SUM specifies that its output is 
called `sum`. Thus, 
{{{
F = foreach C generate group, SUM(A.tstamp);
}}}
F gets assigned the schema `(group, sum)`. This can of course be overridden, 
e.g., `generate group, SUM(A.tstamp) as sumTstamp`.
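
The naming precedence described here (an explicit AS name first, then the name the function declares for itself, otherwise no name) can be summarised in a short Python sketch. It models only the rule, not Pig's implementation; the table contains just the one function this page documents, and `MYUDF` is a made-up function that declares no output schema.

```python
# Output names declared by functions. Only SUM is documented on this page;
# anything else is treated as declaring no output schema.
FUNCTION_OUTPUT_NAMES = {"SUM": "sum"}


def item_name(kind, value, as_name=None):
    """Name given to one item of a generate clause, or None if unknown."""
    if as_name is not None:
        return as_name          # an explicit AS always wins
    if kind == "field":
        return value            # a plain field keeps its own name
    if kind == "func":
        return FUNCTION_OUTPUT_NAMES.get(value)
    raise ValueError(f"unknown item kind {kind!r}")


def generate_schema(items):
    return [item_name(*item) for item in items]


print(generate_schema([("field", "group"), ("func", "SUM")]))
# ['group', 'sum']
print(generate_schema([("field", "group"), ("func", "SUM", "sumTstamp")]))
# ['group', 'sumTstamp']
print(generate_schema([("func", "MYUDF")]))
# [None]
```
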
 

[[Anchor(Last_Resort:_Overriding_system-inferred_schemas)]]
==== Last Resort: Overriding system-inferred schemas ====
Sometimes the system cannot infer a schema (e.g., for binconds, or for eval functions that 
don't specify one). In these cases, and whenever else you want to override 
the system-inferred schema, you can do so with the AS clause. Thus, you 
could say:
{{{
C = (cogroup A by query, B by query) as (group, foo, bar);
}}}
and C would be assigned the schema `(group, foo, bar)`.
