Re: Using PIG with complex SQL Statement

Dan DeCapria, CivicScience Tue, 28 Oct 2014 07:20:45 -0700

Hi Vineet,

Expanding on Lorand's resources: how you do this really depends on your
actual use case.  When blocking out code to translate SQL into Pig Latin,
it's usually a good idea to flow-chart the logical process of what you want
to do, just as you would for SQL queries.  Then it's just a matter of
optimizing those steps, again just as you would with SQL queries at the DBA
layer.  The 'under-the-hood' optimizations down to MapReduce (MR) are done
by Pig.


Generically, this follows a simple paradigm, e.g.:

--  optional runner (one shell command; stdout and stderr both go to the log):
--    nohup pig -p REDUCERS=180 -f /home/hadoop/my_file.pig > /tmp/my_file.out 2>&1 &

--  some example configurations, e.g. gzip-compress the output
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
--SET default_parallel $REDUCERS;

-- load data source A
A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed schema);
A1 = FOREACH A0 GENERATE stuff; -- projection steps
A = FILTER A1 BY (stuff); -- filter prior to JOIN

-- load data source B
B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed schema);
B1 = FOREACH B0 GENERATE stuff; -- projection steps
B = FILTER B1 BY (stuff); -- filter prior to JOIN

-- where size(A) > size(B); PARALLEL to force use of all MR capacity
C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS;
-- re-alias the JOIN step fields to what you want (projection)
C = FOREACH C0 GENERATE stuff;

D0 = GROUP C BY (cks); -- perform your grouping operation
-- whatever aggregation stats you wanted to perform wrt the GROUP BY operation
D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS example_count:int;

-- flat, tab-delimited file output of the typed schema fields from [D];
-- here I used the PigStorage() store func
STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage();
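
As a rough sketch only, here is how the template above might look for the
JOIN + WHERE + GROUP BY case from the original question.  Everything
concrete in it is an assumption made up for illustration: the orders and
customers data sets, their column names, the HDFS paths, the tab-delimited
input format, and the default REDUCERS value.

```pig
-- Hypothetical SQL being translated (table and column names are invented):
--   SELECT c.region, COUNT(*) AS order_count
--   FROM orders o JOIN customers c ON o.customer_id = c.customer_id
--   WHERE o.amount > 100.0
--   GROUP BY c.region;

-- fallback value used when the script is run without -p REDUCERS=...
%default REDUCERS 10

-- data source A: filter (the WHERE clause) and project before the JOIN
orders0 = LOAD '/path/to/hdfs/orders.tsv' USING PigStorage('\t')
          AS (order_id:long, customer_id:long, amount:double);
orders1 = FILTER orders0 BY amount > 100.0;
orders  = FOREACH orders1 GENERATE customer_id;

-- data source B: keep only the join key and the grouping column
customers0 = LOAD '/path/to/hdfs/customers.tsv' USING PigStorage('\t')
             AS (customer_id:long, region:chararray);
customers  = FOREACH customers0 GENERATE customer_id, region;

-- inner JOIN on the key, then re-alias the one field we still need
joined0 = JOIN orders BY customer_id, customers BY customer_id PARALLEL $REDUCERS;
joined  = FOREACH joined0 GENERATE customers::region AS region;

-- GROUP BY plus aggregation; COUNT_STAR counts every tuple, like SQL COUNT(*)
grouped = GROUP joined BY region;
result  = FOREACH grouped GENERATE group AS region, COUNT_STAR(joined) AS order_count;

STORE result INTO '/path/to/hdfs/output/orders_by_region' USING PigStorage('\t');
```

Applying the WHERE condition as a FILTER before the JOIN, rather than after
it, keeps the amount of data shuffled into the join smaller, which is the
same reasoning as the "filter prior to JOIN" steps in the template.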

Hope this helps,  -Dan


On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <[email protected]> wrote:

> Hi Vineet,
>
> I'd recommend you have a look at these excellent resources:
>
> http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> http://www.slideshare.net/trihug/practical-pig/11
>
> --Lorand
>
>
> On 28/10/14 14:34, Vineet Mishra wrote:
>
>> Hi,
>>
>> I am looking to translate a SQL statement that has multiple clauses in
>> the same query: specifically, a JOIN followed by a condition (WHERE) and
>> finally a grouping on some fields (GROUP BY).
>> Is there a link or a brief guide that shows how to implement this kind
>> of complex SQL statement in Pig?
>>
>> Thanks!
>>
>>
>


-- 
Dan DeCapria
CivicScience, Inc.
Back-End Data IS/BI/DM/ML Specialist
