Hi Vineet,

Expanding on Lorand's resources: the right approach really depends on
your actual use case.  When translating SQL to Pig Latin, it's usually a
good idea to flow-chart the logical process first, just as you would when
designing SQL queries.  After that it's a matter of optimizing the script,
again much as you would tune SQL queries at the DBA layer.  The
'under-the-hood' optimizations to MapReduce (MR) are done by Pig.

Generically, this follows a simple pattern, e.g.:

--  optional runner (stdout and stderr both go to /tmp/my_file.out):
--  nohup pig -p REDUCERS=180 -f /home/hadoop/my_file.pig > /tmp/my_file.out 2>&1 &

--  some example configurations, e.g. gzip-compress the output
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
--SET default_parallel $REDUCERS;

-- loader for data source A
A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func()
     AS (the typed schema);
A1 = FOREACH A0 GENERATE stuff; -- projection steps
A = FILTER A1 BY (stuff); -- filter prior to JOIN

-- loader for data source B
B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func()
     AS (the typed schema);
B1 = FOREACH B0 GENERATE stuff; -- projection steps
B = FILTER B1 BY (stuff); -- filter prior to JOIN

-- where size(A) > size(B); PARALLEL to force use of all MR capacity
C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS;
-- re-alias the JOIN step fields to what you want, projection
C = FOREACH C0 GENERATE stuff;

D0 = GROUP C BY (cks); -- perform your grouping operation
-- whatever aggregation stats you wanted to perform wrt the GROUP BY operation
D = FOREACH D0 GENERATE FLATTEN(group) AS (cks),
    (int)COUNT(C) AS example_count:int;

-- flat, tab-delimited file output of the typed schema fields from [D];
-- here I used the PigStorage() store function
STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage();
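
To make the template less abstract, here is a minimal sketch of how one
JOIN + WHERE + GROUP BY query might map onto it.  Everything specific in it
is an assumption for illustration only: the two inputs (orders, customers),
their column names and types, the tab-delimited file layout, and the HDFS
paths are all hypothetical, so adjust them to your own data.

```pig
-- Hypothetical SQL being translated (table and column names are assumptions):
--   SELECT c.region, COUNT(*) AS order_count
--   FROM orders o JOIN customers c ON o.customer_id = c.customer_id
--   WHERE o.amount > 100.0
--   GROUP BY c.region;

-- load, project and filter the (hypothetical) orders data
ord0 = LOAD '/path/to/hdfs/orders.tsv' USING PigStorage('\t')
       AS (order_id:long, customer_id:long, amount:double);
ord1 = FOREACH ord0 GENERATE customer_id, amount;
ord  = FILTER ord1 BY amount > 100.0;  -- the WHERE clause, applied before the JOIN

-- load and project the (hypothetical) customers data
cust0 = LOAD '/path/to/hdfs/customers.tsv' USING PigStorage('\t')
        AS (customer_id:long, region:chararray);
cust  = FOREACH cust0 GENERATE customer_id, region;

-- inner JOIN on the key, then keep only the field needed downstream
jnd0 = JOIN ord BY customer_id, cust BY customer_id;
jnd  = FOREACH jnd0 GENERATE cust::region AS region;

-- GROUP BY plus the aggregate; COUNT_STAR counts every tuple, like COUNT(*)
grp = GROUP jnd BY region;
res = FOREACH grp GENERATE group AS region, COUNT_STAR(jnd) AS order_count;

STORE res INTO '/path/to/hdfs/output/orders_by_region' USING PigStorage('\t');
```

The WHERE condition here only touches the orders side, so it can be applied
before the JOIN to reduce the data being shuffled.  A condition that compares
fields from both sides would instead go in a FILTER after the JOIN.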

Hope this helps,  -Dan


On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <lben...@gmail.com> wrote:

> Hi Vineet,
>
> I'd recommend you have a look at these excellent resources:
>
> http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> http://www.slideshare.net/trihug/practical-pig/11
>
> --Lorand
>
>
> On 28/10/14 14:34, Vineet Mishra wrote:
>
>> Hi,
>>
>> I am looking to translate a SQL statement that combines multiple clauses
>> in the same query: specifically, a JOIN followed by a condition (WHERE)
>> and finally a grouping on some fields (GROUP BY).
>> Can I have a link or a brief guide on how to implement
>> this kind of complex SQL statement in Pig?
>>
>> Thanks!
>>
>>
>


-- 
Dan DeCapria
CivicScience, Inc.
Back-End Data IS/BI/DM/ML Specialist
