Hi Vineet,

I believe I am understand more about what you are asking for now.  Note the
the FILTER statement will only allow through any tuples matching the
predicate logic condition specified in the filter (in this case your
massive disjunction statement hard-coded in the script).  Below is a
possible approach, which uses an external file to hold all \n delimited
category_ids to filter on using a replicated-join.

D0 = LOAD '/input_filter_list' USING PigStorage('\t','-noschema') AS
(category_id: int);
D1 = DISTINCT D0;
D2 = JOIN C BY (category_id) LEFT OUTER, D1 BY (category_id) USING
'replicated'; --0,1,2, 3
D = FOREACH D2 GENERATE $0 AS id, $1 AS category_id, $2 AS category_name,
($3 IS NOT NULL ? true : false) AS classifier_in_filter_list:boolean;

E0 = GROUP D BY (category_id, category_name, classifier_in_filter_list);
E = FOREACH E0 GENERATE FLATTEN(group) AS (category_id, category_name,
classifier_in_filter_list), (int)COUNT(D) AS cat_id_classifier_count:int;

Now this may not be entirely what you want, as it seems like you are trying
to existentially verify the filter predicate logic against your test sample
as a classifier, but maybe it'll help get some ideas going for you.
Hope this helps,

-Dan


On Fri, Oct 31, 2014 at 2:09 AM, Vineet Mishra <clearmido...@gmail.com>
wrote:

> Hi Dan,
>
> Thanks for your response, although
>
> FILTER cat_ids BY (category_id == 1);
>
> is working fine but having multiple IN matches would make the script very
> large, so considering my IN clause contains multiple values, such as
>
> sum_up_count = FILTER cat_ids BY category_id IN (1,2,3,4,5. . . 100);
>
> will make the query
>
> D0 = FILTER C BY (category_id == 1 OR category_id == 2 OR category_id == 3
> OR . . . category_id == 100);
>
> Is there any move around for this, moreover I wanted to return 1 if filter
> matches and 0 if not!
>
> Looking out for your valuable response.
>
> Thanks!
>
>
>
> On Thu, Oct 30, 2014 at 7:16 PM, Dan DeCapria, CivicScience <
> dan.decap...@civicscience.com> wrote:
>
> > Hi Vineet,
> >
> > Not entirely sure I'm understanding the problem correctly, but perhaps
> the
> > error you are getting can be fixed by:
> > sum_up_count = FILTER cat_ids BY (category_id == 1);
> >
> > I think that having a more clear description of your use case and input
> > data sets along with your current pig script in development will help us
> in
> > debugging this better.
> > As such, this might not help, but here's a crack at reverse-engineering
> > your problem space:
> >
> > A0 = LOAD '/input_01' USING PigStorage('\t','-noschema') AS (id:int,
> > category_id:int, category_name: chararray, meta:chararray);
> > A1 = FOREACH A0 GENERATE id, category_id, category_name;
> > A = DISTINCT A1;
> >
> > B0 = LOAD '/input_02' USING PigStorage('\t','-noschema') AS (category_id:
> > int, more_meta:chararray);
> > B1 = FOREACH B0 GENERATE category_id;
> > B = DISTINCT B1;
> >
> > C0 = JOIN A BY (category_id), B BY (category_id); -- 0,1,2, 3
> > C1 = FOREACH C0 GENERATE $0 AS id, $1 AS category_id, $2 AS
> category_name;
> > C = DISTINCT C1;
> >
> > D0 = FILTER C BY (category_id == 1);
> > D1 = GROUP D0 BY (category_id, category_name);
> > D = FOREACH D1 GENERATE FLATTEN(group) AS (category_id, category_name),
> > (int)COUNT(D0) AS cat_id_1_count:int;
> >
> >
> > Hope this helps,  -Dan
> >
> > On Thu, Oct 30, 2014 at 7:10 AM, Vineet Mishra <clearmido...@gmail.com>
> > wrote:
> >
> > > Hi Dan,
> > >
> > > I am trying to put Filter inside a Foreach, the description of the
> > group(on
> > > which the FOREACH iteration is happening) is mentioned below. I am
> trying
> > > to get counts of which all are passing the filter,
> > >
> > > Describe grp:
> > > grp: {group: (a::category_id: int,a::category_name: chararray),joind:
> > > {(a::id: int,a::category_id: int, b::category_id: int)}}
> > >
> > > Script:
> > > purified = FOREACH grp {
> > >  cat_ids_bag = FOREACH joind generate b::category_id;
> > >
> > > cat_ids = foreach cat_ids_bag generate flatten(category_id);
> > >
> > > sum_up_count = FILTER cat_ids BY category_id IN (1);
> > > }
> > >
> > > Its throwing error,
> > >   Syntax error, unexpected symbol at or near 'IN'
> > >
> > > Looking out for urgent response.
> > > Thanks!
> > >
> > > On Tue, Oct 28, 2014 at 7:49 PM, Dan DeCapria, CivicScience <
> > > dan.decap...@civicscience.com> wrote:
> > >
> > > > Hi Vineet,
> > > >
> > > > Expanding upon Lorand's resources, please note this all really
> depends
> > on
> > > > your actual use case.  When blocking out code to transform from SQL
> to
> > > Pig
> > > > latin, it's usually a good idea to just flow-chart plan the logical
> > > process
> > > > of what you want to do - just like you would for SQL queries.  Then
> > it's
> > > > just a matter of optimizing said queries - again, just like you would
> > > with
> > > > SQL queries on the DBA layer.  the 'under-the-hood' optimizations to
> MR
> > > is
> > > > done by Pig.
> > > >
> > > > Generically, this follows a simple paradigm, ie):
> > > >
> > > > --  optional runner: nohup pig -p REDUCERS=180 -f
> > > /home/hadoop/my_file.pig
> > > > 2>&1 > /tmp/my_file.out &
> > > >
> > > > --  some example configurations, ie) gzip compress the output
> > > > SET output.compression.enabled true;
> > > > SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
> > > > --SET default_parallel $REDUCERS;
> > > >
> > > > A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the
> typed
> > > > schema); -- loader data source A
> > > > A1 = FOREACH A0 GENERATE stuff; -- projection steps
> > > > A = FILTER A1 BY (stuff); -- filter prior to JOIN
> > > >
> > > > B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the
> typed
> > > > schema); -- loader data source B
> > > > B1 = FOREACH B0 GENERATE stuff; -- projection steps
> > > > B = FILTER B1 BY (stuff); -- filter prior to JOIN
> > > >
> > > > C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS; -- where size(A) >
> > > > size(B), PARALLEL to force use of all MR capacity
> > > > C = FOREACH C0 GENERATE stuff; -- re-alias the JOIN step fields to
> what
> > > you
> > > > want, projection
> > > >
> > > > D0 = GROUP C BY (cks); -- perform your grouping operation
> > > > D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS
> > > > example_count:int; -- whatever aggregation stats you wanted to
> perform
> > > wrt
> > > > the GROUP BY operation
> > > >
> > > > STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage(); --
> flat,
> > > > tab-delimited file output of typed schema fields from [D]; here I
> used
> > > > PigStorage() store.func
> > > >
> > > > Hope this helps,  -Dan
> > > >
> > > >
> > > > On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <lben...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Vineet,
> > > > >
> > > > > I'd recommend you have a look at these excellent resources:
> > > > >
> > > > > http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> > > > > http://mortar-public-site-content.s3-website-us-east-1.
> > > > > amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> > > > > http://www.slideshare.net/trihug/practical-pig/11
> > > > >
> > > > > --Lorand
> > > > >
> > > > >
> > > > > On 28/10/14 14:34, Vineet Mishra wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> I was looking out to transform SQL statement which is consisting
> of
> > > > >> multiple clause in the same query specifically, a JOIN followed by
> > > some
> > > > >> condition(WHERE) and finally grouping on some fields(GROUP BY).
> > > > >> Can I have a link or some briefing which can guide me how can I
> > > > implement
> > > > >> this k/o of complex SQL statement in PIG.
> > > > >>
> > > > >> Thanks!
> > > > >>
> > > > >>
> > > > >
> > > >
> > > >
> > > > --
> > > > Dan DeCapria
> > > > CivicScience, Inc.
> > > > Back-End Data IS/BI/DM/ML Specialist
> > > >
> > >
> >
> >
> >
> > --
> > Dan DeCapria
> > CivicScience, Inc.
> > Back-End Data IS/BI/DM/ML Specialist
> >
>



-- 
Dan DeCapria
CivicScience, Inc.
Back-End Data IS/BI/DM/ML Specialist

Reply via email to