Hey, I am sorry, but I didn't get how the substring will help here?

On Fri, May 20, 2011 at 1:08 PM, Xiaomeng Wan <[email protected]> wrote:

> you can try using some divide and conquer, like this:
>
> a = group data by (key, SUBSTRING(the_field_to_be_distinct, 0, 2));
> b = foreach a { x = distinct data.the_field_to_be_distinct; generate
> group.key as key, COUNT(x) as cnt; }
> c = group b by key;
> d = foreach c generate group as key, SUM(b.cnt) as cnt;
>
> use a longer substring if you still run into OOM.
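>
> The reason the prefix helps: two equal values always share the same
> 2-character prefix, so the distinct sets computed for different prefixes
> are disjoint, each nested DISTINCT only has to hold one prefix's bag in
> memory, and the per-key counts can simply be summed afterwards. Purely as
> an illustration (untested sketch, adapt the aliases), with the field names
> from your script, counting distinct cookied users per day would look
> roughly like:
>
> a = group logic by (select_pe_col::day, SUBSTRING(c_users, 0, 2));
> b = foreach a {
>         -- only the cookies sharing this 2-char prefix are in the bag
>         x = distinct logic.c_users;
>         generate group.$0 as day, COUNT(x) as cnt;
> };
> c = group b by day;
> d = foreach c generate group as day, SUM(b.cnt) as cnt;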
>
> Regards,
> Shawn
>
> On Fri, May 20, 2011 at 1:11 PM, sonia gehlot <[email protected]>
> wrote:
> > Hey Thejas,
> >
> > I tried setting the property pig.cachedbag.memusage to 0.1 and also tried
> > computing the distinct count for each type separately, but I am still
> > getting errors like
> >
> > Error: java.lang.OutOfMemoryError: Java heap space
> > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> > java.io.IOException: Spill failed
> >
> > Is there some other way to do the distinct count that you may suggest?
> >
> > Thanks for your help.
> >
> > Sonia
> >
> > On Fri, May 13, 2011 at 4:46 PM, Thejas M Nair <[email protected]> wrote:
> >
> >> The stack trace shows that the OOM error is happening when the distinct is
> >> being applied. It looks like in some record(s) of the relation group_it,
> >> one or more of the following bags is very large - logic.c_users,
> >> logic.nc_users or logic.registered_users.
> >>
> >> Try setting the property pig.cachedbag.memusage to 0.1 or lower
> >> (-Dpig.cachedbag.memusage=0.1 on the java command line). It controls the
> >> memory used by pig internal bags, including those used by distinct.
> >>
> >> If that does not work, you can try computing count-distinct for each type
> >> of user separately and then combining the result.
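> >>
> >> Just to illustrate the shape of that (a rough, untested sketch using the
> >> aliases from your script - one count-distinct per user type, combined at
> >> the end):
> >>
> >> c_grp = GROUP logic BY (select_pe_col::day, select_ref_col::referral_type_id,
> >>                         select_pe_col::business_unit_id);
> >> c_cnt = FOREACH c_grp {
> >>             d = DISTINCT logic.c_users;
> >>             GENERATE FLATTEN(group), COUNT(d) as c_users;
> >> };
> >> -- repeat the same two statements for nc_users and registered_users, then
> >> -- JOIN the three results BY the three group fields.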
> >>
> >>
> >> You might want to have a look at this way of optimizing count-distinct
> >> queries where skew can be a problem -
> >> https://issues.apache.org/jira/browse/PIG-1846
> >>
> >> -thejas
> >>
> >>
> >>
> >>
> >>
> >> On 5/12/11 10:43 AM, "sonia gehlot" <[email protected]> wrote:
> >>
> >> > Hi Guys,
> >> >
> >> > I am running the following Pig script on Pig 0.8:
> >> >
> >> > page_events = LOAD '/user/sgehlot/day=2011-05-10' as
> >> > (event_dt_ht:chararray,event_dt_ut:chararray,event_rec_num:int,event_type:int,
> >> > client_ip_addr:long,hub_id:int,is_cookied_user:int,local_ontology_node_id:int,
> >> > page_type_id:int,content_id:int,product_id:int,referrer_edition_id:int,
> >> > page_number:int,is_iab_robot:int,browser_id:int,os_id:int,dw_pubsys_id:int,
> >> > refresh:int,asset_id:int,asset_type_id:int,content_type_id:int,product_type_id:int,
> >> > outbound_email_id:long,gbal_clc:int,mtype:int,user_action_id:int,
> >> > referring_partner_id:int,ontology_node_id:int,content_namespace_id:int,
> >> > product_namespace_id:int,transparent_edition_id:int,default_edition_id:int,
> >> > event_seq_num:int,is_last_page:int,is_new_user:int,page_duration:int,
> >> > page_seq_num:int,session_id:long,time_since_sess_start:int,reg_cookie:chararray,
> >> > urs_app_id:int,is_reg_user:int,edition_id:int,user_agent_id:int,page_type_key:int,
> >> > referrer_id:int,channel_id:int,level2_id:int,level3_id:int,brand_id:int,
> >> > content_key:int,product_key:int,edition_key:int,partner_key:int,business_unit_id:int,
> >> > anon_cookie:chararray,machine_name:chararray,pagehost:chararray,
> >> > filenameextension:chararray,referrerpath:chararray,referrerhost:chararray,
> >> > referring_oid:chararray,referring_legacy_oid:chararray,ctype:chararray,
> >> > cval:chararray,link_tag:chararray,link_type:chararray,sticky_tag:chararray,
> >> > page_url:chararray,search_category:chararray,partner_subject:chararray,
> >> > referring_partner_name:chararray,robot_pattern:chararray,browser:chararray,
> >> > browser_major_version:chararray,browser_minor_version:chararray,os:chararray,
> >> > os_family:chararray,ttag:chararray,dest_oid:chararray,global_id:chararray,
> >> > hostname:chararray,path:chararray,filename:chararray,extension:chararray,
> >> > query:chararray,user_agent:chararray,xrq:chararray,xref:chararray,
> >> > page_guid:chararray,test_name:chararray,test_group:chararray,test_version:chararray,
> >> > page_version:chararray,o_sticky_tag:chararray,new_referring_oid:chararray,
> >> > day:chararray,network_ip:int,site_id:int,search_phrase:chararray,
> >> > search_attributes:chararray,web_search_phrase:chararray,ip_address:chararray,
> >> > is_pattern_match_robot:int,protocol:chararray,skc_title:chararray,skc_url:chararray,
> >> > has_site_search_phrase:int,has_site_search_attribs:int,has_web_search_phrase:int,
> >> > title_id:chararray,url_id:chararray,network_rev:int);
> >> >
> >> > referrer_group_map = LOAD '/user/sgehlot/oozie/db_data/referrer_group_map' as
> >> > (referrer_id:int, has_web_search_phrase:int, hostname:chararray,
> >> > referral_type_id:int, referral_type_name:chararray, referrer_group_id:int,
> >> > referrer_group_name:chararray, referrer_group_cat_id:int,
> >> > referrer_group_cat:chararray);
> >> >
> >> > filter_pe = FILTER page_events BY is_iab_robot == 0 AND
> >> > is_pattern_match_robot == 0 AND day == '2011-05-10';
> >> >
> >> > select_pe_col = FOREACH filter_pe GENERATE day, is_cookied_user,
> >> > anon_cookie, reg_cookie, referrer_id, has_web_search_phrase,
> >> > business_unit_id;
> >> >
> >> > select_ref_col = FOREACH referrer_group_map GENERATE referrer_id,
> >> > has_web_search_phrase, referral_type_id;
> >> >
> >> > jn = JOIN select_ref_col BY (referrer_id, has_web_search_phrase),
> >> > select_pe_col BY (referrer_id, has_web_search_phrase);
> >> >
> >> >
> >> > logic = FOREACH jn GENERATE
> >> >    select_pe_col::day,
> >> >    select_ref_col::referral_type_id,
> >> >    select_pe_col::business_unit_id,
> >> >    ((select_pe_col::is_cookied_user == 1) ? select_pe_col::anon_cookie : null) as c_users,
> >> >    ((select_pe_col::is_cookied_user == 0) ? select_pe_col::anon_cookie : null) as nc_users,
> >> >    ((select_pe_col::reg_cookie == '-1') ? null : select_pe_col::reg_cookie) as registered_users;
> >> >
> >> > group_it = GROUP logic BY (select_pe_col::day,
> >> >    select_ref_col::referral_type_id,
> >> >    select_pe_col::business_unit_id);
> >> >
> >> > agg_results = FOREACH group_it {
> >> >    dst_c_users = DISTINCT logic.c_users;
> >> >    dst_nc_users = DISTINCT logic.nc_users;
> >> >    dst_registered_users = DISTINCT logic.registered_users;
> >> >    GENERATE
> >> >       group.select_pe_col::day,
> >> >       group.select_ref_col::referral_type_id,
> >> >       group.select_pe_col::business_unit_id,
> >> >       COUNT(dst_c_users) as c_users,
> >> >       COUNT(dst_nc_users) as nc_users,
> >> >       COUNT(dst_registered_users) as registered_users;
> >> > };
> >> >
> >> > STORE agg_results INTO '/user/sgehlot/pt_users_referral_type_bu_day' USING
> >> > PigStorage('\t');
> >> >
> >> >
> >> > But I keep failing with a Java out of memory error. The data set in
> >> > "page_events" is huge, but "referrer_group_map" is relatively small.
> >> >
> >> > Here is the error message:
> >> >
> >> > Error: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >> > at java.util.ArrayList.<init>(ArrayList.java:112)
> >> > at java.util.ArrayList.<init>(ArrayList.java:119)
> >> > at org.apache.pig.data.DefaultTuple.<init>(DefaultTuple.java:59)
> >> > at org.apache.pig.data.BinSedesTuple.<init>(BinSedesTuple.java:73)
> >> > at org.apache.pig.data.BinSedesTupleFactory.newTuple(BinSedesTupleFactory.java:33)
> >> > at org.apache.pig.data.InternalCachedBag$CachedBagIterator.hasNext(InternalCachedBag.java:236)
> >> > at org.apache.pig.builtin.Distinct.getDistinctFromNestedBags(Distinct.java:136)
> >> > at org.apache.pig.builtin.Distinct.access$200(Distinct.java:38)
> >> > at org.apache.pig.builtin.Distinct$Intermediate.exec(Distinct.java:101)
> >> > at org.apache.pig.builtin.Distinct$Intermediate.exec(Distinct.java:94)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
> >> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:184)
> >> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
> >> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
> >> > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> >> > at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
> >> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
> >> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
> >> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
> >> >
> >> > Error: java.lang.OutOfMemoryError: Java heap space
> >> > at java.util.ArrayList.<init>(ArrayList.java:112)
> >> > at java.util.ArrayList.<init>(ArrayList.java:119)
> >> > at org.apache.pig.data.DefaultDataBag.<init>(DefaultDataBag.java:54)
> >> > at org.apache.pig.data.DefaultBagFactory.newDefaultBag(DefaultBagFactory.java:33)
> >> > at org.apache.pig.data.BinInterSedes.readBag(BinInterSedes.java:143)
> >> > at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:275)
> >> > at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
> >> > at org.apache.pig.data.BinInterSedes.readTuple(BinInterSedes.java:111)
> >> > at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:270)
> >> > at org.apache.pig.data.BinInterSedes.readDatum(BinInterSedes.java:251)
> >> > at org.apache.pig.data.BinInterSedes.addColsToTuple(BinInterSedes.java:555)
> >> > at org.apache.pig.data.BinSedesTuple.readFields(BinSedesTuple.java:64)
> >> > at org.apache.pig.impl.io.PigNullableWritable.readFields(PigNullableWritable.java:114)
> >> > at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >> > at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >> > at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:116)
> >> > at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:163)
> >> > at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCombinerPackage.getNext(POCombinerPackage.java:141)
> >> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:171)
> >> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:162)
> >> > at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:51)
> >> > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> >> > at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1222)
> >> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1265)
> >> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:686)
> >> > at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1173)
> >> >
> >> > Any idea or suggestion on what could be the cause of this error?
> >> >
> >> > Thanks for any help,
> >> > Sonia
> >> >
> >>
> >>
> >> --
> >>
> >>
> >>
> >
>
