Re: Question on getting TotalCount - X records

Sheeba George Tue, 14 Dec 2010 00:44:00 -0800

Hi Daniel
Is it possible to get the schema string from the "input" param rather than
hardcoding?
Thanks
Sheeba
On Mon, Dec 13, 2010 at 11:53 PM, Daniel Dai <[email protected]> wrote:


> There was something wrong with the outputSchema I gave you last time; try this:
>
>
>   public Schema outputSchema(Schema input) {
>       try {
>           Schema schema = org.apache.pig.impl.util.Utils.getSchemaFromString(
>               "topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
>           return schema;
>
>       } catch (Exception e) {
>           return null;
>       }
>   }
>
> Daniel
>
>
> Sheeba George wrote:
>
>> Hi Daniel
>>    Thanks for your help ... I have created a UDF that aggregates the rest.
>> The UDF takes a DataBag as input and returns a DataBag with the same schema
>> as the input. My outputSchema method is below.
>>
>> public Schema outputSchema(Schema input) {
>>     try {
>>         Schema bagSchema = new Schema();
>>         bagSchema.add(new Schema.FieldSchema(input.getField(0)));
>>         return new Schema(new Schema.FieldSchema(
>>                 getSchemaName(this.getClass().getName().toLowerCase(), input),
>>                 bagSchema, DataType.BAG));
>>     } catch (Exception e) {
>>         return null;
>>     }
>> }
>>
>> I am using the UDF in PIG as
>> topkws = FOREACH kwgroup {
>>   sorted = ORDER kws BY visits DESC;
>>   GENERATE FLATTEN(AggregateOthers(sorted));}
>>
>> where AggregateOthers is my UDF. If I DESCRIBE topkws I get
>> topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,
>> appid: int,keyword: chararray,searchengine: chararray,visits: long,f2:
>> long,f3:
>> long,f4: long,f5: long,f6: long,f7: long,f8: long,visitor: long}}
>>
>> com.pig.udfs.topx_sorted is my package name. Not sure what "104" stands
>> for?
>>
>> How do I access each field in topkws? I need to join reportdate,appid and
>> keyword in topkws with another file.
>>
>> Appreciate any help
>>
>> thanks
>> Sheeba
>>
>>
>> On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <[email protected]> wrote:
>>
>>
>>
>>> LIMIT only takes a constant, so "limit sorted_asc (COUNT(kws) - 5)" does
>>> not work.
>>>
>>> You will need a UDF, which returns DataBag. One example is
>>> org.apache.pig.builtin.COR, which returns DataBag. Basically, you can
>>> write
>>> a UDF like this:
>>>
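>>> import java.io.IOException;
>>> import org.apache.pig.EvalFunc;
>>> import org.apache.pig.data.DataBag;
>>> import org.apache.pig.data.DataType;
>>> import org.apache.pig.data.DefaultDataBag;
>>> import org.apache.pig.data.Tuple;
>>> import org.apache.pig.impl.logicalLayer.schema.Schema;
>>>
>>> public class BagTest extends EvalFunc<DataBag> {
>>>  @Override
>>>  public DataBag exec(Tuple input) throws IOException {
>>>      // The first field of the input tuple is the bag passed to the UDF
>>>      DataBag inputDB = (DataBag)input.get(0);
>>>      DataBag db = new DefaultDataBag();
>>>      // Construct your db
>>>      return db;
>>>  }
>>>  @Override
>>>  public Schema outputSchema(Schema input) {
>>>      return new Schema(new Schema.FieldSchema(
>>>          getSchemaName(this.getClass().getName().toLowerCase(), input),
>>>          DataType.BAG));
>>>  }
>>> }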
>>>
>>> Daniel
>>>
>>> -----Original Message----- From: Sheeba George
>>> Sent: Thursday, November 25, 2010 7:01 PM
>>> To: [email protected]
>>> Subject: Question on getting TotalCount - X records
>>>
>>>
>>> Hi  all
>>>
>>>   I need some help with Pig. The requirement is to generate the top X
>>> records for a group. I can easily do this in a Pig script by ordering
>>> by DESC and then limiting at X. If there are more than X records in the
>>> group, I need to aggregate the rest into a single record. How can I achieve
>>> this?
>>>
>>> I am generating topX as below
>>>
>>> kwgroup = GROUP kws BY (type,category);
>>>
>>> topkws = FOREACH kwgroup {
>>>           sorted = ORDER kws BY visits DESC;
>>>           ltd = limit sorted 5;
>>>           GENERATE FLATTEN(ltd);}
>>>
>>> For aggregating the rest, I was thinking of sorting by ASC, truncating at
>>> (TotalCount – X), and then aggregating those records. How can I get the
>>> TotalCount of records in a group? I tried the script below, but it fails.
>>>
>>> bottomkws = FOREACH kwgroup_cnt_gt_top {
>>>     sorted_asc = ORDER kws BY visits ASC;
>>>     ltd_bottom = limit sorted_asc (COUNT(kws) - 5) ;
>>>     GENERATE FLATTEN(ltd_bottom);}
>>>
>>> This fails with an error message saying that an INTEGER should be used
>>> instead of COUNT(kws).
>>>
>>> Is it better to do this using a UDF? In that case the UDF will have to
>>> sort, limit, and aggregate. Could you point me to some samples that take
>>> a group of records and return a group (bag)?
>>>
>>>
>>>
>>> Any help in this regard is appreciated.
>>>
>>>
>>>
>>> Thanks
>>>
>>> Sheeba
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>


-- 
Sheeba Ann George
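
For the "top X plus one aggregated remainder" requirement in the original
question, the step that would sit behind the "// Construct your db" comment
can be sketched without any Pig dependency. The following is a minimal
plain-Java illustration, not the UDF from this thread: the class
TopXWithRest, the Row type, the method topXAndRest, and the label passed
for the remainder row are all hypothetical names, and it assumes the
remainder is aggregated by summing visits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Pig: keeps the top X rows by visits
// and folds every remaining row into one aggregated row.
final class TopXWithRest {

    // Hypothetical row type: a keyword and its visit count.
    static final class Row {
        final String keyword;
        final long visits;

        Row(String keyword, long visits) {
            this.keyword = keyword;
            this.visits = visits;
        }
    }

    static List<Row> topXAndRest(List<Row> rows, int x, String restLabel) {
        // Sort a copy in descending order of visits; the input is left untouched.
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong((Row r) -> r.visits).reversed());

        List<Row> out = new ArrayList<>();
        long restVisits = 0;
        for (int i = 0; i < sorted.size(); i++) {
            if (i < x) {
                out.add(sorted.get(i));               // one of the top X rows
            } else {
                restVisits += sorted.get(i).visits;   // assumption: aggregate = sum
            }
        }
        // Emit the aggregated row only when the group had more than X rows.
        if (sorted.size() > x) {
            out.add(new Row(restLabel, restVisits));
        }
        return out;
    }
}
```

Calling TopXWithRest.topXAndRest(rows, 5, "other") on one group's rows keeps
the five highest-visit rows and appends a single summed row. Inside a real
UDF the same loop would presumably read tuples from the input DataBag and add
tuples to the output bag instead of using Row objects.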
