Hi Daniel,
    Thanks for your help. I have created a UDF that aggregates the rest.
The UDF takes a DataBag as input and returns a DataBag with the same schema
as the input. My outputSchema method is below.

public Schema outputSchema(Schema input) {
    try {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema(input.getField(0)));
        return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                bagSchema, DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}
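
For reference, the "keep the top X and aggregate the rest" logic can be
sketched in plain Java without any Pig classes. This is only a minimal
illustration: the Row type, the "others" label, and the method name are
hypothetical stand-ins, and summing visits is an assumed way of aggregating
the remaining records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TopXWithOthers {
    // Hypothetical stand-in for one tuple in the bag: a keyword and its visits.
    static final class Row {
        final String keyword;
        final long visits;

        Row(String keyword, long visits) {
            this.keyword = keyword;
            this.visits = visits;
        }
    }

    // Returns the x rows with the most visits, followed by a single "others"
    // row that sums the visits of all remaining rows (if there are any).
    static List<Row> topXWithOthers(List<Row> rows, int x) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong((Row r) -> r.visits).reversed());

        int keep = Math.max(0, Math.min(x, sorted.size()));
        List<Row> result = new ArrayList<>(sorted.subList(0, keep));

        if (sorted.size() > keep) {
            long rest = 0;
            for (Row r : sorted.subList(keep, sorted.size())) {
                rest += r.visits;
            }
            result.add(new Row("others", rest));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", 10), new Row("b", 40),
                new Row("c", 5), new Row("d", 20));
        for (Row r : topXWithOthers(rows, 2)) {
            System.out.println(r.keyword + "\t" + r.visits);
        }
    }
}
```

In a real UDF the same steps would operate on the tuples of the input
DataBag instead of a List.
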
I am using the UDF in Pig as follows:
topkws = FOREACH kwgroup {
   sorted = ORDER kws BY visits DESC;
   GENERATE FLATTEN(AggregateOthers(sorted));}

where AggregateOthers is my UDF. If I DESCRIBE topkws I get
topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,
appid: int, keyword: chararray, searchengine: chararray, visits: long,
f2: long, f3: long, f4: long, f5: long, f6: long, f7: long, f8: long,
visitor: long}}

com.pig.udfs.topx_sorted is my package name. Not sure what "104" stands for?

How do I access each field in topkws? I need to join topkws with another
file on reportdate, appid and keyword.

I appreciate any help.

thanks
Sheeba


On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <[email protected]> wrote:

> Limit only takes a constant, so "limit sorted_asc (COUNT(kws) - 5)" does
> not work.
>
> You will need a UDF, which returns DataBag. One example is
> org.apache.pig.builtin.COR, which returns DataBag. Basically, you can write
> a UDF like this:
>
> public class BagTest extends EvalFunc<DataBag> {
>   @Override
>   public DataBag exec(Tuple input) throws IOException {
>       DataBag inputDB = (DataBag)input.get(0);
>       DataBag db = new DefaultDataBag();
>       // Construct your db
>       return db;
>   }
>   @Override
>   public Schema outputSchema(Schema input) {
>       return new Schema(new Schema.FieldSchema(
>               getSchemaName(this.getClass().getName().toLowerCase(), input),
>               DataType.BAG));
>   }
> }
>
> Daniel
>
> -----Original Message----- From: Sheeba George
> Sent: Thursday, November 25, 2010 7:01 PM
> To: [email protected]
> Subject: Question on getting TotalCount - X records
>
>
> Hi  all
>
>    I need some help with Pig. The requirement is to generate the topX
> records for a group. I can easily do this in a Pig script by ordering by
> DESC and then limiting at X. If there are more than X records in the
> group, I need to aggregate the rest as a single record. How can I achieve
> this?
>
> I am generating topX as below
>
> kwgroup = GROUP kws BY (type,category);
>
> topkws = FOREACH kwgroup {
>            sorted = ORDER kws BY visits DESC;
>            ltd = limit sorted 5;
>            GENERATE FLATTEN(ltd);}
>
> For aggregating the rest, I was thinking of sorting by ASC and truncating
> at (TotalCount - X), then aggregating these records. How can I get the
> TotalCount of records in a group? I tried the script below, but it fails.
>
> bottomkws = FOREACH kwgroup_cnt_gt_top {
>            sorted_asc = ORDER kws BY visits ASC;
>            ltd_bottom = limit sorted_asc (COUNT(kws) - 5);
>            GENERATE FLATTEN(ltd_bottom);}
>
> But this fails with an error message saying that we should use INTEGER
> instead of COUNT(kws).
>
> Is it better to do this using a UDF? In that case the UDF will have to
> sort, limit, and aggregate. Could you point me to some samples that take a
> group of records and return a group (bag)?
>
>
>
> Any help in this regard is appreciated.
>
>
>
> Thanks
>
> Sheeba
>



-- 
Sheeba Ann George
