Hi Daniel
Thanks for your help. I have created a UDF that aggregates the rest.
The UDF takes a DataBag as input and returns a DataBag with the same schema
as the input. My outputSchema method is below.
public Schema outputSchema(Schema input) {
    try {
        Schema bagSchema = new Schema();
        bagSchema.add(new Schema.FieldSchema(input.getField(0)));
        return new Schema(new Schema.FieldSchema(
                getSchemaName(this.getClass().getName().toLowerCase(), input),
                bagSchema, DataType.BAG));
    } catch (Exception e) {
        return null;
    }
}
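For reference, the "top X plus aggregated rest" idea can be sketched in plain Java, independent of Pig. This is only an illustrative sketch and not the actual UDF: the class TopXWithOthers, the Row type, the "others" label and the choice to sum visits are all assumptions, and Pig's Tuple and DataBag are replaced by a simple list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the aggregation logic; this is not Pig API code.
class TopXWithOthers {

    // Simplified row: only a keyword and its visit count (assumed fields).
    static final class Row {
        final String keyword;
        final long visits;

        Row(String keyword, long visits) {
            this.keyword = keyword;
            this.visits = visits;
        }
    }

    // Returns the x rows with the most visits, followed by a single "others"
    // row summing the visits of the remaining rows, if any remain.
    // Assumes x >= 0.
    static List<Row> topXPlusOthers(List<Row> rows, int x) {
        List<Row> sorted = new ArrayList<>(rows);
        // Sort by visits, highest first.
        sorted.sort(Comparator.comparingLong((Row r) -> r.visits).reversed());
        if (sorted.size() <= x) {
            return sorted; // nothing left over to aggregate
        }
        List<Row> result = new ArrayList<>(sorted.subList(0, x));
        long restVisits = 0;
        for (Row r : sorted.subList(x, sorted.size())) {
            restVisits += r.visits;
        }
        result.add(new Row("others", restVisits));
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("a", 10), new Row("b", 40), new Row("c", 5), new Row("d", 25));
        for (Row r : topXPlusOthers(rows, 2)) {
            System.out.println(r.keyword + "\t" + r.visits);
        }
    }
}
```

In a real UDF the same steps would operate on the tuples of the input DataBag, and the extra row would need the same fields as the input tuples so that the output schema still matches.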
I am using the UDF in Pig as
topkws = FOREACH kwgroup {
    sorted = ORDER kws BY visits DESC;
    GENERATE FLATTEN(AggregateOthers(sorted));
}
where AggregateOthers is my UDF. If I DESCRIBE topkws I get
topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,appid: int,keyword: chararray,searchengine: chararray,visits: long,f2: long,f3: long,f4: long,f5: long,f6: long,f7: long,f8: long,visitor: long}}
com.pig.udfs.topx_sorted is my package name, but I am not sure what "104"
stands for. How do I access each field in topkws? I need to join reportdate,
appid and keyword in topkws with another file.
Appreciate any help
thanks
Sheeba
On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <[email protected]> wrote:
> LIMIT only takes a constant, so "limit sorted_asc (COUNT(kws) - 5)" does
> not work.
>
> You will need a UDF, which returns DataBag. One example is
> org.apache.pig.builtin.COR, which returns DataBag. Basically, you can write
> a UDF like this:
>
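> import java.io.IOException;
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.DataBag;
> import org.apache.pig.data.DataType;
> import org.apache.pig.data.DefaultDataBag;
> import org.apache.pig.data.Tuple;
> import org.apache.pig.impl.logicalLayer.schema.Schema;
>
> public class BagTest extends EvalFunc<DataBag> {
>     @Override
>     public DataBag exec(Tuple input) throws IOException {
>         DataBag inputDB = (DataBag) input.get(0);
>         DataBag db = new DefaultDataBag();
>         // Construct your db
>         return db;
>     }
>
>     @Override
>     public Schema outputSchema(Schema input) {
>         return new Schema(new Schema.FieldSchema(
>                 getSchemaName(this.getClass().getName().toLowerCase(), input),
>                 DataType.BAG));
>     }
> }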
>
> Daniel
>
> -----Original Message----- From: Sheeba George
> Sent: Thursday, November 25, 2010 7:01 PM
> To: [email protected]
> Subject: Question on getting TotalCount - X records
>
>
> Hi all
>
> I need some help with Pig. The requirement is to generate the topX
> records for a group. I can easily do this in a Pig script by ordering by
> DESC and then limiting at X. If there are more than X records in the
> group, I need to aggregate the rest as a single record. How can I achieve
> this?
>
> I am generating topX as below
>
> kwgroup = GROUP kws BY (type,category);
>
> topkws = FOREACH kwgroup {
>     sorted = ORDER kws BY visits DESC;
>     ltd = limit sorted 5;
>     GENERATE FLATTEN(ltd);
> }
>
> For aggregating the rest, I was thinking of sorting in ASC order,
> truncating at (TotalCount - X), and then aggregating those records. How
> can I get the TotalCount of records in a group? I tried the following,
> but it fails.
>
> bottomkws = FOREACH kwgroup_cnt_gt_top {
>     sorted_asc = ORDER kws BY visits ASC;
>     ltd_bottom = limit sorted_asc (COUNT(kws) - 5);
>     GENERATE FLATTEN(ltd_bottom);
> }
>
> But this fails with an error message saying that we should use INTEGER
> instead of COUNT(kws).
>
> Is it better to do this using a UDF? In that case the UDF will have to
> sort, limit, and aggregate. Could you point me to some samples that take
> a group of records and return a group (bag)?
>
>
>
> Any help in this regard is appreciated.
>
>
>
> Thanks
>
> Sheeba
>
--
Sheeba Ann George