Re: Question on getting TotalCount - X records

Sheeba George Tue, 14 Dec 2010 09:44:57 -0800

Your suggestion helped me do the join. However, I want the output schema to
be generic, in fact the same as the input schema, because this UDF will be
shared by different inputs. How do I do that?


thanks
Sheeba

On Tue, Dec 14, 2010 at 12:43 AM, Sheeba George <[email protected]> wrote:

>
> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding?
> Thanks
> Sheeba
> On Mon, Dec 13, 2010 at 11:53 PM, Daniel Dai <[email protected]> wrote:
>
>> There is something wrong in the outputSchema I gave you last time. Try this:
>>
>>
>>   public Schema outputSchema(Schema input) {
>>       try {
>>           Schema schema = org.apache.pig.impl.util.Utils.getSchemaFromString(
>>               "topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
>>           return schema;
>>       } catch (Exception e) {
>>           return null;
>>       }
>>   }
>>
>> Daniel
>>
>>
>> Sheeba George wrote:
>>
>>> Hi Daniel
>>>    Thanks for your help. I have created a UDF that aggregates the rest.
>>> It takes a DataBag as input and returns a DataBag with the same schema as
>>> the input. My outputSchema method is below.
>>>
>>> public Schema outputSchema(Schema input) {
>>>     try {
>>>         Schema bagSchema = new Schema();
>>>         bagSchema.add(new Schema.FieldSchema(input.getField(0)));
>>>         return new Schema(new Schema.FieldSchema(
>>>             getSchemaName(this.getClass().getName().toLowerCase(), input),
>>>             bagSchema, DataType.BAG));
>>>     } catch (Exception e) {
>>>         return null;
>>>     }
>>> }
>>>
>>> I am using the UDF in Pig as follows:
>>> topkws = FOREACH kwgroup {
>>>   sorted = ORDER kws BY visits DESC;
>>>   GENERATE FLATTEN(AggregateOthers(sorted));
>>> }
>>>
>>> where AggregateOthers is my UDF. If I DESCRIBE topkws I get
>>> topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,
>>> appid: int,keyword: chararray,searchengine: chararray,visits: long,f2:
>>> long,f3:
>>> long,f4: long,f5: long,f6: long,f7: long,f8: long,visitor: long}}
>>>
>>> com.pig.udfs.topx_sorted is my package name. Not sure what "104" stands
>>> for?
>>>
>>> How do I access each field in topkws? I need to join reportdate,appid and
>>> keyword in topkws with another file.
>>>
>>> Appreciate any help
>>>
>>> thanks
>>> Sheeba
>>>
>>>
>>> On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <[email protected]> wrote:
>>>
>>>
>>>
>>>> LIMIT only takes a constant, so "limit sorted_asc (COUNT(kws) - 5)" does
>>>> not work.
>>>>
>>>> You will need a UDF that returns a DataBag. One example is
>>>> org.apache.pig.builtin.COR. Basically, you can write a UDF like this:
>>>>
>>>> public class BagTest extends EvalFunc<DataBag> {
>>>>  @Override
>>>>  public DataBag exec(Tuple input) throws IOException {
>>>>      DataBag inputDB = (DataBag)input.get(0);
>>>>      DataBag db = new DefaultDataBag();
>>>>      // Construct your db
>>>>      return db;
>>>>  }
>>>>  @Override
>>>>  public Schema outputSchema(Schema input) {
>>>>      return new Schema(new Schema.FieldSchema(
>>>>          getSchemaName(this.getClass().getName().toLowerCase(), input),
>>>>          DataType.BAG));
>>>>  }
>>>> }
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message----- From: Sheeba George
>>>> Sent: Thursday, November 25, 2010 7:01 PM
>>>> To: [email protected]
>>>> Subject: Question on getting TotalCount - X records
>>>>
>>>>
>>>> Hi  all
>>>>
>>>>   I need some help with Pig. The requirement is to generate the topX
>>>> records for a group. I can easily do this in a Pig script by ordering
>>>> by DESC and then limiting at X. If there are more than X records in the
>>>> group, I need to aggregate the rest as a single record. How can I achieve
>>>> this?
>>>>
>>>> I am generating topX as below
>>>>
>>>> kwgroup = GROUP kws BY (type,category);
>>>>
>>>> topkws = FOREACH kwgroup {
>>>>     sorted = ORDER kws BY visits DESC;
>>>>     ltd = LIMIT sorted 5;
>>>>     GENERATE FLATTEN(ltd);
>>>> }
>>>>
>>>> For aggregating the rest, I was thinking of sorting by ASC and
>>>> truncating at (TotalCount – X), then aggregating these records. How can
>>>> I get the TotalCount of records in a group? I tried the following, but
>>>> it fails.
>>>>
>>>> bottomkws = FOREACH kwgroup_cnt_gt_top {
>>>>     sorted_asc = ORDER kws BY visits ASC;
>>>>     ltd_bottom = LIMIT sorted_asc (COUNT(kws) - 5);
>>>>     GENERATE FLATTEN(ltd_bottom);
>>>> }
>>>>
>>>> But this fails with an error message saying that we should use INTEGER
>>>> instead of COUNT(kws).
>>>>
>>>> Is it better to do this using a UDF? In that case the UDF will have to
>>>> sort, limit, and aggregate. Could you point me to some samples that take
>>>> a group of records and return a group (bag)?
>>>>
>>>>
>>>>
>>>> Any help in this regard is appreciated.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>> Sheeba
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> Sheeba Ann George
>
>


-- 
Sheeba Ann George
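
For readers following the thread: the "top X plus one aggregated record for
the rest" logic that the UDF has to implement can be sketched in plain Java,
independent of Pig. This is only an illustration under assumptions. The Row
class, the "others" label, and the method name topXWithOthers are
hypothetical and do not come from the thread, and a real UDF would work on
Pig's DataBag and Tuple types and would have to decide how to fill the
non-numeric fields of the aggregated record.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TopXWithOthers {

    /** Hypothetical row type: a keyword and its visit count. */
    public static final class Row {
        public final String keyword;
        public final long visits;

        public Row(String keyword, long visits) {
            this.keyword = keyword;
            this.visits = visits;
        }
    }

    /**
     * Returns the x rows with the most visits, followed by a single
     * "others" row holding the summed visits of the remaining rows.
     * The "others" row is added only when more than x rows exist.
     */
    public static List<Row> topXWithOthers(List<Row> rows, int x) {
        // Sort a copy so the caller's list is left untouched.
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong((Row r) -> r.visits).reversed());

        // Clamp the cut point to the valid range [0, size].
        int cut = Math.min(Math.max(x, 0), sorted.size());
        List<Row> result = new ArrayList<>(sorted.subList(0, cut));

        if (cut < sorted.size()) {
            long rest = 0;
            for (Row r : sorted.subList(cut, sorted.size())) {
                rest += r.visits;
            }
            // Assumption: the aggregated record is labelled "others".
            result.add(new Row("others", rest));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("a", 5));
        rows.add(new Row("b", 50));
        rows.add(new Row("c", 20));
        rows.add(new Row("d", 1));
        for (Row r : topXWithOthers(rows, 2)) {
            System.out.println(r.keyword + "\t" + r.visits);
        }
    }
}
```

Doing the sort, cut, and sum in one pass over the group avoids needing the
group's total count up front, which is the part that LIMIT could not express
in the script.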
