Your suggestion helped me do the join. However, I want the output schema to be generic, in fact identical to the input schema, because this UDF will be shared by different inputs. How do I do that?

Thanks,
Sheeba

On Tue, Dec 14, 2010 at 12:43 AM, Sheeba George <[email protected]> wrote:

> Hi Daniel
> Is it possible to get the schema string from the "input" param rather than
> hardcoding it?
> Thanks
> Sheeba
>
> On Mon, Dec 13, 2010 at 11:53 PM, Daniel Dai <[email protected]> wrote:
>
>> There is something wrong in the outputSchema I gave you last time, try this:
>>
>> public Schema outputSchema(Schema input) {
>>     try {
>>         Schema schema =
>>             org.apache.pig.impl.util.Utils.getSchemaFromString("topx_sorted:bag{t:tuple(reportdate:chararray,appid:int,keyword:chararray,searchengine:chararray,visits:long, etc....)}");
>>         return schema;
>>     } catch (Exception e) {
>>         return null;
>>     }
>> }
>>
>> Daniel
>>
>> Sheeba George wrote:
>>
>>> Hi Daniel
>>> Thanks for your help ... I have created a UDF that aggregates the rest.
>>> So my UDF takes a DataBag as input and returns a DataBag that has the
>>> same schema as the input. My outputSchema method is as below.
>>>
>>> public Schema outputSchema(Schema input) {
>>>     try {
>>>         Schema bagSchema = new Schema();
>>>         bagSchema.add(new Schema.FieldSchema(input.getField(0)));
>>>         return new Schema(new
>>>             Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(),
>>>             input),
>>>             bagSchema, DataType.BAG));
>>>     } catch (Exception e) {
>>>         return null;
>>>     }
>>> }
>>>
>>> I am using the UDF in PIG as
>>>
>>> topkws = FOREACH kwgroup {
>>>     sorted = ORDER kws BY visits DESC;
>>>     GENERATE FLATTEN(AggregateOthers(sorted));}
>>>
>>> where AggregateOthers is my UDF. If I DESCRIBE topkws I get
>>>
>>> topkws: {com.pig.udfs.topx_sorted_104::sorted: {reportdate: chararray,
>>> appid: int,keyword: chararray,searchengine: chararray,visits: long,f2:
>>> long,f3: long,f4: long,f5: long,f6: long,f7: long,f8: long,visitor: long}}
>>>
>>> com.pig.udfs.topx_sorted is my package name. Not sure what "104" stands
>>> for?
>>>
>>> How do I access each field in topkws? I need to join reportdate, appid and
>>> keyword in topkws with another file.
>>>
>>> Appreciate any help
>>>
>>> thanks
>>> Sheeba
>>>
>>> On Sun, Nov 28, 2010 at 2:07 AM, Daniel Dai <[email protected]> wrote:
>>>
>>>> Limit only takes a constant. So "limit sorted_asc (COUNT(kws) - 5)" does
>>>> not work.
>>>>
>>>> You will need a UDF, which returns DataBag. One example is
>>>> org.apache.pig.builtin.COR, which returns DataBag. Basically, you can
>>>> write a UDF like this:
>>>>
>>>> public class BagTest extends EvalFunc<DataBag> {
>>>>     @Override
>>>>     public DataBag exec(Tuple input) throws IOException {
>>>>         DataBag inputDB = (DataBag)input.get(0);
>>>>         DataBag db = new DefaultDataBag();
>>>>         // Construct your db
>>>>         return db;
>>>>     }
>>>>     @Override
>>>>     public Schema outputSchema(Schema input) {
>>>>         return new Schema(new
>>>>             Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(),
>>>>             input), DataType.BAG));
>>>>     }
>>>> }
>>>>
>>>> Daniel
>>>>
>>>> -----Original Message-----
>>>> From: Sheeba George
>>>> Sent: Thursday, November 25, 2010 7:01 PM
>>>> To: [email protected]
>>>> Subject: Question on getting TotalCount - X records
>>>>
>>>> Hi all
>>>>
>>>> I need some help with PIG. The requirement is to generate the topX
>>>> records for a group. I can easily do this using a PIG script where I can
>>>> order by DESC and then limit at X. If there are more than X records in
>>>> the group, I need to aggregate the rest as a single record. How can I
>>>> achieve this?
>>>>
>>>> I am generating topX as below
>>>>
>>>> kwgroup = GROUP kws BY (type,category);
>>>>
>>>> topkws = FOREACH kwgroup {
>>>>     sorted = ORDER kws BY visits DESC;
>>>>     ltd = limit sorted 5;
>>>>     GENERATE FLATTEN(ltd);}
>>>>
>>>> For aggregating the rest, I was thinking of sorting by ASC and
>>>> truncating at (TotalCount – X), then aggregating these records. How can
>>>> I get the TotalCount of records in a group? I tried the below, but it
>>>> fails.
>>>>
>>>> bottomkws = FOREACH kwgroup_cnt_gt_top {
>>>>     sorted_asc = ORDER kws BY visits ASC;
>>>>     ltd_bottom = limit sorted_asc (COUNT(kws) - 5) ;
>>>>     GENERATE FLATTEN(ltd_bottom);}
>>>>
>>>> But this fails with the error message that we should use INTEGER instead
>>>> of COUNT(kws).
>>>>
>>>> Is it better to do this using a UDF? In that case the UDF will have to
>>>> sort, limit, and aggregate. Could you point to some samples that take a
>>>> group of records and return a group (bag)?
>>>>
>>>> Any help in this regard is appreciated.
>>>>
>>>> Thanks
>>>>
>>>> Sheeba
>
> --
> Sheeba Ann George

--
Sheeba Ann George
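
For reference, the "top X plus one aggregated record for the rest" logic discussed in this thread can be sketched in plain Java, without any Pig classes, so that it can be checked in isolation before being moved into a UDF's exec method. This is only an illustration under assumptions: the class TopXWithOthers, the Row type, the method topXPlusOthers and the "others" label are hypothetical names that do not come from the thread, and only the visits column is summed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TopXWithOthers {

    // Hypothetical stand-in for one record of a group; the real tuples in
    // the thread carry more fields (reportdate, appid, searchengine, ...).
    static final class Row {
        final String keyword;
        final long visits;

        Row(String keyword, long visits) {
            this.keyword = keyword;
            this.visits = visits;
        }
    }

    // Keeps the x rows with the most visits and folds all remaining rows
    // into one extra row whose visits is the sum of the rest.
    static List<Row> topXPlusOthers(List<Row> group, int x) {
        List<Row> sorted = new ArrayList<>(group);
        // Sort a copy by visits, descending (same idea as ORDER ... DESC).
        sorted.sort((a, b) -> Long.compare(b.visits, a.visits));

        if (sorted.size() <= x) {
            return sorted; // nothing left over to aggregate
        }

        List<Row> result = new ArrayList<>(sorted.subList(0, x));
        long restVisits = 0;
        for (Row r : sorted.subList(x, sorted.size())) {
            restVisits += r.visits;
        }
        // "others" is an assumed label for the aggregated record.
        result.add(new Row("others", restVisits));
        return result;
    }

    public static void main(String[] args) {
        List<Row> group = Arrays.asList(
                new Row("a", 10), new Row("b", 50),
                new Row("c", 30), new Row("d", 5));
        for (Row r : topXPlusOthers(group, 2)) {
            System.out.println(r.keyword + "\t" + r.visits);
        }
    }
}
```

Doing the sort, the cut at X and the aggregation in one pass over the group avoids needing COUNT(kws) inside LIMIT, which is the step that failed in the original script.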
