I have a list of regex patterns that I would like to run against a data set, and if a record matches a particular pattern in the list, tag it with the predefined tag for that pattern. Has this been done, or is it available somewhere? I've not written any UDFs, and although I'm not against doing so, I probably don't have the time to write one at this point.
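For reference, the tagging logic itself is small. Below is a minimal sketch in plain Python, not an existing Pig function: the patterns, the tag names, and the names `TAG_RULES` and `tag_record` are all made up for illustration, and it assumes the first matching pattern should win. Whether something like this can be registered directly as a UDF depends on the Pig version, so that part is left out.

```python
import re

# Hypothetical (pattern, tag) pairs for illustration only.
# Order matters: the first pattern that matches decides the tag.
TAG_RULES = [
    (re.compile(r"\berror\b", re.IGNORECASE), "ERROR"),
    (re.compile(r"\bwarn(ing)?\b", re.IGNORECASE), "WARNING"),
    (re.compile(r"^\d{4}-\d{2}-\d{2}"), "DATED"),
]


def tag_record(text, rules=TAG_RULES, default="UNTAGGED"):
    """Return the tag of the first pattern found in text, else default."""
    for pattern, tag in rules:
        # search() matches anywhere in the string, unlike match(),
        # which only matches at the start.
        if pattern.search(text):
            return tag
    return default


if __name__ == "__main__":
    for line in ["Disk error on node 3", "2010-12-06 job started", "hello"]:
        print(line, "->", tag_record(line))
```

Keeping the rules in an ordered list rather than a dict makes the precedence explicit when a record matches more than one pattern.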
If this isn't available somewhere I can work around this roadblock, but it would be awesome if someone has cooked up this functionality somewhere.

-----Original Message-----
From: Anze [mailto:[email protected]]
Sent: Monday, December 06, 2010 3:09 PM
To: [email protected]
Subject: Re: Easy question...difference between this::form and this.form?

Sorry to hijack your question, Jonathan, but while we are at it... :)
Is there a way to tell Pig NOT to add "base_alias::"? Almost half my code consists of FOREACH... GENERATE that just remove these prefixes.

Thanks,

Anze

On Monday 06 December 2010, Daniel Dai wrote:
> After join, cross, foreach flatten, Pig will automatically add
> "base_alias::" prefix. All other cases use "."
>
> Daniel
>
> Jonathan Coveney wrote:
> > It's very hard to search for this among the docs because it's so generic,
> > so I thought I'd ask... I'm sure the answer is painfully easy.
> >
> > Taking a look at this code that I found online, for example
> >
> > --
> > -- Read in a bag of tuples (timeseries for this example) and divide the
> > -- numeric column by its maximum.
> > --
> > %default DATABAG 'data/timeseries.tsv'
> >
> > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > accumulate = GROUP data ALL;
> > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > MAX(data.count) AS max_count;
> > normalize = FOREACH calc_max GENERATE data::month AS month,
> > data::count AS count, (float)data::count / (float)max_count AS
> > normed_count;
> > DUMP normalize;
> >
> > What purpose does data::month serve versus data.count?
> >
> > Thanks
