Here you go:
https://github.com/znbailey/Dataclip-Piggybank

The UDF you'll be interested in is here:
https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java

I would recommend grabbing the entire repo, as that UDF depends on the
repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick.

Enjoy,
Zach

On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:
> No problem.
> Sounds good. And no worry about messy code. We are all well aware that code
> often lacks elegance when you are just trying to get it out the door.
>
> -----Original Message-----
> From: Zach Bailey [mailto:[email protected]]
> Sent: Monday, December 06, 2010 4:46 PM
> To: [email protected]
> Subject: Re: Regex Match Tagger UDF?
>
> Great. Let me clean up the code a bit and I'd be happy to post it. I'm
> definitely open to some alternatives in terms of how this UDF would be
> initialized, whether it is via a file sitting on HDFS, etc. The current
> initialization scheme is admittedly crude but was simple to code and works
> for us for now.
>
> Cheers,
> Zach
>
> On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:
> > That is an interesting approach. I like it. Not ideal, but I think it
> > could work for what I am doing.
> >
> > In general I think that is useful to the community and you should put it
> > on GitHub. By all means, I would love to use this.
> >
> > I think I could extend/fork this for my need.
> >
> > Thank you Zach!
> >
> > -----Original Message-----
> > From: Zach Bailey [mailto:[email protected]]
> > Sent: Monday, December 06, 2010 3:38 PM
> > To: [email protected]
> > Subject: Re: Regex Match Tagger UDF?
> >
> > Does the UDF have to support regular expressions? If not, I have adapted
> > the Aho-Corasick algorithm [1] to do something similar to what you're
> > asking for. It works as follows:
> >
> > 1.) Initialize the Aho-Corasick UDF with a list of tokens to search for,
> > and a result to output when that token is found:
> >
> > define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> >
> > 2.) Apply the AC_MATCHER to a tuple:
> >
> > strings = LOAD 'myfile.txt' as (string:chararray);
> > tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> >
> > The tagged_strings will then contain the original line along with a bag of
> > matches. For instance, if we had the following in myfile.txt:
> >
> > terrier parakeet
> > hello
> > goodbye
> > tabby
> > pit bull
> >
> > after running the commands in #2, tagged_strings would look like (pardon
> > the ad-hoc notation):
> >
> > { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> > { string: 'hello', tags: {} }
> > { string: 'goodbye', tags: {} }
> > { string: 'tabby', tags: { 'cats' } }
> > { string: 'pit bull', tags: { 'dogs' } }
> >
> > If this is something you'd be interested in using/extending, I can put it
> > up on github for your forking pleasure.
> >
> > Cheers,
> > Zach
> >
> > On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> > > I have a list of regex patterns that I would like to run against a
> > > data set, and if a record matches a particular pattern in the list, tag
> > > it with the predefined tag for that pattern.
> > > Has this been done, or is it available somewhere?
> > > I've not written any UDFs, and although I'm not against doing so,
> > > I probably don't have the time to write one at this point.
> > >
> > > If this isn't available somewhere I can work around this roadblock,
> > > but it would be awesome if someone has cooked up this functionality
> > > somewhere.
> > >
> > > -----Original Message-----
> > > From: Anze [mailto:[email protected]]
> > > Sent: Monday, December 06, 2010 3:09 PM
> > > To: [email protected]
> > > Subject: Re: Easy question...difference between this::form and this.form?
> > >
> > > Sorry to hijack your question, Jonathan, but while we are at it... :)
> > >
> > > Is there a way to tell Pig NOT to add "base_alias::"? Almost half
> > > my code consists of FOREACH... GENERATE that just remove these prefixes.
> > >
> > > Thanks,
> > >
> > > Anze
> > >
> > > On Monday 06 December 2010, Daniel Dai wrote:
> > > > After join, cross, foreach flatten, Pig will automatically add
> > > > "base_alias::" prefix. All other cases use "."
> > > >
> > > > Daniel
> > > >
> > > > Jonathan Coveney wrote:
> > > > > It's very hard to search for this among the docs because it's so
> > > > > generic, so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > >
> > > > > Taking a look at this code that I found online, for example
> > > > >
> > > > > --
> > > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > > -- numeric column by its maximum.
> > > > > --
> > > > > %default DATABAG 'data/timeseries.tsv'
> > > > >
> > > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > accumulate = GROUP data ALL;
> > > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > > >     MAX(data.count) AS max_count;
> > > > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > > >     data::count AS count,
> > > > >     (float)data::count / (float)max_count AS normed_count;
> > > > > DUMP normalize;
> > > > >
> > > > > What purpose does data::month serve versus data.count?
> > > > >
> > > > > Thanks
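
For anyone reading the archive who wants to see the token-tagging idea from the top of this thread without pulling the repo, here is a minimal Java sketch of Aho-Corasick matching driven by the same 'tag=[tok|tok];...' spec string. It is not the code from Dataclip-Piggybank: the class name TokenTagger and its methods are made up for illustration, the spec parsing is only a guess based on the example in the thread, and there is no input validation. It matches case-sensitive substrings, so 'tabbycat' would also be tagged 'cats'.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for the UDF's matching logic; not the code from the repo. */
class TokenTagger {
    /** One trie node: outgoing edges, failure link, and tags emitted on arrival. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;
        final Set<String> tags = new TreeSet<>();
    }

    private final Node root = new Node();

    /** Spec format assumed from the thread: "tag=[tok|tok];tag=[tok|tok]". */
    TokenTagger(String spec) {
        for (String entry : spec.split(";")) {
            int eq = entry.indexOf('=');
            String tag = entry.substring(0, eq).trim();
            // Strip the surrounding brackets, then split on the literal '|'.
            String body = entry.substring(eq + 1).trim();
            body = body.substring(1, body.length() - 1);
            for (String token : body.split("\\|")) {
                insert(token, tag);
            }
        }
        buildFailureLinks();
    }

    /** Adds one token to the trie and records its tag on the final node. */
    private void insert(String token, String tag) {
        Node node = root;
        for (char c : token.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.tags.add(tag);
    }

    /** Breadth-first pass that sets each node's failure link and inherits its tags. */
    private void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                child.tags.addAll(child.fail.tags);
                queue.add(child);
            }
        }
    }

    /** Returns the sorted set of tags whose tokens occur anywhere in the text. */
    Set<String> tag(String text) {
        Set<String> found = new TreeSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.tags);
        }
        return found;
    }
}
```

Constructing new TokenTagger(...) with the spec string from step 1 and calling tag(...) on each line of the sample file should give the same tag sets as the ad-hoc output in the thread. A real Pig UDF would wrap this in an EvalFunc and return a bag rather than a Set; that wiring is left out here.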
