RE: Regex Match Tagger UDF?

Brian Adams Mon, 06 Dec 2010 13:56:15 -0800

No problem.
Sounds good. And no worries about messy code. We are all well aware that code 
often lacks elegance when you are just trying to get it out the door.
-----Original Message-----
From: Zach Bailey [mailto:[email protected]] 
Sent: Monday, December 06, 2010 4:46 PM
To: [email protected]
Subject: Re: Regex Match Tagger UDF?



 Great. Let me clean up the code a bit and I'd be happy to post it. I'm 
definitely open to some alternatives in terms of how this UDF would be 
initialized, whether it is via a file sitting on HDFS, etc. The current 
initialization scheme is admittedly crude but was simple to code and works for 
us for now.

Cheers,
Zach


On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:

> That is an interesting approach. I like it. Not ideal, but I think it could 
> work for what I am doing.
> 
> In general I think that is useful to the community and you should github it. 
> By all means, I would love to use this.
> 
> I think I could extend/fork this for my need.
> 
> Thank you Zach!
> 
> -----Original Message-----
> From: Zach Bailey [mailto:[email protected]]
> Sent: Monday, December 06, 2010 3:38 PM
> To: [email protected]
> Subject: Re: Regex Match Tagger UDF?
> 
> 
>  Does the UDF have to support regular expressions? If not, I have adapted the 
> Aho-Corasick algorithm [1] to do something similar to what you're asking for. 
> It works as follows:
> 
> 
> 1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and 
> a result to output when that token is found:
> 
> 
> define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');
> 
> 
> 2.) Apply AC_MATCHER to each tuple:
> 
> 
> strings = LOAD 'myfile.txt' as (string:chararray);
> tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;
> 
> 
> The tagged_strings will then contain the original line along with a bag of 
> matches. For instance if we had the following in myfile.txt:
> 
> 
> terrier parakeet
> hello
> goodbye
> tabby
> pit bull
> 
> 
> after running the commands in #2 tagged_strings would look like (pardon the 
> ad-hoc notation):
> 
> 
> { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
> { string: 'hello', tags: {} }
> { string: 'goodbye', tags: {} }
> { string: 'tabby', tags: { 'cats' } }
> { string: 'pit bull', tags: { 'dogs' } }
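
For anyone following along, here is a rough Python sketch of the behaviour described above. It is not the UDF itself (that code has not been posted), only an illustration of Aho-Corasick tagging under a few assumptions: the spec format is inferred from the DEFINE example, the function names parse_spec, build_automaton and tag_string are made up for this sketch, and matching is plain substring matching with no word-boundary handling.

```python
from collections import deque


def parse_spec(spec):
    """Parse 'tag=[tok1|tok2];tag2=[tok3]' into a {token: tag} dict."""
    token_to_tag = {}
    for group in spec.split(";"):
        if not group.strip():
            continue
        tag, _, body = group.partition("=")
        for token in body.strip().strip("[]").split("|"):
            if token.strip():  # skip empty tokens so nothing matches everywhere
                token_to_tag[token.strip()] = tag.strip()
    return token_to_tag


def build_automaton(token_to_tag):
    """Build the trie, failure links and per-node tag sets."""
    goto, fail, out = [{}], [0], [set()]
    for token, tag in token_to_tag.items():
        node = 0
        for ch in token:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].add(tag)
    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit tags of tokens that end at the failure target.
            out[child] |= out[fail[child]]
            queue.append(child)
    return goto, fail, out


def tag_string(text, automaton):
    """Return the set of tags whose tokens occur anywhere in text."""
    goto, fail, out = automaton
    node, tags = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        tags |= out[node]
    return tags


spec = ("dogs=[terrier|retriever|pit bull];"
        "cats=[tabby|mainecoon|tuxedo];"
        "birds=[parakeet|parrot|cuckoo]")
matcher = build_automaton(parse_spec(spec))
for line in ["terrier parakeet", "hello", "goodbye", "tabby", "pit bull"]:
    print(line, sorted(tag_string(line, matcher)))
```

The automaton is built once and then each line is scanned in a single pass, which is the main reason to prefer this over testing every token against every line.
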
> 
> 
> If this is something you'd be interested in using/extending I can put it up on 
> github for your forking pleasure.
> 
> Cheers,
> Zach
> 
> 
> On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:
> 
> 
> >  I have a list of regex patterns that I would like to run against a 
> > data set; if a record matches a particular pattern in the list, I want 
> > to tag it with the predefined tag for that pattern.
> >  Has this been done, or is it available somewhere? 
> >  I've not written any UDFs, and although I'm not against doing so, 
> > I probably don't have the time to write one at this point.
> > 
> >  If this isn't available somewhere I can work around this roadblock,  
> > but it would be awesome if someone has cooked up this functionality  
> > somewhere.
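
To make the request concrete, a minimal Python sketch of the pattern-to-tag idea follows. It is only an illustration, not a Pig UDF: the patterns, the tags and the names RULES and tag_record are hypothetical, since the thread does not list the actual patterns.

```python
import re

# Hypothetical (pattern, tag) pairs; the real list is not given in the thread.
RULES = [
    (re.compile(r"\b(terrier|retriever)s?\b", re.IGNORECASE), "dogs"),
    (re.compile(r"\btabb(y|ies)\b", re.IGNORECASE), "cats"),
    (re.compile(r"\bpar(akeet|rot)s?\b", re.IGNORECASE), "birds"),
]


def tag_record(record, rules=RULES):
    """Return the tag of every rule whose pattern matches anywhere in record."""
    return [tag for pattern, tag in rules if pattern.search(record)]


for record in ["Two Terriers and a parrot", "hello", "tabbies"]:
    print(record, tag_record(record))
```

Patterns are compiled once up front; each record is then tested against every rule, so the cost grows with the number of patterns.
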
> > 
> >  -----Original Message-----
> >  From: Anze [mailto:[email protected]]
> >  Sent: Monday, December 06, 2010 3:09 PM
> >  To: [email protected]
> >  Subject: Re: Easy question...difference between this::form and this.form?
> > 
> > 
> >  Sorry to hijack your question, Jonathan, but while we are at it... :)
> > 
> >  Is there a way to tell Pig NOT to add "base_alias::"? Almost half 
> > my  code consists of FOREACH... GENERATE that just remove these prefixes.
> > 
> >  Thanks,
> > 
> >  Anze
> > 
> >  On Monday 06 December 2010, Daniel Dai wrote:
> > 
> > > After join, cross, foreach flatten, Pig will automatically add 
> > > "base_alias::" prefix. All other cases use "."
> > > 
> > > Daniel
> > > 
> > > Jonathan Coveney wrote:
> > > > > It's very hard to search for this among the docs because it's so generic,
> > > > > so I thought I'd ask... I'm sure the answer is painfully easy.
> > > > 
> > > > Taking a look at this code that I found online, for example
> > > > 
> > > > --
> > > > > -- Read in a bag of tuples (timeseries for this example) and divide the
> > > > > -- numeric column by its maximum.
> > > > --
> > > > %default DATABAG 'data/timeseries.tsv'
> > > > 
> > > > > data = LOAD '$DATABAG' AS (month:chararray, count:int);
> > > > > accumulate = GROUP data ALL;
> > > > > calc_max = FOREACH accumulate GENERATE FLATTEN(data),
> > > > >     MAX(data.count) AS max_count;
> > > > > normalize = FOREACH calc_max GENERATE data::month AS month,
> > > > >     data::count AS count,
> > > > >     (float)data::count / (float)max_count AS normed_count;
> > > > > DUMP normalize;
> > > > 
> > > > What purpose does data::month serve versus data.count?
> > > > 
> > > > Thanks
> > > 
> > > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
>
