Perhaps you could write your own AdaptiveFeatureGenerator implementation. I think this would let you attach your own rule-based features to the tokens. The interface is in the opennlp.tools.util.featuregen package. Take a look at some of its current implementations.

Hope this helps,
MG
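A minimal sketch of that idea, in plain Java so it runs without OpenNLP on the classpath. The class name PhoneTokenClassFeatures, the "ptc=" feature prefix and the class labels are made up for illustration; none of them come from OpenNLP. The createFeatures method mirrors the signature of AdaptiveFeatureGenerator.createFeatures, so for real use you would presumably have the class implement that interface (or extend FeatureGeneratorAdapter from the same package) and add an instance to your AggregatedFeatureGenerator next to the default generators.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not an OpenNLP class. In a real project this would
// implement opennlp.tools.util.featuregen.AdaptiveFeatureGenerator.
final class PhoneTokenClassFeatures {

    // Maps a token to a digit-count class: "5d".."8d" for tokens of five to
    // eight ASCII digits, "gt8d" for more than eight, "other" for the rest
    // (shorter digit runs are left to the built-in token classes).
    static String tokenClass(String token) {
        if (token.isEmpty()) {
            return "other";
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return "other";
            }
        }
        int n = token.length();
        if (n > 8) {
            return "gt8d";
        }
        if (n >= 5) {
            return n + "d";
        }
        return "other";
    }

    // Same parameter list as AdaptiveFeatureGenerator.createFeatures:
    // appends one feature string for the token at the given index.
    public void createFeatures(List<String> features, String[] tokens,
                               int index, String[] previousOutcomes) {
        features.add("ptc=" + tokenClass(tokens[index]));
    }

    public static void main(String[] args) {
        String[] tokens = {"call", "5551234", "now"};
        PhoneTokenClassFeatures generator = new PhoneTokenClassFeatures();
        for (int i = 0; i < tokens.length; i++) {
            List<String> features = new ArrayList<>();
            generator.createFeatures(features, tokens, i, new String[0]);
            System.out.println(tokens[i] + " -> " + features);
        }
    }
}
```

For the token "5551234" this adds the single feature ptc=7d, and for "call" it adds ptc=other. The obfuscated formats you mention would need extra classes or some normalization before the digit check.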
On Thu, May 22, 2014 at 1:25 PM, Stuart Robinson <[email protected]> wrote:

> Hi, Mark. Thanks for your suggestion. My initial approach was to use
> regular expressions, but I'm looking at social media and there is a lot
> more variation in the formatting of phone numbers than you would expect
> (as well as various kinds of obfuscation). So I think a named entity
> recognizer will ultimately be more robust. Hence my interest in custom
> token classes.
>
> Best,
> Stuart
>
> On Wed, May 21, 2014 at 6:09 PM, Mark Giaconia <[email protected]> wrote:
>
> > Sounds like you could use a RegexNameFinder since these patterns are so
> > well defined with a set of rules.
> >
> > On May 21, 2014, at 7:43 PM, Stuart Robinson <[email protected]> wrote:
> >
> > > Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
> > > isn't a pre-existing model. I've been training my own and have gotten
> > > pretty decent results so far with the simple tokenizer and
> > > out-of-the-box features, but I'd now like to improve the features that
> > > it's training on. In particular, I'd like to define some token classes
> > > that are specific to the domain of phone numbers. From what I've read
> > > so far (e.g., in Taming Text), the out-of-the-box token classes are:
> > >
> > > 1. token is lowercase alphabetic
> > > 2. token is two digits
> > > 3. token is four digits
> > > 4. token contains a number and a letter
> > > 5. token contains a number and a hyphen
> > > 6. token contains a number and a backslash
> > > 7. token contains a number and a comma
> > > 8. token contains a number and a period
> > > 9. token contains a number
> > > 10. token is all caps, single letter
> > > 11. token is all caps, multiple letters
> > > 12. token's initial letters are caps
> > > 13. other
> > >
> > > I'd like to be able to define features like the following:
> > >
> > > a. token is five digits
> > > b. token is six digits
> > > c. token is seven digits
> > > d. token is eight digits
> > > e. token is greater than eight digits
> > > etc.
> > >
> > > I know that you can override features when calling NameFinderME.train
> > > by passing in your own AggregatedFeatureGenerator object, but it's not
> > > clear how an individual feature generator could use custom token
> > > classes. Pointers to the appropriate entry point in the code (and any
> > > other suggestions or advice) would be greatly appreciated.
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > Stuart
