Re: custom token classes for NER model training

Mark G Thu, 22 May 2014 14:06:07 -0700

Perhaps you could write your own AdaptiveFeatureGenerator implementation. I
think this would allow you to add your features to the tokens with your
rules. It is in the opennlp.tools.util.featuregen package. Take a look at
some of its current implementations. Hope this helps.
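
To make that concrete, here is a minimal sketch of the token-class logic for
the digit-length features you listed. The class name, the "dl=" feature
prefix, and the bucket labels are all made up for illustration. The class
deliberately has no OpenNLP dependency so it compiles on its own; only the
createFeatures signature is meant to mirror AdaptiveFeatureGenerator.

```java
import java.util.List;

/**
 * Hypothetical sketch (not part of OpenNLP): emits one feature per
 * all-digit token, bucketed by the number of digits.
 */
class DigitLengthFeatureGenerator {

    // Assumed feature prefix; any string unique among your features works.
    static final String PREFIX = "dl=";

    /** Returns the digit-length class, or null if the token is not all ASCII digits. */
    static String digitLengthClass(String token) {
        if (token.isEmpty()) {
            return null;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return null;
            }
        }
        int n = token.length();
        // Lengths up to eight get their own class; anything longer shares one.
        return n > 8 ? "gt8" : Integer.toString(n);
    }

    /** Same parameter list as AdaptiveFeatureGenerator.createFeatures. */
    public void createFeatures(List<String> features, String[] tokens,
                               int index, String[] previousOutcomes) {
        String cls = digitLengthClass(tokens[index]);
        if (cls != null) {
            features.add(PREFIX + cls);
        }
    }
}
```

To use it for real, you would declare the class as implementing
AdaptiveFeatureGenerator (depending on your OpenNLP version you may also need
to supply updateAdaptiveData and clearAdaptiveData, which can be empty here)
and add an instance to the AggregatedFeatureGenerator you pass to training,
alongside the default generators. I have not run it against a model, so treat
it as a starting point.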
MG



On Thu, May 22, 2014 at 1:25 PM, Stuart Robinson <[email protected]> wrote:

> Hi, Mark. Thanks for your suggestion. My initial approach was to use
> regular expressions, but I'm looking at social media and there is a lot
> more variation in the formatting of phone numbers than you would expect (as
> well as various kinds of obfuscation). So I think a named entity recognizer
> will ultimately be more robust. Hence my interest in custom token classes.
>
> Best,
> Stuart
>
>
> On Wed, May 21, 2014 at 6:09 PM, Mark Giaconia <[email protected]> wrote:
>
> >
> >
> > Sounds like you could use a RegexNameFinder, since these patterns are so
> > well defined by a set of rules.
> >
> > > On May 21, 2014, at 7:43 PM, Stuart Robinson <[email protected]> wrote:
> > >
> > > > Hi, all. I'm using OpenNLP to recognize phone numbers, for which there
> > > > isn't a pre-existing model. I've been training my own and have gotten
> > > > pretty decent results so far with the simple tokenizer and out-of-the-box
> > > > features, but I'd now like to improve the features that it's training on.
> > > > In particular, I'd like to define some token classes that are specific to
> > > > the domain of phone numbers. From what I've read so far (e.g., in Taming
> > > > Text), the out-of-the-box token classes are:
> > >
> > > 1. token is lowercase alphabetic
> > > 2. token is two digits
> > > 3. token is four digits
> > > 4. token contains a number and a letter
> > > 5. token contains a number and a hyphen
> > > > 6. token contains a number and a backslash
> > > 7. token contains a number and a comma
> > > 8. token contains a number and a period
> > > > 9. token contains a number
> > > 10. token is all caps, single letter
> > > 11. token is all caps, multiple letters
> > > 12. token's initial letters are caps
> > > 13. other
> > >
> > > > I'd like to be able to define features like the following:
> > >
> > > a. token is five digits
> > > b. token is six digits
> > > c. token is seven digits
> > > d. token is eight digits
> > > e. token is greater than eight digits
> > > etc.
> > >
> > > > I know that you can override features when calling NameFinderME.train by
> > > > passing in your own AggregatedFeatureGenerator object, but it's not clear
> > > > how an individual feature generator could use custom token classes.
> > > Pointers to the appropriate entry point in the code (and any other
> > > suggestions or advice) would be greatly appreciated.
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > Stuart
> >
>
