Thank you both, Lance and Jeyendran.  I am using a post-processing approach
along the lines of what you've suggested.  I just wanted to be sure there
wasn't some better practice that I was overlooking.
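
In case it helps anyone searching the archives later, the core of what I'm
doing looks roughly like the sketch below (simplified and untested here; the
URL regex, placeholder string, and class name are just illustrative, and it
assumes en-token.bin is on the local path):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class UrlAwareTokenizer {

    // Deliberately naive URL pattern; a real detector would be stricter.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-token.bin")) {
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(in));
            String sentence = "See http://whatever.com for details.";

            // 1. Pre-process: swap each URL for a placeholder that should
            //    come out of the tokenizer as a single token, and remember
            //    the originals.
            Map<String, String> urls = new HashMap<String, String>();
            Matcher m = URL.matcher(sentence);
            StringBuffer rewritten = new StringBuffer();
            int i = 0;
            while (m.find()) {
                String key = "URLPLACEHOLDER" + i++;
                urls.put(key, m.group());
                m.appendReplacement(rewritten, key);
            }
            m.appendTail(rewritten);

            // 2. Tokenize the rewritten sentence with the stock model.
            String[] tokens = tokenizer.tokenize(rewritten.toString());

            // 3. Post-process: put the original URLs back.
            List<String> restored = new ArrayList<String>();
            for (String t : tokens) {
                restored.add(urls.containsKey(t) ? urls.get(t) : t);
            }
            System.out.println(restored);
        }
    }
}

The only thing to watch, as Jeyendran noted, is that the placeholder string
really doesn't occur naturally in your input text.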

Thanks,
Jamey

On Thu, Jul 19, 2012 at 10:42 AM, Jeyendran Balakrishnan <
[email protected]> wrote:

> For your particular use case of detecting URLs, another way is to
> preprocess your sentence with a custom URL regex detector: store the
> detected URLs in a hash map, replace the detected URLs in the sentence
> with their hash keys (or even something like "URL1", "URL2", etc., which
> should not occur naturally in your text), then run it through the OpenNLP
> tokenizer, and finally postprocess the resulting tokens to replace each
> placeholder occurrence with the corresponding URL from the hash map. The
> idea is that the replacement values inserted during preprocessing come out
> of the tokenizer as separate tokens, so they can easily be replaced by
> their corresponding URLs extracted by the regex. Since the tokenizer
> operates per-sentence, the hash map stays small.
> This approach can be used for any regex-based token detector, e.g. for
> emails, decimal numbers, etc.
>
> -Jeyendran
>
>
> -----Original Message-----
> From: Lance Norskog [mailto:[email protected]]
> Sent: Wednesday, July 18, 2012 11:33 PM
> To: [email protected]
> Subject: Re: Augmenting TokenizerME
>
> I would post-process the output, hunt for URLs, and rebuild them.
>
> I believe the statistical models are not fungible. More importantly, these
> are statistical models and have an error rate. You can do a much better job
> by putting the pieces back together after the tokenizer takes them apart.
>
> On Wed, Jul 18, 2012 at 11:34 AM, Jamey Wood <[email protected]> wrote:
> > Is there some way to augment a TokenizerME object without having to
> > start with your own full set of training data?  For example, we run
> > into cases where a TokenizerME with the standard "en-token.bin" data
> > performs mostly well for us, but does not do a good job with inline
> > URLs that are common in the text we're using.  (In most cases, it'll
> > split these up so that "http://whatever.com" becomes something like
> > [ "http", ":", "/", "/", "whatever", "com" ].)
> >
> > Is there some way that we can continue using TokenizerME and the
> > standard "en-token.bin" model, but augment it with our own logic to
> > detect and tokenize URLs?  Or would we need to go all the way down to
> > the model training level and come up with our own replacement for
> > en-token.bin?
> >
> > Thanks,
> > Jamey
>
>
>
> --
> Lance Norskog
> [email protected]
>
>
